Class Notes:

Sequence Similarity: a quantification of the degree to which two sequences match one another. To determine sequence similarity, the two sequences must be aligned and a score determined for the match.

We assume that similar sequence implies similar function

Use similar sequence to support common ancestry: Homology

Homologs: Genes or proteins derived from a common ancestor, identified based on a level of sequence similarity followed by further analysis. This is a yes/no thing (there is no % homology)

Paralogs: homologous genes produced by gene duplication events. The two copies come from the same ancestral gene

Orthologs: Homologous genes produced by speciation

Why care?

  • To infer evolutionary relationships
  • To annotate genome
  • To predict the structure of a protein
  • To conduct medical research

As a bioinformatician you approach the search for homology by sequence analysis

The General Process

  1. Conduct multiple pairwise local alignments
  2. Evaluate the quality of alignments by calculating alignment scores. Alignment scoring uses algorithms based on an understanding of evolution
  3. Do statistical analysis of the alignment scores factoring in the length of the QUERY sequence, and the size of the “search space” (database queried against, which contains potential matches/subjects)
  4. Evaluate the results using statistical and biological knowledge

What is BLAST?

  • Like Google is to the internet: a search tool, here for sequence databases
  • The Basic Local Alignment Search Tool

BLAST conducts pairwise local alignments

Local Alignment: looks for similarity in discrete regions of the sequences

Global Alignment: looks across the entire sequence for similarity and forces a fit over the whole length, so it has gaps

How BLAST works

  • Split the QUERY sequence into “words” of defined length (W). The number of possible words will be L - W + 1, where L is the sequence length
  • Determine all possible one-letter variations of each word in the list and SCORE how those would align to the QUERY sequence using a substitution matrix
  • About 50 matches (new words) are kept for each of the words generated
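
The word-list step above can be sketched roughly as follows. This is only an illustration: the scoring rule (+5 for identical letters, -4 otherwise) stands in for a real substitution matrix, the threshold of 6 is a made-up value chosen so that only one-letter variations survive, and the function names are hypothetical rather than anything BLAST itself exposes.

```python
from itertools import product


def toy_score(a, b):
    """Toy substitution score (an assumption, not BLOSUM62)."""
    return 5 if a == b else -4


def query_words(query, w):
    """Return every overlapping word of length w; there are L - w + 1 of them."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, alphabet, threshold, score=toy_score):
    """Return (candidate, score) for same-length words scoring >= threshold."""
    neighbors = []
    for letters in product(alphabet, repeat=len(word)):
        candidate = "".join(letters)
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            neighbors.append((candidate, total))
    # Highest-scoring candidates first.
    return sorted(neighbors, key=lambda pair: -pair[1])


words = query_words("ACGTAC", 3)
print(words)  # 4 words, since L - W + 1 = 6 - 3 + 1
print(neighborhood("ACG", "ACGT", threshold=6))
```

With these toy scores an exact copy of a 3-letter word scores 15 and a one-letter variation scores 6, so the threshold keeps the word itself plus its nine one-letter variations.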

Building a set of data, before it goes into the database

Next, the Query word list is compared against all sequences in the chosen database. Wherever a word exactly matches part of a subject sequence, that subject sequence is “pulled out” for further analysis
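
A small sketch of this lookup step, assuming the database is just a Python dict of made-up subject sequences. The names index_words and find_seeds are hypothetical, and a real BLAST search uses far more efficient data structures.

```python
from collections import defaultdict


def index_words(sequence, w):
    """Map each word of length w to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - w + 1):
        index[sequence[i:i + w]].append(i)
    return index


def find_seeds(query, database, w):
    """Return (subject_id, query_pos, subject_pos) for every exact word match."""
    query_index = index_words(query, w)
    seeds = []
    for subject_id, subject in database.items():
        for s_pos in range(len(subject) - w + 1):
            word = subject[s_pos:s_pos + w]
            for q_pos in query_index.get(word, []):
                seeds.append((subject_id, q_pos, s_pos))
    return seeds


# Toy database (invented sequences, for illustration only).
database = {"subj1": "TTACGTT", "subj2": "GGGGGGG"}
print(find_seeds("ACGA", database, 3))  # only subj1 shares the word ACG
```

Subjects with no matching word, like subj2 here, are never looked at again, which is where most of the speed comes from.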

-You compare your sequence to everything in the database, so there is always some possibility of getting a match

-The bigger the database, the greater the chance of getting a match

-The longer the query, the lower the chance of a random match, and vice versa

-Align then score the quality of match between sequences

How do you know if a word matches a subject?

-Alignment and scoring-Substitution matrices and scoring alignments

The S cutoff

  • BLAST next randomly “grabs” a set of sequences from the databases that are of the same length as the Query regions that were just defined
  • It aligns the Query against each of these random sequences and calculates a score for each
  • From these scores it determines a representative score, the kind one would typically see for an alignment that was simply random, and chooses an S value that is GREATER
  • The E value is the number of distinct alignments, with a score equivalent to or better than S, that are expected to occur in a database search by chance
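
For reference, the E value is commonly computed with the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where m is the query length and n is the total database length. The sketch below assumes that formula; the default lambda and K are roughly the values often quoted for ungapped BLOSUM62 scoring and are used here only as placeholders, and the function name is hypothetical.

```python
import math


def expected_hits(score, query_len, db_len, lam=0.318, k=0.134):
    """E = K * m * n * exp(-lambda * S); lam and k are placeholder values."""
    return k * query_len * db_len * math.exp(-lam * score)


# Same score and query, two database sizes: the larger search space
# gives more chance alignments, so E goes up.
print(expected_hits(50, 100, 10**6))
print(expected_hits(50, 100, 10**7))

# Same search space, higher score: fewer chance alignments, so E goes down.
print(expected_hits(60, 100, 10**6))
```

A lower E value means the score is less likely to be the result of chance alone.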


Online Reading:

“The Basic Local Alignment Search Tool”

-Similarity searching and sequence comparison is a tool used by many biologists

-performs comparisons between pairs of sequences, searching for regions of local similarity

-NCBI BLAST and WU-BLAST share common ancestry

Sequence identity- the occurrence of exactly the same nucleotide or amino acid in the same position in aligned sequences
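
A minimal sketch of how percent identity could be computed from two already-aligned sequences. It assumes gaps are written as "-" and that the denominator is the full alignment length; other conventions exist, and the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same residue in both sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_a)
    if columns == 0:
        return 0.0
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * identical / columns


# 4 identical columns out of 6 (one mismatch, one gap).
print(percent_identity("ACGT-A", "ACCTGA"))
```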

Sequence homology- when quoted as a value, it indicates identity and/or similarity and does not always reflect an evolutionary relationship

-A proper alignment must be determined to find the similarity of two sequences. This requires a means of scoring matches and mismatches, a means of scoring gaps, and a method of using the two to evaluate numerous possible alignments

-Evaluating alignments requires a scoring matrix, a table of values that describe the probability of a biologically meaningful amino acid or nucleotide residue pair occurring in an alignment

-In the simplest scheme, all matches get the same score (+1 or +5) and all mismatches get the same penalty (-1 or -4)
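
As a quick illustration of that simple scheme, the sketch below scores an ungapped alignment with a flat match score and a flat mismatch penalty. The sequences are invented and the function name is hypothetical.

```python
def ungapped_score(a, b, match=1, mismatch=-1):
    """Sum a flat match/mismatch score over two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


# 5 matches and 1 mismatch.
print(ungapped_score("ACGTAC", "ACCTAC"))                        # 5 - 1 = 4
print(ungapped_score("ACGTAC", "ACCTAC", match=5, mismatch=-4))  # 25 - 4 = 21
```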

-The objective is to provide a relatively heavy penalty for aligning two residues together if they have a low probability of being homologous

Two major forces drive amino acid substitution rates away from uniformity: not all substitutions occur with the same frequency, and some substitutions are less functionally tolerated than others and are therefore selected against

Gap penalties: needed because gaps must sometimes be introduced into one or both sequences in order to produce a proper alignment

-The penalty for creating a gap should be large enough that gaps are introduced only where needed, and the penalty for extending a gap should take into account the likelihood that insertions and deletions occur over several residues at a time
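
One common way to meet both requirements is an affine gap penalty: a large one-time cost for opening a gap plus a small cost per gapped residue. The sketch below assumes that scheme with illustrative values (open 11, extend 1, match +5, mismatch -4); the function names are hypothetical.

```python
def gap_cost(length, gap_open=11, gap_extend=1):
    """Cost of a single gap spanning `length` residues (affine scheme)."""
    return gap_open + gap_extend * length if length > 0 else 0


def score_alignment(a, b, match=5, mismatch=-4, gap_open=11, gap_extend=1, gap="-"):
    """Score two aligned, equal-length strings using affine gap penalties."""
    score = 0
    current = None  # which sequence the currently open gap sits in, if any
    for x, y in zip(a, b):
        where = "a" if x == gap else "b" if y == gap else None
        if where is None:
            score += match if x == y else mismatch
        else:
            if where != current:
                score -= gap_open  # a new gap starts here
            score -= gap_extend
        current = where
    return score


# One 3-residue gap is much cheaper than three separate 1-residue gaps.
print(gap_cost(3), 3 * gap_cost(1))                # 14 versus 36
print(score_alignment("ACGTTTAC", "ACG---AC"))     # 25 - 14 = 11
print(score_alignment("ACGTTTAC", "A-G-T-AC"))     # 25 - 36 = -11
```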

-Dynamic programming is a time-saving shortcut developed in the 1950s. The overall problem is determining the optimal alignment of two sequences. It is broken down into smaller and smaller subproblems, down to the alignment of a single residue from one sequence with a single residue from the other, and the score for that smallest case is taken from the scoring matrix
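
A compact sketch of the dynamic programming idea, assuming flat match/mismatch scores and a linear gap penalty (illustrative values only). With local=True it follows the Smith-Waterman recurrence for local alignment, and with local=False the Needleman-Wunsch recurrence for global alignment. It returns only the best score, not the alignment itself, and the function name is hypothetical.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=True):
    """Fill the DP table and return the best local or global alignment score."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment must pay for leading gaps.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                table[i - 1][j - 1] + pair,  # align the two residues
                table[i - 1][j] + gap,       # gap in b
                table[i][j - 1] + gap,       # gap in a
            )
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            table[i][j] = cell
            best = max(best, cell)
    return best if local else table[-1][-1]


# The shared region ACGT is found locally despite unrelated flanks.
print(align_score("GGGACGTCCC", "TTACGTAA", local=True))   # 4
print(align_score("GGGACGTCCC", "TTACGTAA", local=False))
```

Each cell depends only on its three neighbors, so the table is filled in time proportional to the product of the two sequence lengths instead of enumerating every possible alignment.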

Two primary methods take even shorter shortcuts by approximating the best local alignment: FASTA and BLAST. Both operate on the assumption that true matches are likely to have at least some short stretches of high-scoring similarity. BLAST uses a scoring matrix, BLOSUM62, for amino acid sequences

-The high scoring “hits” are used as seeds for the slower dynamic programming algorithm

-BLAST performs pre-processing of query sequence- to filter out low-complexity regions (such as CA repeats) and to discard words not likely to form high scoring pairs
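
To give a feel for low-complexity filtering, the sketch below masks windows whose Shannon entropy falls under a cutoff. This is a stand-in, not BLAST's filter: NCBI BLAST uses dedicated programs (DUST for nucleotides, SEG for proteins), and the window size, entropy cutoff, and function names here are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Entropy in bits per letter of the letter frequencies in `text`."""
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq, window=8, min_entropy=1.2, mask_char="N"):
    """Replace every residue inside a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


print(mask_low_complexity("CACACACACACA"))  # a pure CA repeat is fully masked
print(mask_low_complexity("ACGTACGTACGT"))  # a mixed sequence is left alone
```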

-Because it approximates the best alignment, BLAST has the potential to miss significant similarities present in the database. Even so, it is credited with better accuracy and is widely accepted as the standard

-BLAST compares segment pairs of the same length and identifies all pairs of similar segments whose scores exceed a given threshold

The resulting pairs of similar segments are called high-scoring segment pairs (HSPs)

The segment pair with the highest score is called the maximal-scoring pair (MSP)
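
The HSP and MSP definitions can be illustrated with a brute-force sketch that scans every diagonal of two short sequences. It assumes flat +5/-4 scoring and a made-up threshold of 15, keeps only the single best ungapped segment per diagonal, and is not how BLAST actually finds HSPs (BLAST starts from word hits). All names are hypothetical.

```python
def best_segment_on_diagonal(a, b, offset, match=5, mismatch=-4):
    """Best ungapped segment pairing a[i] with b[i + offset].

    Returns (score, start_in_a, length); score 0 means nothing positive.
    """
    best = (0, 0, 0)
    running, start = 0, 0
    for i in range(max(0, -offset), min(len(a), len(b) - offset)):
        pair = match if a[i] == b[i + offset] else mismatch
        if running <= 0:
            running, start = pair, i  # start a fresh segment here
        else:
            running += pair
        if running > best[0]:
            best = (running, start, i - start + 1)
    return best


def high_scoring_segment_pairs(a, b, threshold):
    """Return (score, segment_of_a, segment_of_b), best first."""
    hsps = []
    for offset in range(-(len(a) - 1), len(b)):
        score, start, length = best_segment_on_diagonal(a, b, offset)
        if score >= threshold:
            seg_a = a[start:start + length]
            seg_b = b[start + offset:start + offset + length]
            hsps.append((score, seg_a, seg_b))
    return sorted(hsps, reverse=True)


hsps = high_scoring_segment_pairs("TTACGTACGG", "CCACGTACAA", threshold=15)
print(hsps)
print("MSP:", hsps[0])  # (30, 'ACGTAC', 'ACGTAC')
```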

BLAST works in steps:

Step 1- Filters low complexity regions and removes them from the query sequence

Step 2- Searches through the target sequence database for exact matches to the word list generated

Step 3- The original BLAST method tried to extend the alignment from the matching words in both directions as long as the score continued to increase
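
A rough sketch of ungapped seed extension as described in Step 3, assuming flat +5/-4 scoring. With x_drop=0 the extension stops as soon as the score stops increasing, as the notes describe. The x_drop parameter is an added assumption that lets the extension tolerate a temporary dip in score, and all names are hypothetical.

```python
def _extend(query, subject, q, s, step, start_score, match, mismatch, x_drop):
    """Walk from (q, s) in direction `step`; return (best_score, residues_added)."""
    best = running = start_score
    added = steps = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += match if query[q] == subject[s] else mismatch
        steps += 1
        if running > best:
            best, added = running, steps
        elif best - running > x_drop:
            break  # score has fallen too far below the best seen
        q += step
        s += step
    return best, added


def extend_seed(query, subject, q_pos, s_pos, w, match=5, mismatch=-4, x_drop=0):
    """Extend an exact word match of length w in both directions."""
    score = w * match  # the seed itself is an exact match
    score, right = _extend(query, subject, q_pos + w, s_pos + w, 1,
                           score, match, mismatch, x_drop)
    score, left = _extend(query, subject, q_pos - 1, s_pos - 1, -1,
                          score, match, mismatch, x_drop)
    length = left + w + right
    q_start, s_start = q_pos - left, s_pos - left
    return score, query[q_start:q_start + length], subject[s_start:s_start + length]


# Seed CGT at position 3 in both sequences grows into ACGTAC.
print(extend_seed("GGACGTACTT", "CCACGTACAA", 3, 3, 3))  # (30, 'ACGTAC', 'ACGTAC')
# Allowing a small drop lets the extension pass a single mismatch.
print(extend_seed("ACGTTACGT", "ACGTAACGT", 0, 0, 3, x_drop=5))
```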

-Then BLAST determines whether each score found is greater in value than a given cutoff score S, determined empirically by examining the range of scores given by comparing random sequences and then choosing a value that is significantly greater
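
The empirical idea behind the cutoff can be sketched as a small simulation. Several things here are assumptions made for illustration: random sequences are generated rather than drawn from a database, scoring is a flat +5/-4, and "significantly greater" is taken to mean three standard deviations above the mean random score. The function names are hypothetical.

```python
import random
import statistics


def ungapped_score(a, b, match=5, mismatch=-4):
    """Flat match/mismatch score over two equal-length sequences."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def empirical_cutoff(query, trials=1000, sds_above=3.0, alphabet="ACGT", seed=0):
    """Score the query against random sequences; return (mean, cutoff)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    scores = []
    for _ in range(trials):
        random_seq = "".join(rng.choice(alphabet) for _ in query)
        scores.append(ungapped_score(query, random_seq))
    mean = statistics.mean(scores)
    spread = statistics.pstdev(scores)
    return mean, mean + sds_above * spread


query = "ACGTACGTACGTACGT"
mean, cutoff = empirical_cutoff(query)
print("typical random score:", mean)
print("cutoff S:", cutoff)
print("perfect match beats cutoff:", ungapped_score(query, query) > cutoff)
```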

Blast Off!

-BLAST splits the query into several files and compares them independently in order to generate different results regarding the taxonomy of the organism, structure, protein domain or sequence title

-Non-redundant (nr) is the most common option and includes all major existing databases. You can choose which database to search against, and selecting one reveals more detailed information about that database









