We are going to analyze the DNA sequences
TGTTACGG for similarity.
First, we initialize the first row and column of the matrix to
From top to bottom and left to right, at each nucleotide intersection the two bases are compared and a score is computed based on their compatibility according to the algorithm.
Starting from the highest value, we traverse the matrix recursively to identify the greatest subsequent scores. This identifies the base pairs we will use in our final result.
From the path of the traceback, we identify intersections of the same base which we use to determine our final result. As you can see from the highlighted bases, our final result is:
G T T G A C
| | | | | |
G T T - A C
The algorithm performs local sequence alignment on two substrings of nucleic acids or protein sequences. Local sequence alignment is frequently performed in bioinformatics to "identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences" (Wikipedia).
The Smith-Waterman algorithm was proposed in 1981 as an alternative to the earlier Needleman-Wunsch algorithm with the intention of improving the readability of the scoring matrix.
Here's a summary of how the algorithm works:
"Scores" are assigned to matches and mismatches between base pairs. The exact value of each score is assigned by the researcher and typically will be determined by the particular goal of their research.
A matrix of size
1 + length(sequence1) x
1 + length(sequence2) is created, and
initialized with zeroes in the first row and column.
Each pair of base matches is assigned a final score equal to the highest scoring test against other cells in the matrix (more info on this here).
Starting with the highest final score, the matrix is traversed recursively along the highest-scoring cells until zero is reached. This process identifies the base pairs used in the final result.