< Back

Smith-Waterman Algorithm

Interactive Example  See the Code

1. Initialize Matrix

We are going to analyze the DNA sequences GGTTGACTA and TGTTACGG for similarity.

First, we initialize the first row and column of the matrix to 0.

2. Score Matrix

From top to bottom and left to right, at each nucleotide intersection the two bases are compared and a score is computed based on their compatibility according to the algorithm.

3. Traceback

Starting from the highest value, we traverse the matrix recursively to identify the greatest subsequent scores. This identifies the base pairs we will use in our final result.

4. Final Result

From the path of the traceback, we identify intersections of the same base which we use to determine our final result. As you can see from the highlighted bases, our final result is:

G T T G A C
| | | | | |
G T T - A C

The Algorithm

The algorithm performs local sequence alignment on two substrings of nucleic acids or protein sequences. Local sequence alignment is frequently performed in bioinformatics to "identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences" (Wikipedia).

The Smith-Waterman algorithm was proposed in 1981 as an alternative to the earlier Needleman-Wunsch algorithm with the intention of improving the readability of the scoring matrix.

Here's a summary of how the algorithm works:

  1. Define substitution matrix and gap penalties
  2. "Scores" are assigned to matches and mismatches between base pairs. The exact value of each score is assigned by the researcher and typically will be determined by the particular goal of their research.

  3. Initialize the scoring matrix
  4. A matrix of size  1 + length(sequence1)  x  1 + length(sequence2)  is created, and initialized with zeroes in the first row and column.

  5. Score pairs and fill matrix
  6. Each pair of base matches is assigned a final score equal to the highest scoring test against other cells in the matrix (more info on this here).

  7. Traceback
  8. Starting with the highest final score, the matrix is traversed recursively along the highest-scoring cells until zero is reached. This process identifies the base pairs used in the final result.