Longest Common Subsequence (LCS)

May 21, 2023

The Longest Common Subsequence (LCS) algorithm is a dynamic programming technique for finding the longest subsequence shared by two or more strings or sequences. A subsequence is a sequence that can be derived from another sequence by deleting some or no elements without changing the order of the remaining elements. Unlike a substring, it does not have to be contiguous: “ACE” is a subsequence of “ABCDE”. The LCS algorithm is commonly used in computational biology, natural language processing, and data compression.

History and Development

The dynamic programming solution to the LCS problem is usually traced to work on sequence comparison in the early 1970s. Robert A. Wagner and Michael J. Fischer described the table-filling approach in their 1974 paper “The String-to-String Correction Problem,” and in 1975 Daniel S. Hirschberg showed how to recover an LCS using only linear space. Since then, the problem has been studied extensively and has become a standard textbook example of dynamic programming.

Key Concepts and Principles

The key concept behind the LCS algorithm is dynamic programming. The algorithm relies on the fact that the LCS of two sequences can be derived from the LCS of their prefixes. To find the LCS of two sequences of lengths m and n, the algorithm constructs an (m+1) × (n+1) matrix in which entry (i, j) holds the length of the LCS of the first i elements of one sequence and the first j elements of the other. The matrix is filled using a recursive formula: if the current elements of the two sequences match, the entry is one more than the entry for the two shorter prefixes; otherwise it is the larger of the two entries obtained by dropping the current element from one sequence or the other.
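Written out, the recursive formula described above looks roughly as follows. The notation is an assumption made here for illustration: C[i][j] is the matrix entry for the first i elements of X and the first j elements of Y, and x_i, y_j are the 1-indexed elements of the two sequences.

```latex
C[i][j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0,\\
C[i-1][j-1] + 1 & \text{if } i, j > 0 \text{ and } x_i = y_j,\\
\max\bigl(C[i-1][j],\; C[i][j-1]\bigr) & \text{if } i, j > 0 \text{ and } x_i \neq y_j.
\end{cases}
```

The length of the LCS of the full sequences is then the bottom-right entry, C[m][n].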

The LCS algorithm uses the following principles:

  • Dynamic programming: The algorithm breaks down a complex problem into smaller subproblems and builds up the solution from the subproblems.
  • Optimal substructure: The optimal solution to the problem can be derived from the optimal solutions to its subproblems.
  • Overlapping subproblems and memoization: The same subproblems recur many times, so the algorithm stores each result (in a cache or a table) to avoid redundant calculations.
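As a minimal sketch of how these principles fit together, the recursion can be written top-down with a cache. The function name lcs_length_memo and the helper solve are hypothetical names chosen for this illustration, and the sketch assumes string inputs short enough to stay within Python's recursion limit.

```python
from functools import lru_cache


def lcs_length_memo(x: str, y: str) -> int:
    """Length of the LCS of x and y, computed top-down with memoization."""

    @lru_cache(maxsize=None)
    def solve(i: int, j: int) -> int:
        # solve(i, j) is the LCS length of the prefixes x[:i] and y[:j].
        if i == 0 or j == 0:
            return 0  # an empty prefix has nothing in common with anything
        if x[i - 1] == y[j - 1]:
            # Optimal substructure: a match extends the LCS of the shorter prefixes.
            return solve(i - 1, j - 1) + 1
        # Otherwise drop the last element of one prefix or the other.
        return max(solve(i - 1, j), solve(i, j - 1))

    return solve(len(x), len(y))


print(lcs_length_memo("ABCDGH", "AEDFHR"))  # 3
```

The cache ensures that each (i, j) pair is solved once, so the running time is proportional to the number of distinct pairs rather than to the exponential number of recursive paths.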

Pseudocode and Implementation Details

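The following Python function implements the LCS algorithm:

def LCS(X, Y):
    """Return the length of the longest common subsequence of X and Y."""
    m = len(X)
    n = len(Y)
    # C[i][j] holds the LCS length of the prefixes X[:i] and Y[:j].
    # Row 0 and column 0 stay 0 because an empty prefix shares nothing.
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                # Matching elements extend the LCS of the shorter prefixes.
                C[i][j] = C[i - 1][j - 1] + 1
            else:
                # Otherwise keep the better result of dropping one element.
                C[i][j] = max(C[i - 1][j], C[i][j - 1])
    return C[m][n]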

The function LCS takes two sequences X and Y as input and returns the length of their LCS. The algorithm uses a two-dimensional array C in which C[i][j] stores the length of the LCS of the first i elements of X and the first j elements of Y. The if statement checks whether the current elements of the two sequences are the same. If they are, the LCS length of the two shorter prefixes, C[i-1][j-1], is incremented by 1. If they differ, the two candidates C[i-1][j] and C[i][j-1] are compared, and the larger is selected.

The time complexity of the LCS algorithm is O(mn), where m and n are the lengths of the input sequences. The space complexity of the algorithm is also O(mn).
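When only the length is needed, the full matrix is not required, because each row depends only on the row above it. The following is a sketch of that idea rather than a definitive implementation; the name lcs_length_two_rows is hypothetical.

```python
def lcs_length_two_rows(x, y):
    """LCS length using O(min(m, n)) extra space instead of the full table."""
    if len(y) > len(x):
        x, y = y, x  # keep the shorter sequence along the row
    prev = [0] * (len(y) + 1)  # the row for the prefix of x seen so far
    for xi in x:
        curr = [0]  # column 0: the empty prefix of y
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(lcs_length_two_rows("ABCDGH", "AEDFHR"))  # 3
```

The trade-off is that the discarded rows can no longer be traced back, so this variant yields the length only, not the subsequence itself.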

Examples and Use Cases

Example 1: Finding the LCS of two strings

Suppose we have two strings: “ABCDGH” and “AEDFHR”. The LCS of these strings is “ADH”. To find the LCS using the LCS algorithm, we apply the following steps:

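  1. Construct a matrix with the lengths of the LCS of their prefixes. Rows follow “ABCDGH” and columns follow “AEDFHR”; the unlabeled first row and column stand for the empty prefix: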
  |   | A | E | D | F | H | R |
--|---|---|---|---|---|---|---|
  | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
A | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
B | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
C | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
D | 0 | 1 | 1 | 2 | 2 | 2 | 2 |
G | 0 | 1 | 1 | 2 | 2 | 2 | 2 |
H | 0 | 1 | 1 | 2 | 2 | 3 | 3 |
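  2. Trace back through the matrix from the bottom-right cell. Whenever the row and column characters match, that character belongs to the LCS and the path moves diagonally; otherwise the path moves up or left toward the larger neighboring value. The matched cells are:
  |   | A | E | D | F | H | R |
--|---|---|---|---|---|---|---|
  |   |   |   |   |   |   |   |
A |   | A |   |   |   |   |   |
B |   |   |   |   |   |   |   |
C |   |   |   |   |   |   |   |
D |   |   |   | D |   |   |   |
G |   |   |   |   |   |   |   |
H |   |   |   |   |   | H |   |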

The LCS is “ADH”.
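The two steps of this example can be sketched in code as follows. The function name lcs_string is hypothetical, the sketch assumes string inputs, and moving up on ties is one possible tie-breaking rule among several.

```python
def lcs_string(x: str, y: str) -> str:
    """Return one longest common subsequence of x and y."""
    m, n = len(x), len(y)

    # Step 1: fill the table of LCS lengths for all pairs of prefixes.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])

    # Step 2: walk back from the bottom-right cell, collecting matches.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])  # matched character: move diagonally
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1  # move up (also chosen on ties)
        else:
            j -= 1  # move left
    return "".join(reversed(out))  # matches were collected back to front


print(lcs_string("ABCDGH", "AEDFHR"))  # ADH
```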

Example 2: DNA sequence alignment

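The LCS algorithm is used in computational biology as a simple model for comparing DNA sequences. Given two DNA sequences, the algorithm finds the longest common subsequence, that is, the longest series of bases that appears in both sequences in the same order. The length of this subsequence gives a basic measure of similarity, and the positions left unmatched point to possible insertions, deletions, or mutations. Practical alignment tools typically build on the same dynamic programming idea but add scoring for mismatches and gaps.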

Advantages and Disadvantages

Advantages

  • The LCS algorithm is efficient and has a time complexity of O(mn), where m and n are the lengths of the input sequences.
  • The algorithm is easy to implement and can be adapted to various problems.
  • The algorithm can be extended to find the actual LCS, not just its length.

Disadvantages

  • The basic table-filling routine returns only the length of the LCS. Recovering the subsequence itself requires an additional traceback through the matrix.
  • The algorithm has a space complexity of O(mn), which can be a limitation for large input sequences.
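  • The LCS is not always unique. When several subsequences share the maximum length, a standard traceback returns only one of them, determined by how ties are broken. Enumerating all of them can take much longer, because their number can grow exponentially with the input length.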
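To illustrate the case where the LCS is not unique, the following sketch follows every tied branch of the traceback and collects all distinct results. The function name all_lcs and the input strings are assumptions made for this illustration, and the approach is only practical for short inputs.

```python
from functools import lru_cache


def all_lcs(x: str, y: str) -> list:
    """Return every distinct LCS of x and y, sorted alphabetically."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])

    @lru_cache(maxsize=None)
    def collect(i: int, j: int) -> frozenset:
        # All distinct LCS strings of the prefixes x[:i] and y[:j].
        if i == 0 or j == 0:
            return frozenset({""})
        if x[i - 1] == y[j - 1]:
            return frozenset(s + x[i - 1] for s in collect(i - 1, j - 1))
        found = set()
        if c[i - 1][j] >= c[i][j - 1]:
            found |= collect(i - 1, j)  # moving up keeps the maximum length
        if c[i][j - 1] >= c[i - 1][j]:
            found |= collect(i, j - 1)  # moving left keeps the maximum length
        return frozenset(found)

    return sorted(collect(m, n))


print(all_lcs("ABCBDAB", "BDCABA"))  # ['BCAB', 'BCBA', 'BDAB']
```

For “ABCDGH” and “AEDFHR” the same function returns a single result, “ADH”, because that LCS is unique.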

Longest Increasing Subsequence (LIS)

The Longest Increasing Subsequence (LIS) problem is closely related to LCS: it asks for the longest subsequence of a single sequence whose elements are in increasing order. It can be solved with a similar dynamic programming approach, using a recursive formula that compares the current element with each earlier element of the sequence. It can also be reduced to LCS directly: for a sequence of distinct values, the LIS is the LCS of the sequence and its sorted copy. The LIS algorithm is used in various applications, such as data analysis, finance, and optimization.
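A minimal sketch of the dynamic programming approach to LIS, assuming a strictly increasing subsequence is wanted. The function name lis_length and the sample data are made up for this illustration, and this quadratic version is chosen for clarity rather than speed.

```python
def lis_length(seq) -> int:
    """Length of the longest strictly increasing subsequence of seq."""
    if not seq:
        return 0
    # best[i] is the length of the longest increasing subsequence ending at seq[i].
    best = [1] * len(seq)
    for i in range(1, len(seq)):
        for j in range(i):
            if seq[j] < seq[i]:
                # seq[i] can extend any increasing subsequence ending at seq[j].
                best[i] = max(best[i], best[j] + 1)
    return max(best)


print(lis_length([10, 9, 2, 5, 3, 7, 101, 18]))  # 4
```

In this sample, one longest increasing subsequence is 2, 3, 7, 18.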

Edit Distance

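Edit distance measures the difference between two strings as the minimum number of operations required to transform one string into the other. In its most common form, the Levenshtein distance, the operations are insertions, deletions, and substitutions of single characters. It is computed with a table-filling recurrence much like the one used for LCS, and the two problems are directly linked: if only insertions and deletions are allowed, the distance between strings of lengths m and n equals m + n minus twice the length of their LCS. The Edit Distance algorithm is commonly used in natural language processing, spell-checking, and DNA sequence alignment.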
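The following sketch computes the Levenshtein form of edit distance with unit cost for each operation. The function name edit_distance and the sample words are assumptions made for this illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    m, n = len(a), len(b)
    # d[i][j] is the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters of a[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[m][n]


print(edit_distance("kitten", "sitting"))  # 3
```

The table has the same shape as the LCS matrix, but the first row and column count insertions and deletions instead of staying at zero, and each cell takes a minimum rather than a maximum.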