Pages

Longest Common Subsequence by Dynamic Programming

Given two strings, find their longest common subsequence.

This is a well known programming problem (to read more about it, you could go to this wikipedia page) that is commonly used as introduction to the dynamic programming solving method.

Its simpler version requires to return just the size of the subsequence. Determining the actual result is a bit more complicated.

I have written a few test cases to clarify to myself what the problem is about, and to get driven in the software development. Here are a couple of them (written for the C++ GoogleTest framework):
int lcsSize(const std::string& first, const std::string& second);
std::string lcsStr(const std::string& first, const std::string& second);

TEST(Lcs, CaseSize)
{
  EXPECT_EQ(2, lcsSize("AGCAT", "GAC"));
  EXPECT_EQ(3, lcsSize("ABCDGH", "AEDFHR"));
  EXPECT_EQ(4, lcsSize("AGGTAB", "GXTXAYB"));
}

TEST(Lcs, CaseStr)
{
  EXPECT_EQ("GA", lcsStr("AGCAT", "GAC"));
  EXPECT_EQ("ADH", lcsStr("ABCDGH", "AEDFHR"));
  EXPECT_EQ("GTAB", lcsStr("AGGTAB", "GXTXAYB"));
}
As you would have already get, lcsSize() is the simpler one, and lcsStr() the more complete. Both functions require to generate a matrix where all the subproblem results are stored. If we want to get just the result size, we would just peak the result of the last one (the most right-below element). Otherwise we'll need to navigate the matrix to build the string up:
std::vector<std::vector<int>> lcs(const std::string& lhs, const std::string& rhs)
{
  const unsigned rows = lhs.size() + 1; // 1
  const unsigned cols = rhs.size() + 1;

  std::vector<std::vector<int>> buffer(rows); // 2
  for(unsigned i = 0; i < rows; ++i) // 3
    buffer[i].resize(cols);

  for(unsigned i = 1; i < rows; ++i) // 4
    for(unsigned j = 1; j < cols; ++j)
      buffer[i][j] = (lhs[i-1] == rhs[j-1]) ? // 5
          buffer[i-1][j-1] + 1 : std::max(buffer[i-1][j], buffer[i][j-1]);

  return buffer;
}
1. The matrix is going to have an extra row and column. This is not a strict necessity, but it makes the code more readable.
2. A simple way to implement a matrix is making it a vector of vectors. Notice the double '>' sign, it is legal in C++11 but it would confuse older compilers.
3. The vector of vectors ctor in the line above creates "rows" zero-sized rows. Here I assign them the right size. Remember that each element is initialized with the default value, that is zero.
4. Loop on all the "real" elements to solve the problem using the previous results.
5. To determine the current value, I check if the relative letters in the input string are the same (notice the "-1", it is due because of the fake top/left elements in the matrix). If it is the case, I have found another matching character, so I increase the value. Otherwise the lcs is not increasing, check the current biggest value and use it.

Now it is trivial to get the common subsequence size, it is the value stored in the last visited cell in the matrix:
int lcsSize(const std::string& lhs, const std::string& rhs)
{
  std::vector<std::vector<int>> buffer = lcs(lhs, rhs);
  return buffer[lhs.size()][rhs.size()];
}
Returning the actual subsequence requires some more code:
std::string lcsStr(const std::string& lhs, const std::string& rhs)
{
  std::vector<std::vector<int>> buffer = lcs(lhs, rhs);

  std::pair<unsigned, unsigned> cur { lhs.size(), rhs.size() }; // 1
  std::vector<char> result; // 2
  while(unsigned size = buffer[cur.first][cur.second] > 0) // 3
  {
    if(buffer[cur.first-1][cur.second] < buffer[cur.first][cur.second] &&
        buffer[cur.first][cur.second-1] < buffer[cur.first][cur.second])
    { // 4
      result.push_back(rhs[cur.second-1]);
      --cur.first;
      --cur.second;
      --size;
    }
    else if(buffer[cur.first][cur.second-1] == buffer[cur.first][cur.second]) // 5
      --cur.second;
    else
      --cur.first;
  }

  return std::string(result.rbegin(), result.rend()); // 6
}
1. The starting point is the matrix right bottom.
2. Each time I find a subsequence character, I'll push it here.
3. We know the subsequence length, we can use it to stop looping when we find all its elements.
4. This cell is marked as being relative to a matching character. Save its value and move up and to the left.
5. Otherwise, move in the matrix (to the left or up), until we find another border character.
6. Now is just a matter of building a string that revert the sequence stored in the resulting vector.

No comments:

Post a Comment