Longest common subsequence problem

This is a famous computer science problem with various applications in tools such as file diff, bio informatics etc. For more information read this on wiki.

Given a set of strings, how do we find the longest common sub-sequence(LCS) among them?

For example, LCS{“ATDACG”,”CTDACGA”} is “TAG”.

In this post, we will look at the algorithm to find the LCS of two strings. To understand the solution, let us walk through a simple example.

Let us consider the strings “ACDG”, “CAG”. We can recursively define the LCS of two strings str1, str2 of lengths n,m as follows.

LCS(str1, str2) =

LCS(str1[0:n-2], str2[0:m-2])+str1[n-1] if str1[n-1] = str2[m-1]

Max(LCS(str1,str2[0:m-2], LCS(str1[0:n-2],str2)) otherwise

Using the above definition

LCS(“ACDG”,”CAG”) = LCS(“ACD”, “CA”) + “G”

This means that since the last character is same, in the sub-problem we can delete it from both the strings.

Further calculations go as follows

LCS(“ACD”, “CA”) = Max( LCS(“AC”, “CA”), LCS(“ACD”,”C”) )
LCS(“ACD”, “C”) = Max( LCS(“AC”, “C”), LCS(“ACD”,””) )

and so on…

we can see that LCS(“ACD”,”CA”) is “C”; so “CG” is the final output.

If we implement this naive recursion, we repeat the calculation for many sub problems. For efficient implementation we use a 2 dimensional array of size (n+1)*(m+1).The table is calculated using the rules given below which directly follow from the recursive definition given above.

table[i][j]= 0 if either of the string is empty
= 1 + table[i-1][j-1] if their last characters are equal
= max(table[i-1][j], table[i][j-1]) otherwise

Here is an example table.

  ∅ B  A  C  B  A  D

∅ 0 0 0 0 0 0 0
A  0  1  1  1  1  1 1

B  0 1  1  1  2  2  2

A  0 1  2  2  2  3  3

Z  0 1  2  2  2  3  3

D  0 1  2  2  2  3  4

C  0 1  2  3  3  3  4

Here are the python implementations of both approaches.

	#recursive function to find LCS; exponential time complexity O(2^n)
	def lcs_recursive(input1, input2):
	n = len(input1)
	m = len(input2)
	if n == 0 or m == 0:
	return ""
	else:
	if input1[n-1] == input2[m-1]:
	return lcs_recursive(input1[:n-1], input2[:m-1]) + input1[n-1]
	else:
	p1 = lcs_recursive(input1[:n-1], input2)
	p2 = lcs_recursive(input1, input2[:m-1])
	if len(p1) > len(p2):
	return p1
	else:
	return p2

	#Dynamic programming implementation of LCS : Time complexity O(n*m)
	def lcs(input1, input2):
	n = len(input1)
	m = len(input2)
	#create a 2D array to hold intermediate results of sub problems
	table = [[0 for i in range(m+1)] for j in range(n+1)]

	#calculate the length of longest common sub sequence
	for i in range(n+1):
	for j in range(m+1):
	if i == 0 or j == 0:
	table[i][j] = 0
	elif input1[i-1] == input2[j-1]:
	table[i][j] = table[i-1][j-1]+1
	else:
	table[i][j] = max(table[i-1][j],table[i][j-1])
	# Now table[n][m] has the max length of common subsequence

	#find the actual sequence
	i = n
	j = m
	result = ''
	while i > 0 and j > 0:
	if input1[i-1] == input2[j-1]:
	result = input1[i-1] + result
	i = i - 1
	j = j - 1
	elif table[i-1][j] > table[i][j-1]:
	i = i - 1
	else:
	j = j - 1

	return result

	#Testing code
	str1 = "abcg"
	str2 = "acdeg"
	print lcs(str1,str2)

	str1 = "abc"
	str2 = "def"
	print lcs(str1,str2)

	str1 = ""
	str2 = "abc"
	print lcs(str1,str2)

	str1 = "ATDACG"
	str2 = "CTGADGA"
	print lcs_recursive(str1,str2)

view raw LCS.py hosted with

by GitHub

Leave a Reply Cancel reply