Category Archives: Strings

How to check if a string is a rotation of another string

Given two strings, how to check one is a rotation of the other?

For example consider the strings “abcd”, “cdab”. If we rotate any string twice, you will get the other string.

A simple way of solving this problem is to rotate one string, and see if it matches the other string. 

But we don’t know after how many rotations, it matches with the other string. In worst case, there will be (L-1) rotations, where L is the length of the string. Because If we rotate a string L times, it restores it’s original order. Each time we rotate, we have to check the equality of strings which is O(n) time in worst case. So the overall complexity of this solution is O(n2).

Is there any other simpler technique? Yes!

The trick here is to observe that if we append a string to itself, it contains all of its rotation patterns! 

For example consider the string “abc”, after appending to itself it becomes “abcabc”. Now try to see all it’s rotation patterns like “bca”, “cab” appear in it.


So the problem is simply finding the sub-string in the expanded string. Here is the simple program written in C++/Python.

Minimum number of characters to insert to make a palindrome

Given a string, how do you find the minimum number of characters to insert to make it a palindrome?
For example given a string “abbc”, the number of characters to insert is 2, and the resulting palindrome is “cabbac”
Take another example “abc”, we need to insert two characters at a minimum to make it a palindrome “abcba”
If the given string is already a palindrome, the result is 0.

Let us analyze the problem to find out a solution. Given a string, we have two possible cases
  • Case 1:The first and last characters are equal. In this case, the minimum number of characters to insert depends on the sub-string between the first and last characters. 
    • For example consider the string “abca”, the value depends on the sub-string “bc”
  • case 2:The first and last characters are not equal. In this case, we have to make the first and last characters equal to make it a palindrome. We can do this in two ways. Either we can insert a new character before the first character, or next to the last character. 
    • For example consider “gcc”, to make first and last characters equal, we can insert a “c” first or insert a “g” last. It either becomes “cgcc” or “gccg”. since “gccg” is minimal way of making a palindrome, We just need to insert 1 character.
Let us denote the minimum number of characters to insert as MCI. It can be defined as follows. str is a string of length n.

MCI(str) = MCI( str[1:n-2] ) if str[0] = str[n-1]
         = Min( MCI(str[0:n-2]), MCI(str[1:n-1]))+1 otherwise

A naive recursive implementation is straight forward. To implement dynamic programming solution, we create a a two dimensional table of size n*n.

table[i][j] indicates MCI of the sub-string str[i:j] the table is calculated using the following rules.

table[i][i] is always 0 because a single character is a palindrome.

table[i][j] = table[i+1][j-1] if str[i] = str[j]
            = Min(table[i+1][j], table[i,j-1])+1 otherwise

We construct this table in bottom up manner i.e solve the smaller problems first and use their result to solve bigger problems. Finally table[0][n-1] contains the required value.

Below is the dynamic programming implementation in C++.

Longest common subsequence problem

This is a famous computer science problem with various applications in tools such as file diff, bio informatics etc. For more information read this on wiki.
Given a set of strings, how do we find the longest common sub-sequence(LCS) among them?

For example, LCS{“ATDACG”,”CTDACGA”} is  “TAG”.

In this post, we will look at the algorithm to find the LCS of two strings. To understand the solution, let us walk through a simple example.

Let us consider the strings “ACDG”, “CAG”. We can recursively define the LCS of two strings str1, str2 of lengths n,m as follows.
LCS(str1, str2) = 
    LCS(str1[0:n-2], str2[0:m-2])+str1[n-1] if str1[n-1] = str2[m-1]
    Max(LCS(str1,str2[0:m-2], LCS(str1[0:n-2],str2)) otherwise
Using the above definition
LCS(“ACDG”,”CAG”) = LCS(“ACD”, “CA”) + “G”

This means that since the last character is same, in the sub-problem we can delete it from both the strings.
Further calculations go as follows

LCS(“ACD”, “CA”) = Max( LCS(“AC”, “CA”), LCS(“ACD”,”C”) )
LCS(“ACD”, “C”) = Max( LCS(“AC”, “C”), LCS(“ACD”,””) )

and so on… 

we can see that LCS(“ACD”,”CA”) is “C”; so “CG” is the final output.

If we implement this naive recursion, we repeat the calculation for many sub problems. For efficient implementation we use a 2 dimensional array of size (n+1)*(m+1).The table is calculated using the rules given below which directly follow from the recursive definition given above.

table[i][j]= 0 if either of the string is empty
           = 1 + table[i-1][j-1] if their last characters are equal
           = max(table[i-1][j], table[i][j-1]) otherwise

Here is an example table.

  ∅ B A C B A D
∅ 0 0 0 0 0 0 0
A  0 1 1 1 1 1 1
B  0 1 1 1 2 2 2
A  0 1 2 2 2 3 3
Z  0 1 2 2 2 3 3
D  0 1 2 2 2 3 4
C  0 1 2 3 3 3 4

Here are the python implementations of both approaches. 

   

Number of characters appearing in all given strings

Given a set of n strings, we have to write a program to find out the number of characters appearing in all the strings.

For example consider the following strings
{“India”, “China”, “United states”}, the letters {a,i,n} appear in all the strings.

Let us think of the following solution.
  • We first add all the characters in the first string to a set. 
  • For each string we do the following.
    • For each letter in the set, we check if it can be found the string.
    • If it is not found, we delete it from the set
  • Finally the set contains only the elements which appear in all the given strings.

Here is the C++ code which implements the above approach. For simplicity this program assumes that all the strings contains lower case alphabets only.

Number of deletions needed to make anagrams

Given two strings, how do we find the number of characters to be deleted from both the strings in order to make them anagrams.

Two words are said to be anagrams if one word can be formed by re-arranging the letters of another word. For example the words “TAN” and “ANT” are anagrams.

Given two words “clean” and “lion”, we need to delete 5 characters from both the strings, {c,e,a} from the first and {i,o} from the second.

Let us discuss the approach to solve this problem.

If two words are anagrams, they both have same set of characters along with their counts. So to make two random words as anagrams, we have to delete all the characters that are present in one word but not in the other word.

To do this 

  • We first create the frequency table of each character present in the first word.
  • Next we iterate through each character in the second word, and try to find it in the frequency table.
    • If there is a non zero count for that character, we decrement it.
    • Otherwise we need to delete the character.
  • After processing all the characters in the second word, we need to delete if there are any extra characters present in the table.


Here is the Python program which implements the same.


Longest substring with equal number of 0s and 1s

Given a string consisting of only 0s and 1s, How do we write a program to find the longest sub string with equal number of 0s and 1s in it?

For example consider the string “01001”, the longest string with equal number of 0s and 1s is “1001”.

Let us look at the solution approaches.
Considering 0 as -1 and 1 as +1, at each index we maintain a cumulative “balance” value.
 

Here balance is the difference between the count of 1 and 0 appeared so far.

For example for the example string “01001”, the balance values are as follows

Index:     0   1   2   3   4
Bit:       0   1   0   0   1
Balance:  -1   0  -1  -2  -1

In this balance array, the longest distance between any two equal values gives us the required result.
 

In the above example, balance value “-1” appear at index 0 and index 4. So 4-0 = 4 is the maximum length of the sub string with equal number of 0,1s.
and the sub string is 1001 ( sub string between 1 and 4 indices).

Let us look at two possible approaches which utilizes the above property.


Method 1: Simple O(n2) approach
 
This method iterates through all possible sub strings and finds out which one is the longest with equal number of 0, 1s.
To check if a sub string has equal number of 0,1, we check the balance value at each index. If it is zero, we will update the maximum length if required.

Method 2: Efficient O(n) approach

 
This method needs extra space to run in O(n) time. For a string of length n, the range of balance values are in the range (-n, n); -n when all entries are 0, and n when all entries are 1.

We will create an array of size 2n+1 which acts as a hash table for all possible balance values. This table stores the leftmost index in the array for each balance value.

Since the array is 0 indexed, table[n+b] stores the leftmost index with balance b. At each index we keep on updating the balance value.

  • If there is no entry in the table with the given balance, we store the index for b.
  • If the table contains a previous entry with the same balance value, we check if the sub string between this previous index and the current index is the longest sub string.

Below is the C++ program to which implements the two approaches discussed above.

Longest substring with unique characters

Given a string, how do we write a program to find the length of longest sub-string without any characters repeating in it.

Let us call it a unique sub string. For example consider the string “program” the maximum length of a unique sub-string is 5 for the sub string “ogram”.

A method by checking all possible sub-strings for the given property is obvious, but it is not time efficient. A naive implementation takes O(n3) time.

Let us look at more efficient approach which runs in O(n) time. Let us assume that the string contains all possible 256 ASCII characters.

We take a Boolean array of size 256 to indicate if a character exists or not.

Take two indices ind1 and ind2 initially pointing to the start of the string. As long as we find a unique character at ind2 we keep on incrementing it. Whenever we see a duplicate character , we move ind1 one index beyond the first instance of that character, keeping track of the maximum length of the sub string found so far.

For example when we find ‘r’ for the second time, the indices will be adjusted like the following.
p  r  o  g  r  a  m
      |     |
    ind1   ind2

Here is the C++ implementation.


Finding the longest common prefix from given strings

Given a list of strings, how do we find the longest common prefix of all the strings?

For example consider { “Anomaly”, “Anatomy”, “Analog”, “Angry” } the longest common prefix is “An”.

This is a simple algorithm. Assume the first string as the longest common prefix, and try to match with the other strings and modify the prefix accordingly.

Here is the C++ code which implements the above approach.

Generating the counting sequence

Given an integer N, we have to generate the nth sequence of the following pattern.
  • “1”
  • “11” as the previous item contains one instance of 1.
  • “21” as the previous entry contains two 1’s.
  • “1211” as the previous entry contains one 2 and one 1.
Following is the program which generates such a sequence. It runs in O(n*k) (where k is the length of the sequence) time and it utilizes only constant extra space.

Finding the edit distance between two strings

Given two strings as input, write a program to find the edit distance between them. 
The edit distance is defined as the minimum number of modifications to do on one string to transform it into the other string. A modification can be inserting a character, deleting a character, or replacing an existing character.
For example, consider the two strings “saturday” and “sunday”. The edit distance is 3. There is no way, we can change one string to the other in less than 3 modifications.
The 3 modifications that can be done on “saturday” are
remove ‘a’ –> sturday
remove ‘t’ –> surday
replace ‘r’ with ‘n’ –> sunday

The following is the algorithm based on dynamic programming.

Let us assume string1 is of length N and string2 is of length M. We create a table of size (N+1) X (M+1).

The entry table[i][j] gives the edit distance between prefix of string1 ending with ith character and prefix of string2 ending with jth character.

0 <= i <= N and 0 <= j <= M

We fill this table in a bottom-up manner. This is essentially solving the problem by combining the results of it’s sub-problems. (Dynamic programming). The entries are filled using the following formula

table[0][i] = i
table[i][0] = i

Explanation:
Assume that 0th character is empty, the edit distance between an empty string and a string of length ‘i’ is always ‘i’ because by deleting all the ‘i’ characters, it becomes an empty string.

table[i][j] = table[i-1][j-1] if string1[i] = string2[j]

Explanation:
If string1[i] and string2[j] are equal, we don’t need any modifications. Hence it is as same as table[i-1][j-1]

table[i][j] = Min(table[i-1][j],table[i][j-1],table[i-1][j-1]) + 1

Explanation:
If string1[i] and string2[j] are not equal, we have three choices.

  • Modify the differing character string1[i] to match string2[j]. This requires a cost of table[i-1][j-1] + 1
  • Delete string1[i] and try to match the remaining with string2. This requires a cost of table[i-1][j] + 1
  • Insert string1[i] into string2. This requires  a cost of table[i][j-1] + 1.

Finally table[N][M] contains the required answer.

The time and space complexity for this algorithm is O(n * m).

Following is the C++ implementation of the above algorithm.