DAA: KMP Algorithm

KMP ALGORITHM

The KMP algorithm is abbreviated as the "Knuth Morris Pratt” algorithm. This algorithm was developed by all of them. 

This algorithm searches a pattern of length m in a string text of length n. The application of pattern searching can be seen when we search something on browsers, folders, etc.

The pattern searching can be done using the naive approach, too, but it takes O (m*(n - m + 1)) time complexity in the worst case, which is not efficient. Moreover, the naive approach doesn't work fine in the cases where there are many matching characters and one mismatch.

Examples of some of the cases are:

  text[] = "AAAAAAAAAAAAAAAAAB"
   pattern[] = "AAAAB"
  text[] = "ABACABABABCABABABC"
  pattern[] =  "ABABAC" 

The KMP algorithm works on the idea of degeneration, i.e., a pattern having the same sub-pattern occurring more than once in the pattern. This idea reduces the time complexity to O(n) in the worst case. After some matches, if a mismatch occurs, we already know some of the characters of the pattern in the next window. This will prevent matching the characters again and again.

Let us understand this with the following illustration:

 text = "AAAAABAAABA"
 pattern = "AAAA" 

By Comparing the first window with the pattern we get,

 text = "AAAAABAAABA"
 pat = "AAAA"  [Initial position] 

We found a match. Till here, it works similar to the Naive approach

In the next step, we compare the next window of text with pat.

 text = "AAAAABAAABA"
 pat =  "AAAA" [Pattern shifted one position ahead] 

This is the index where KMP rules over the naive approach. When we iterate in the next window, we see that the last A in the pattern matches with the character in the current window.

We skipped the first three characters because they had previously matched from the first window.

Why Pre-processing?

Pre-processing is done to skip the characters as we have previously found some chunk of the pattern. It can be done by maintaining an array of pattern length so that we can know how many characters are to be skipped in a particular window.

Working on pre-processing method:

  1. In KMP, we pre-process the pattern by creating an array of size patterns, i.e., LP[], which keeps track of the characters to be skipped while matching.
  2. LP here means “longest proper prefix which is also a suffix”. We search for LP in sub-patterns.
  3. For each sub-pattern (pattern [0...i] (i from 0 to m - 1), LP[i] stores the maximum length up to which the pattern is found.

  LP[i] = the longest proper prefix of pattern [0..i]   which is also a suffix of pattern[0..i].

Example of LP construct:

For the pattern “AAAA” : LP[] is [0, 1, 2, 3] because A matches as index increases.

For the pattern “ABCDE” : LP[] is [0, 0, 0, 0, 0]

For the pattern “AABAACAABAA” : LP[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]

Now let us understand how searching will work:

Iterate through the pattern and compare all the characters at each chunk. Now the value of LP[] will tell whether the new incoming character will be matched or not. This will benefit in not matching the character for which a match already exists.

  • With an index of j as zero, compare pattern[] with all characters of the text.
  • If the characters of text and pattern keep matching, we keep incrementing i and j.
  • Elsewhen a mismatch occurs then,
    • The match of pattern pattern[0..j-1] occurs uptotext text[i-j…i-1].(Initially,  j starts with 0 and increments it only when there is a match, and i increases as it iterates in the text).
    • From the definition, it is clear that LP[j-1] is a count of characters of the pattern[0…j-1] that are both proper prefixes and suffixes.
    • There will be no matching of LP[j-1] characters with the text[i-j…i-1] as they will match every time.

For example:

 text[] = "AAAAABAAABA" 
 pattern[] = "AAAA"
 LP[] = {0, 1, 2, 3} 
 Start to iterate with i = 0 and j = 0 
 text[] = "AAAAABAAABA" 
 pattern[] = "AAAA"
 text[i] and pattern[j] match, do i++, j++
 i = 1, j = 1
 text[] = "AAAAABAAABA" 
 pattern[] = "AAAA"
 text[i] and pattern[j] match, do i++, j++
 i = 2, j = 2
 text[] = "AAAAABAAABA" 
 pattern[] = "AAAA"
 pattern[i] and pattern[j] match, do i++, j++
 i = 3, j = 3
 text[] = "AAAAABAAABA" 
 pattern[] = "AAAA"
 text[i] and pattern[j] match, do i++, j++
 i = 4, j = 4
 Since j == M, print pattern found and reset j,
 j = LP[j-1] = LP[3] = 3
 The first three characters of the window will not be matched again like we do in the naive algorithmi = 4, j = 3
 text[] = "AAAAABAAABA" 
 pattern[] =  "AAAA"
 text[i] and pattern[j] match, do i++, j++
 i = 5, j = 4
 Since j == M, print pattern found and reset j,
 j = LP[j-1] = LP[3] = 3
 The first three characters of the window will not be matched again like we do in the naive algorithm.
 i = 5, j = 3
 text[] = "AAAAABAAABA" 
 pattern[] =   "AAAA"
 text[i] and pattern[j] do NOT match and j > 0, change only j
 j = LP[j-1] = LP[2] = 2
 i = 5, j = 2
 text[] = "AAAAABAAABA" 
 pattern[] =    "AAAA"
 text[i] and pattern[j] do NOT match and j > 0, change only j
 j = LP[j-1] = LP[1] = 1 
 i = 5, j = 1
 text[] = "AAAAABAAABA" 
 pattern[] =     "AAAA"
 text[i] and pattern[j] do NOT match and j > 0, Decrement j
 j = LP[j-1] = LP[0] = 0
 i = 5, j = 0
 text[] = "AAAAABAAABA" 
 pattern[] =      "AAAA"
 text[i] and pattern[j] do NOT match. Since j is 0 we increment i. 
 i = 6, j = 0
 text[] = "AAAAABAAABA" 
 pattern[] =       "AAAA"
 text[i] and pattern[j] match, do i++ and j++
 i = 7, j = 1
 text[] = "AAAAABAAABA" 
 pattern[] =       "AAAA"
 text[i] and pattern[j] match,  increment i and j.
 Hence, the processing is done.
 

C++ Code:

#include <bits/stdc++.h>
 using namespace std;
 void computeLPSArray(char* pattern, int M, int* LP);
 void KMPSearchAlgo(char* pattern, char* text) // Print the occurences of pattern in text
 {
     int M = strlen(pattern);
     int N = strlen(text);
     int LP[M]; // Create LP array
     computeLPSArray(pattern, M, LP); // Preprocess LP[]
     int i = 0; // index for text[]
     int j = 0; // index for pattern[]
     while (i < N) {
         if (pattern[j] == text[i]) {
             j++;
             i++;
         }
         if (j == M) {
             cout << "Found at index " << i - j << endl;
             j = LP[j - 1];
         }
         else if (i < N && pattern[j] != text[i]) { // mismatch after j matches
             if (j != 0) // No matching of LP[0..LP[j-1]] characters, because they will match anyway
                 j = LP[j - 1];
             else
                 i = i + 1;
         }
     }
 }
 void computeLPSArray(char* pattern, int M, int* LP) // Fills LP[] for given patttern pattern[0..M-1]
 {
     int length = 0;
     LP[0] = 0; // 0th index doesn't have any match
     int i = 1;
     while (i < M) { // Find LP[i] for i = 1 to M-1
         if (pattern[i] == pattern[length]) {
             length++;
             LP[i] = length;
             i++;
         }
         else // (pattern[i] != pattern[len])
         {
             if (length != 0) {
                 length = LP[length - 1];
             }
             else {
                 LP[i] = 0;
                 i++;
             }
         }
     }
 }
 // Main function to run KMP
 int main()
 {
     char text[] = "ABABDABACDABABCABAB";
     char pattern[] = "ABABCABAB";
     KMPSearchAlgo(pattern, text);
     return 0;
 } 

C code:

 #include <stdio.h>
 #include<string.h>
 void computeLPSArray(char* pattern, int M, int* LP);
 void KMPSearchAlgo(char* pattern, char* text) // Print the occurences of pattern in text
 {
             int M = strlen(pattern);
             int N = strlen(text);
             int LP[M];  // Create LP array
             computeLPSArray(pattern, M, LP); // Preprocess LP[]
             int i = 0; // index for text[]
             int j = 0; // index for pattern[]
             while (i < N) {
                         if (pattern[j] == text[i]) {
                                     j++;
                                     i++;
                         }
                         if (j == M) {
                         printf( "Found at index %d",i - j);
                                     j = LP[j - 1];
                         }
                         else if (i < N && pattern[j] != text[i]) 
{    // mismatch after j matches
     if (j != 0)          
 // No matching of LP[0..LP[j-1]] characters, because they will match anyway
                 j = LP[j - 1];
                       else
                          i = i + 1;
                         }
             }
 }
 void computeLPSArray(char* pattern, int M, int* LP) // Fills LP[] for given patttern pattern[0..M-1]
 {
             int length= 0;
             LP[0] = 0; // 0th index doesn't have any match
             int i = 1;
             while (i < M)
             {           // Find LP[i] for i = 1 to M-1
                         if (pattern[i] == pattern[length]) {
                                     length++;
                                     LP[i] = length;
                                     i++;
                         }
                         else // (pattern[i] != pattern[len])
                         {
                                     if (length != 0) {
                                                 length = LP[length - 1];
                                     }
                                     else
                                     {
                                                 LP[i] = 0;
                                                 i++;
                                     }
                         }
             }
 }
 // Main function to run KMP
 int main()
 {
             char text[] = "ABABDABACDABABCABAB";
             char pattern[] = "ABABCABAB";
             KMPSearchAlgo(pattern, text);
             return 0;
 } 

Time complexity:

O(n) in the worst case.