Chances are that if you've worked with strings then you have also tried to locate a specific substring in the entire source string. In computer science this is called pattern matching or string matching. And to accomplish this task, there are a few classic algorithms. In this two-part series we’re going to discuss string searching algorithms. We will start with Knuth-Morris-Pratt.
Pattern Matching Algorithms Demystified: KMP - The Theory (Page 2 of 4 )
First of all, let me explain why a naïve string searching algorithm isn’t enough for us. It is the easiest solution; we can intuitively guess that it requires two loops. The complexity of each loop is O(n) and O(m) respectively. Ultimately, since these the loops are nested, the overall complexity of this kind of naïve algorithm is O(n*m). Of course, here we were calculating the worst case scenario.
The K-M-P algorithm offers a feasible solution to speed up the pattern searching process. Apparently, we need to generate a sort of DFA—deterministic finite automaton. We’re going to call this the prefix array. It is going to help us because it keeps track of the partial matches. Basically, the prefix array will show us the longest words that are the suffix of T and the prefix of P. This is what we call an overlap.
In a nutshell, K-M-P starts out from left to right and it searches for valid matches. In the case of a mismatch, you can improve the length of the upcoming switches if you eliminate the possibilities of the switches where know beforehand that there’s no point moving there because there won’t be a valid match.
The simplest example is “abac” and we’re searching for the pattern “ac.” At first, we start out from the left. We find out that “ab” doesn’t match with “ac,” so we need to move to the right. There’s no point moving right by only one position because then we would compare “ba” with “ac.” We know that the final character was “b,” so it cannot compose a valid match regardless of the second character. Instead, we can jump right to the match: “ac” = “ac.”
This is the optimization that goes into the previous Morris-Pratt algorithm. The K-M-P implements this optimization. But in order to efficiently incorporate this into your search algorithm, you need to use finite automatons. You need to generate that said prefix array. Once you’ve got it, then the rest of the searching algorithm becomes easy.
Before we discuss the implementation, check out the following three examples. T stands for the source string (text) in which we are searching for the pattern called P. The prefix array is abbreviated as PR. Furthermore, you can see each step of the searching algorithm and the total amount of comparisons. In our approach, the first element of the prefix array is 0 (you can find lots of variations that begin with -1).
T = abababac
P = ababac
PR= 001230
abababac
ababac = no match
abababac
ababac = correct match > print
abababac
ababac = no match
We can easily lead you through the prefix array creation if we consider the above example. You see, the pattern string is “ababac.” When we generate the prefix array, we work only with the pattern. Thus, we can totally ignore the contents of the T string.
Here’s how we build the prefix array: the first element is always 0, the next element becomes 0, since “a” doesn’t match with “b,” the third element is 1 because “a” matches with “a” (this is the third character in the string), so the maximum length of the prefix substring in the pattern is 1. After this, “ab” does match with “ab” so it becomes 2. Moreover, you can see that “aba” matches “aba,” so the length increases to 3. We continue with “c,” but it does not match the previous “abab” substring, so it becomes 0.
Enough theory. Let’s see if you could grasp its methodology. Try to understand the following two examples. They are typical examples of the K-M-P algorithm in action.
T = abababacabac
P = acabac
PR= 001012
abababacabac
acabac = no match
abababacabac
acabac = no match
abababacabac
acabac = no match
abababacabac
acabac = correct match > print
abababacabac
acabac = no match
T = abcabcdabcdeabc
P = abcdef
PR= 000000
abcabcdabcdeabc
abcdef = no match
abcabcdabcdeabc
abcdef = no match
abcabcdabcdeabc
abcdef = no match
abcabcdabcdeabc
abcdef = no match
In the next section we’ll see how this algorithm can be coded in C++ programming language. Don’t forget that our implementation will find all of the valid matches, not only the first. If you only need the first match, then just return that position of the valid match in order to quit from the function.