Home Development Cycles Page 2 - Pattern Matching Algorithms Demystified: KMP
 DEVELOPMENT CYCLES RSS

# Pattern Matching Algorithms Demystified: KMP

Chances are that if you've worked with strings then you have also tried to locate a specific substring in the entire source string. In computer science this is called pattern matching or string matching. And to accomplish this task, there are a few classic algorithms. In this two-part series we’re going to discuss string searching algorithms. We will start with Knuth-Morris-Pratt.

Author Info:
Rating:  / 19
July 29, 2008
 TABLE OF CONTENTS:

print this article
SEARCH DEVARTICLES

 Pattern Matching Algorithms Demystified: KMP - The Theory (Page 2 of 4 )First of all, let me explain why a naïve string searching algorithm isn’t enough for us. It is the easiest solution; we can intuitively guess that it requires two loops. The complexity of each loop is  O(n) and O(m) respectively. Ultimately, since these the loops are nested, the overall complexity of this kind of naïve algorithm is O(n*m). Of course, here we were calculating the worst case scenario.The K-M-P algorithm offers a feasible solution to speed up the pattern searching process. Apparently, we need to generate a sort of DFA—deterministic finite automaton. We’re going to call this the prefix array. It is going to help us because it keeps track of the partial matches. Basically, the prefix array will show us the longest words that are the suffix of T and the prefix of P. This is what we call an overlap.In a nutshell, K-M-P starts out from left to right and it searches for valid matches. In the case of a mismatch, you can improve the length of the upcoming switches if you eliminate the possibilities of the switches where know beforehand that there’s no point moving there because there won’t be a valid match.The simplest example is “abac” and we’re searching for the pattern “ac.” At first, we start out from the left. We find out that “ab” doesn’t match with “ac,” so we need to move to the right. There’s no point moving right by only one position because then we would compare “ba” with “ac.” We know that the final character was “b,” so it cannot compose a valid match regardless of the second character. Instead, we can jump right to the match: “ac” = “ac.”This is the optimization that goes into the previous Morris-Pratt algorithm. The K-M-P implements this optimization. But in order to efficiently incorporate this into your search algorithm, you need to use finite automatons. You need to generate that said prefix array. Once you’ve got it, then the rest of the searching algorithm becomes easy.Before we discuss the implementation, check out the following three examples. T stands for the source string (text) in which we are searching for the pattern called P. The prefix array is abbreviated as PR. Furthermore, you can see each step of the searching algorithm and the total amount of comparisons. In our approach, the first element of the prefix array is 0 (you can find lots of variations that begin with -1).T = abababacP = ababacPR= 001230abababacababac = no matchabababacababac = correct match > printabababacababac = no matchWe can easily lead you through the prefix array creation if we consider the above example. You see, the pattern string is “ababac.” When we generate the prefix array, we work only with the pattern. Thus, we can totally ignore the contents of the T string.Here’s how we build the prefix array: the first element is always 0, the next element becomes 0, since “a” doesn’t match with “b,” the third element is 1 because “a” matches with “a” (this is the third character in the string), so the maximum length of the prefix substring in the pattern is 1. After this, “ab” does match with “ab” so it becomes 2. Moreover, you can see that “aba” matches “aba,” so the length increases to 3. We continue with “c,” but it does not match the previous “abab” substring, so it becomes 0.Enough theory. Let’s see if you could grasp its methodology. Try to understand the following two examples. They are typical examples of the K-M-P algorithm in action.T = abababacabacP = acabacPR= 001012abababacabacacabac = no matchabababacabacacabac = no matchabababacabacacabac = no matchabababacabacacabac = correct match > printabababacabacacabac = no match T = abcabcdabcdeabcP = abcdefPR= 000000abcabcdabcdeabcabcdef = no matchabcabcdabcdeabcabcdef = no matchabcabcdabcdeabcabcdef = no matchabcabcdabcdeabcabcdef = no matchIn the next section we’ll see how this algorithm can be coded in C++ programming language. Don’t forget that our implementation will find all of the valid matches, not only the first. If you only need the first match, then just return that position of the valid match in order to quit from the function. Next: C Implementation >> Please enable JavaScript to view the comments powered by Disqus.blog comments powered by Disqus
 Weekly Newsletter Developer Updates Free Website Content

© 2003-2018 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap