Development Cycles
  Home arrow Development Cycles arrow Page 2 - Pattern Matching Algorithms Demystified: K...
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Mobile Linux 
App Generation ROI 
IBM® developerWorks 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
DEVELOPMENT CYCLES

Pattern Matching Algorithms Demystified: KMP
By: Barzan "Tony" Antal
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 4 stars4 stars4 stars4 stars4 stars / 8
    2008-07-29

    Table of Contents:
  • Pattern Matching Algorithms Demystified: KMP
  • The Theory
  • C Implementation
  • Taking a Break

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    Pattern Matching Algorithms Demystified: KMP - The Theory


    (Page 2 of 4 )

    First of all, let me explain why a naïve string searching algorithm isn’t enough for us. It is the easiest solution; we can intuitively guess that it requires two loops. The complexity of each loop is  O(n) and O(m) respectively. Ultimately, since these the loops are nested, the overall complexity of this kind of naïve algorithm is O(n*m). Of course, here we were calculating the worst case scenario.

    The K-M-P algorithm offers a feasible solution to speed up the pattern searching process. Apparently, we need to generate a sort of DFA—deterministic finite automaton. We’re going to call this the prefix array. It is going to help us because it keeps track of the partial matches. Basically, the prefix array will show us the longest words that are the suffix of T and the prefix of P. This is what we call an overlap.

    In a nutshell, K-M-P starts out from left to right and it searches for valid matches. In the case of a mismatch, you can improve the length of the upcoming switches if you eliminate the possibilities of the switches where know beforehand that there’s no point moving there because there won’t be a valid match.

    The simplest example is “abac” and we’re searching for the pattern “ac.” At first, we start out from the left. We find out that “ab” doesn’t match with “ac,” so we need to move to the right. There’s no point moving right by only one position because then we would compare “ba” with “ac.” We know that the final character was “b,” so it cannot compose a valid match regardless of the second character. Instead, we can jump right to the match: “ac” = “ac.

    This is the optimization that goes into the previous Morris-Pratt algorithm. The K-M-P implements this optimization. But in order to efficiently incorporate this into your search algorithm, you need to use finite automatons. You need to generate that said prefix array. Once you’ve got it, then the rest of the searching algorithm becomes easy.

    Before we discuss the implementation, check out the following three examples. T stands for the source string (text) in which we are searching for the pattern called P. The prefix array is abbreviated as PR. Furthermore, you can see each step of the searching algorithm and the total amount of comparisons. In our approach, the first element of the prefix array is 0 (you can find lots of variations that begin with -1).

    T = abababac

    P = ababac

    PR= 001230

    abababac

    ababac = no match

    abababac

    ababac = correct match > print

    abababac

    ababac = no match

    We can easily lead you through the prefix array creation if we consider the above example. You see, the pattern string is “ababac.” When we generate the prefix array, we work only with the pattern. Thus, we can totally ignore the contents of the T string.

    Here’s how we build the prefix array: the first element is always 0, the next element becomes 0, since “a” doesn’t match with “b,” the third element is 1 because “a” matches with “a” (this is the third character in the string), so the maximum length of the prefix substring in the pattern is 1. After this, “ab” does match with “ab” so it becomes 2. Moreover, you can see that “aba” matches “aba,” so the length increases to 3. We continue with “c,” but it does not match the previous “abab” substring, so it becomes 0.

    Enough theory. Let’s see if you could grasp its methodology. Try to understand the following two examples. They are typical examples of the K-M-P algorithm in action.

    T = abababacabac

    P = acabac

    PR= 001012

    abababacabac

    acabac = no match

    abababacabac

    acabac = no match

    abababacabac

    acabac = no match

    abababacabac

    acabac = correct match > print

    abababacabac

    acabac = no match

     

    T = abcabcdabcdeabc

    P = abcdef

    PR= 000000

    abcabcdabcdeabc

    abcdef = no match

    abcabcdabcdeabc

    abcdef = no match

    abcabcdabcdeabc

    abcdef = no match

    abcabcdabcdeabc

    abcdef = no match

    In the next section we’ll see how this algorithm can be coded in C++ programming language. Don’t forget that our implementation will find all of the valid matches, not only the first. If you only need the first match, then just return that position of the valid match in order to quit from the function.

    More Development Cycles Articles
    More By Barzan "Tony" Antal


       · Thanks for reading this article; hope you've found it informative and...
       · thanks 4 the article. really found it informative....
     

    DEVELOPMENT CYCLES ARTICLES

    - Division of Large Numbers
    - Branch and Bound Algorithm Technique
    - Dynamic Programming Algorithm Technique
    - Genetic Algorithm Techniques
    - Greedy Strategy as an Algorithm Technique
    - Divide and Conquer Algorithm Technique
    - The Backtracking Algorithm Technique
    - More Pattern Matching Algorithms: B-M
    - Pattern Matching Algorithms Demystified: KMP
    - Coding Standards
    - A Peek into the Future: Transactional Memory
    - Learning About the Graph Construct using Gam...
    - Learning About the Graph Construct using Gam...
    - Learning About the Graph Construct using Gam...
    - How to Strike a Match







    © 2003-2009 by Developer Shed. All rights reserved. DS Cluster 6 Hosted by Hostway
    For more Enterprise Application Development news, visit eWeek