This is the second and final half of our two-part series on pattern matching, or string searching algorithms. In the first part, we covered the Knuth-Morris-Pratt (KMP) algorithm and in this segment, we’re going to present a new algorithm that originates from Boyer-Moore. It is currently considered the most efficient and practical algorithm, serving as a benchmark standard.

More Pattern Matching Algorithms: B-M - Implementation (Page 3 of 4 )

Before we begin, let me point out that there are hundreds of implementations of the Boyer-Moore algorithms, just as there are with other typical algorithms of computer science. Every now and then, a totally different implementation of it may occur.

Ultimately, it all comes down to whether it works or not. If you do not increase its complexity, either in space or in time, then the overall efficiency of the implementations will still be about the same. And if resource and time efficiency is that important to you, then you should already have basic knowledge in calculating, doing the tests, and mathematically proving its feasibility.

Here in this article we present a somewhat simple implementation that focuses mostly on its efficient and compact code. It is written in a style that it is easily understood. The presented code is C/C++. And at the end of this article you can download the entire source code of this Boyer-Moore algorithm. We will also exemplify its execution.

We are going to have three different procedures, one that represents the main B-M algorithm and the other two will illustrate the two preprocessing phases: the good suffix shift and the bad character shift jump tables. As soon as we have generated these two jump tables, we can run the Boyer-Moore algorithm.

void BadCharTable(string x, int m, int *tableBC)

{ // generates the Bad Character Shift Jump Table

for (int i = 0; i < 256; i++)

tableBC[i] = m;

for (int i = 0; i < m - 1; i++)

tableBC[x[i]] = m - i - 1;

printf("\nBad Character Shift Jump Table:\n");

for (int i = 0; i < m; i++)

printf("%d ", tableBC[x[i]]);

printf("\nEvery other character has %d value.\n", m);

}

Now comes the second preprocessing procedure:

void GoodSuffixTable(string x, int m, int *tableGS)

{ // generates the Good Suffix Shift Jump Table

int i, j, f, *suff;

suff = (int *)malloc(m * sizeof(int));

suff[m - 1] = m;

// note: on the last position of suff array the length of the pattern string is stored

// (the last character of the pattern serving as a suffix matches the string being slid with the entire length of the pattern string)

int g = m - 1; // beginning of the suffix table generation

for (int i = m - 2; i >= 0; i--)

{

if (i > g && (suff[i + m - 1 - f] < i - g))

suff[i] = suff[i + m - 1 - f];

else

{

if (i < g)

g = i;

f = i;

while (g >= 0 && x[g] == x[g + m - 1 - f])

g--;

suff[i] = f - g;

}

} // end of the suffix table generation

// we print out the content of the table for exemplification

printf("\nSuffixes Array: \n");

for (int i = 0; i < m; i++)

printf("%d ", suff[i]);

for (i = 0; i < m; i++)

tableGS[i] = m;

// we fill this jump table with the length of the pattern

j = 0;

for (i = m - 1; i >= 0; i--)

if (suff[i] == i + 1)

while (j < m - 1 - i)

{

if (tableGS[j] == m)

tableGS[j] = m - 1 - i;

j++;

}

for (i = 0; i <= m - 2; i++)

tableGS[m - 1 - suff[i]] = m - 1 - i;

printf("\nGood Suffix Shift Jump Table:\n");

for (i = 0; i < m; i++)

printf("%d ", tableGS[i]);

}

And here comes the main Boyer-Moore procedure:

void BoyerMoore(string x, int m, string y, int n)

{ // the typical BoyerMoore algorithm implementation

int i, j = 0, tableBC[256], tableGS[256];

// preprocessing procedures - we generate the jump tables

GoodSuffixTable(x, m, tableGS);

BadCharTable(x, m, tableBC);

// searching phase

while (j <= n - m)

{

for (i = m - 1; i >= 0 && x[i] == y[i + j]; i--);

// if the index of the above loop reaches below zero, it is an exact match

if (i < 0)

{ // if exact match, write it out

printf("Matches on %d position.\n", j);

j += tableGS[0];

}

else

// not a match, let's jump further (to the maximum between the values in the jump tables)

Please note that this is not the best possible implementation of the B-M algorithm in terms of speed. Therefore, it shouldn’t be used in comparisons with other pattern matching algorithms. If that’s what you want, then you may want to “speed up” the two preprocessing phases. Bottom-line is that it works, just don't take its execution as the fastest or the most efficient.

Needless to say, the elegance and simplicity of this implementation of the algorithm is to be appreciated the most. I know for a fact that this implementation (accompanied with the source codes) is being taught and tutored at various university courses, which is the reason why I picked this one.

As you may have already noticed, after each preprocessing stage we print out the content of the specific jump table. This way we can “visually” follow what’s going on “behind the scenes.” I have also thoroughly commented the snippets.