More Pattern Matching Algorithms: B-M - Implementation
(Page 3 of 4 )
Before we begin, let me point out that there are hundreds of implementations of the Boyer-Moore algorithms, just as there are with other typical algorithms of computer science. Every now and then, a totally different implementation of it may occur.
Ultimately, it all comes down to whether it works or not. If you do not increase its complexity, either in space or in time, then the overall efficiency of the implementations will still be about the same. And if resource and time efficiency is that important to you, then you should already have basic knowledge in calculating, doing the tests, and mathematically proving its feasibility.
Here in this article we present a somewhat simple implementation that focuses mostly on its efficient and compact code. It is written in a style that it is easily understood. The presented code is C/C++. And at the end of this article you can download the entire source code of this Boyer-Moore algorithm. We will also exemplify its execution.
We are going to have three different procedures, one that represents the main B-M algorithm and the other two will illustrate the two preprocessing phases: the good suffix shift and the bad character shift jump tables. As soon as we have generated these two jump tables, we can run the Boyer-Moore algorithm.
void BadCharTable(string x, int m, int *tableBC)
{ // generates the Bad Character Shift Jump Table
for (int i = 0; i < 256; i++)
tableBC[i] = m;
for (int i = 0; i < m - 1; i++)
tableBC[x[i]] = m - i - 1;
printf("\nBad Character Shift Jump Table:\n");
for (int i = 0; i < m; i++)
printf("%d ", tableBC[x[i]]);
printf("\nEvery other character has %d value.\n", m);
}
Now comes the second preprocessing procedure:
void GoodSuffixTable(string x, int m, int *tableGS)
{ // generates the Good Suffix Shift Jump Table
int i, j, f, *suff;
suff = (int *)malloc(m * sizeof(int));
suff[m - 1] = m;
// note: on the last position of suff array the length of the pattern string is stored
// (the last character of the pattern serving as a suffix matches the string being slid with the entire length of the pattern string)
int g = m - 1; // beginning of the suffix table generation
for (int i = m - 2; i >= 0; i--)
{
if (i > g && (suff[i + m - 1 - f] < i - g))
suff[i] = suff[i + m - 1 - f];
else
{
if (i < g)
g = i;
f = i;
while (g >= 0 && x[g] == x[g + m - 1 - f])
g--;
suff[i] = f - g;
}
} // end of the suffix table generation
// we print out the content of the table for exemplification
printf("\nSuffixes Array: \n");
for (int i = 0; i < m; i++)
printf("%d ", suff[i]);
for (i = 0; i < m; i++)
tableGS[i] = m;
// we fill this jump table with the length of the pattern
j = 0;
for (i = m - 1; i >= 0; i--)
if (suff[i] == i + 1)
while (j < m - 1 - i)
{
if (tableGS[j] == m)
tableGS[j] = m - 1 - i;
j++;
}
for (i = 0; i <= m - 2; i++)
tableGS[m - 1 - suff[i]] = m - 1 - i;
printf("\nGood Suffix Shift Jump Table:\n");
for (i = 0; i < m; i++)
printf("%d ", tableGS[i]);
}
And here comes the main Boyer-Moore procedure:
void BoyerMoore(string x, int m, string y, int n)
{ // the typical BoyerMoore algorithm implementation
int i, j = 0, tableBC[256], tableGS[256];
// preprocessing procedures - we generate the jump tables
GoodSuffixTable(x, m, tableGS);
BadCharTable(x, m, tableBC);
// searching phase
while (j <= n - m)
{
for (i = m - 1; i >= 0 && x[i] == y[i + j]; i--);
// if the index of the above loop reaches below zero, it is an exact match
if (i < 0)
{ // if exact match, write it out
printf("Matches on %d position.\n", j);
j += tableGS[0];
}
else
// not a match, let's jump further (to the maximum between the values in the jump tables)
j += max(tableGS[i], tableBC[y[i + j]] - m + 1 + i);
}
}
Please note that this is not the best possible implementation of the B-M algorithm in terms of speed. Therefore, it shouldn’t be used in comparisons with other pattern matching algorithms. If that’s what you want, then you may want to “speed up” the two preprocessing phases. Bottom-line is that it works, just don't take its execution as the fastest or the most efficient.
Needless to say, the elegance and simplicity of this implementation of the algorithm is to be appreciated the most. I know for a fact that this implementation (accompanied with the source codes) is being taught and tutored at various university courses, which is the reason why I picked this one.
As you may have already noticed, after each preprocessing stage we print out the content of the specific jump table. This way we can “visually” follow what’s going on “behind the scenes.” I have also thoroughly commented the snippets.
Next: Final Words >>
More Development Cycles Articles
More By Barzan "Tony" Antal