This is the second and final half of our two-part series on pattern matching, or string searching algorithms. In the first part, we covered the Knuth-Morris-Pratt (KMP) algorithm and in this segment, we’re going to present a new algorithm that originates from Boyer-Moore. It is currently considered the most efficient and practical algorithm, serving as a benchmark standard.
More Pattern Matching Algorithms: B-M - The Theory (Page 2 of 4 )
We all know that the iterative approach regarding pattern matching is starting from the left and moving throughout the entire string one-by-one examining the characters of the pattern and the source string, performing comparisons, and determining whether a valid solution can be found or not. This kind of code is slow, inefficient, and naïve.
Out-contrasting the classic iterative approach, the Boyer-Moore[B-M] algorithm performs larger jumps on each case of a mismatch. This reduces the overall time required to walk through the source string, because it does fewer comparisons. To intelligently calculate the length of these kinds of jumps, the algorithm must obtain some sort of information regarding the characters in the source and pattern strings.
As mentioned earlier, the algorithm works backwards during the comparison phase. It begins from the right of the pattern and it compares to the characters of the source string. Then, when a mismatch occurs, the algorithm has two possible options to choose from. We call these bad character shift and good suffix shift. The bad character shift is often called occurrence shift, while the latter is known as matching shift.
The most simplistic variation of the Boyer-Moore only does the bad character shift. This requires a processed table that we may also call a jump table. We can build a table like this starting from the last element of the pattern and calculating the distance of each character from the rightmost character. All of the other characters that aren’t present in the pattern string receive the length of the pattern.
The easiest implementation of this allocates a 256 elements (ASCII set) long array/vector and uses a two-pass algorithm once you fill up the entire array with the length of the pattern. Then you move through the pattern string, calculating the characters’ distance from the rightmost character and filling up their appropriate spot.
There is a special case that must be pointed out: when the character that we are examining right now is the last element of the pattern string (or the first during our analysis, since we begin from the end by going from right to the left). In this case, its distance from the rightmost character would be 0, but we consider its second apparition in the pattern.
The bad character shift basically means that it moves the pattern forward by so many positions that are required to match the mismatched character from the source string with the pattern. So it slides the pattern right below the source string until the mismatched character matches the pattern’s current position. If such a character does not exist, then it jumps with the entire length of the pattern. Check out the jump table.
Pattern => a s d f b b s
Jump Table => 6 5 4 3 1 1 5 and every other character has the value of 7
This is why all of the other characters that are not present in the pattern string receive the length of the entire pattern in the preprocessed jump table. On the other hand, if a character is, say, 4 positions from the rightmost character, then this represents the length of the jump in order to have an exact match. In order to preserve the clarity of this article, we’re going to call this the Bad Character Shift Jump Table.
As an extension of this algorithm, the more complex and original version of the B-M algorithm includes the good suffix shift along with the bad character shift. These two combined are why this algorithm is considered a standard for benchmarking and string searching. We need to understand that one too. At first glance, I’m sure you’ll notice some similarities with Knuth-Morris-Pratt’s algorithm.
The preprocessing stage of the Good Suffix Shift Jump Table is a bit more complex. Again we start out from the last element of the pattern string and we create substrings of it by increasing the size one-by-one. Imagine this: “Susan” is the pattern, so we’ll have “n," “an,” “san,” “usan,” and “Susan” as substrings. In short, we’ll have some 'for loop' from 1 to its length, creating substrings from the last i elements of the patterns.
Each of these substrings will be preceded by a mismatch of the character before it. After this, all we have to do is compare these substrings to the original pattern in order to calculate the least number of shifts (to the left) that are required in order for the specific substring of the pattern to match the original pattern string. This preprocessing phase is illustrated in the following table. Check it out!
Pattern => a s d f b b s
Suffixes => 0 1 0 0 0 0 7
(the last element of suffixes array is always the length of the pattern)
Jump Table => 7 7 7 7 7 5 1
Enough with theory, it’s time for us to actually present an implementation.