Tame the Beast by Matching Similar Strings - Similarity Ranking Methods
(Page 5 of 7 )
Similarity ranking methods compare a given string to a set of strings and rank those strings in order of similarity. To produce a ranking, we need a way of saying that one match is better than another. This is done by returning a numeric measure of similarity as the result of each comparison. Alternatively, you can think of the distance between two strings, instead of their similarity. Strings with a large distance between them have low similarity, and vice versa.
Two very common methods for ranking similarity are the Longest Common Sub-string and Edit Distance.
Longest Common Substring
The longest common substring between two strings is the longest contiguous chain of characters that exists in both strings. The longer the substring, the better the match between the two strings. This simple approach can work very well in practice.
A disadvantage of this approach is that the position of an ‘error’ in the input affects the computed similarity between the two strings. If the error occurs in the middle of the string, then the distance between the two strings will be greater than if the error occurred at one end. For example, suppose we make a simple typing error on the keyboard by pressing the key adjacent to the one intended. With a word such as ‘PINEAPPLE’, typing ‘PINESPPLE’ gives a longest common substring of length 4, whereas ‘OINEAPPLE’ gives a value of 8. The problem is that ‘PINESPPLE’ is deemed to be just as good a match with ‘PINEAPPLE’ as the string ‘PINE’, which is probably not what you want.
Next: Editing and Hamming Distances >>
More Development Cycles Articles
More By Simon White