In a previous article, Tame the Beast by Matching Similar Strings, I presented a brief survey of approximate string matching algorithms, and argued their importance for information retrieval tasks. A classic example of information retrieval using similarity searching is entering a keyword into the search string box on Amazon’s web site in order to retrieve descriptions of products related to that keyword. Approximate string matching algorithms can be classified as equivalence algorithms and similarity ranking algorithms. In this article, I present a new similarity ranking algorithm, together with its associated string similarity metric. I also include Java source code, so you can easily incorporate the algorithm into your own applications.

How to Strike a Match - The New Metric (Page 2 of 6 )

Given the drawbacks of the existing algorithms, I decided to invent a new string similarity metric that rewards both common substrings and a common ordering of those substrings. In addition, I wanted the algorithm to consider not only the single longest common substring, but other common substrings too.

The requirements may sound difficult to satisfy, but my solution is deceptively simple:

Find out how many adjacent character pairs are contained in both strings.

The intention is that by considering adjacent characters, I take account not only of the characters, but also of the character ordering in the original string, since each character pair contains a little information about the original ordering.

Let me explain the algorithm by comparing the two strings ‘France’ and ‘French’. First, I map them both to their upper case characters (making the algorithm insensitive to case differences), then split them up into their character pairs:

FRANCE: {FR, RA, AN, NC, CE} FRENCH: {FR, RE, EN, NC, CH}

Then I work out which character pairs are in both strings. In this case, the intersection is {FR, NC}. Now, I would like to express my finding as a numeric metric that reflects the size of the intersection relative to the sizes of the original strings. If pairs(x) is the function that generates the pairs of adjacent letters in a string, then my numeric metric of similarity is:

The similarity between two strings s1 and s2 is twice the number of character pairs that are common to both strings divided by the sum of the number of character pairs in the two strings. (The vertical bars in the formula mean ‘size of’.) Note that the formula rates completely dissimilar strings with a similarity value of 0, since the size of the letter-pair intersection in the numerator of the fraction will be zero. On the other hand, if you compare a (non-empty) string to itself, then the similarity is 1. For our comparison of ‘FRANCE’ and ‘FRENCH’, the metric is computed as follows:

Given that the values of the metric always lie between 0 and 1, it is also very natural to express these values as percentages. For example, the similarity between ‘FRANCE’ and ‘FRENCH’ is 40%. From now on, I will express similarity values as percentages, rounded to the nearest whole number.

Ranking Results

Typically, we don’t just want to know how similar two strings are. We want to know which of a set of known strings are most similar to a particular string. For example, which of the strings ‘Heard’, ‘Healthy’, ‘Help’, ‘Herded’, ‘Sealed’ or ‘Sold’ is most similar to the string ‘Healed’?

All we need do to answer the question is find the similarity between ‘Healed’ and each of the other words, and then rank the results in order of these values. The results for this example are presented in Table 1.