My interest in string similarity stems from a desire for good user interface design. Computers are seen by many as unfriendly, unforgiving beasts that respond unkindly to requests that are almost meaningful. In this article, I demonstrate how computers can be programmed to be more forgiving of their users’ mistakes, with no additional burden on the user such as learning a special query format. Moreover, the techniques described are very widely applicable and often easy to implement.
Tame the Beast by Matching Similar Strings - Editing and Hamming Distances (Page 6 of 7 )
This method focuses on the most common typing errors, namely character omissions, insertions, substitutions and reversals. The idea is to compute the minimum number of such operations that it would take to transform one string into another. This number gives an indication of the similarities of the strings. A value of 0 indicates that the two strings are identical
The algorithm can be described more generally by associating a cost with each of the operations, and deriving the distance between two strings as the minimum cost that transforms one string into another. There are two widely recognized variations of the edit distance. The Levenshtein Edit Distance is the most common variation and allows insertion, deletion or substitution of a single character where the cost of each operation is 1. The Damerau Edit Distance is identical to the Levenshtein edit distance, except that it also allows the operation of transposing (swapping) two adjacent characters.
Although I do not recommend the Hamming distance for the majority of string-based information retrieval tasks, I should mention it for completeness. The Hamming distance between two character strings is the number of positions in which the characters of the two strings are different. So, for example, the distance between ‘REPAIR’ and ‘REPOSE’ is 3, whereas the distance between ‘WORK’ and ‘REST’ is 4. (According to the literature, strings of different lengths have an infinite Hamming distance between them, so when comparing strings of different lengths, you may decide to ‘cheat’ by padding the shorter of the two strings on the right with extra spaces before comparing the strings.)
The problem with this metric is that apparently very similar strings can be given a high Hamming distance. Consider that there are no two pairs of characters that match positionally in the following two strings: