My interest in string similarity stems from a desire for good user interface design. Computers are seen by many as unfriendly, unforgiving beasts that respond unkindly to requests that are almost meaningful. In this article, I demonstrate how computers can be programmed to be more forgiving of their users’ mistakes, with no additional burden on the user such as learning a special query format. Moreover, the techniques described are very widely applicable and often easy to implement.
Tame the Beast by Matching Similar Strings - Equivalence Methods (Page 2 of 7 )
Equivalence methods compare two strings and return a value of true or false according to whether the method deems those two strings to be, in some sense, equivalent. In terms of user-interface design, your application can be more forgiving of user inputs if it accepts equivalent strings instead of only exact matches. A simple example of equivalence is to treat ‘Tweetle-Beetle Battle’ the same as ‘TWEETLE BEETLE BATTLE’ despite the differences in case, and the replacement of a hyphen with a space in the second string.
Word Stemming
Word stemming is a technique that reduces closely related words to a basic canonical form or ‘stem’. For example, the user inputs ‘swims’ and ‘swimming’ can be reduced to the basic stem ‘swim’ before performing an exact match against expected inputs. Stemming makes use of a suffix dictionary that contains lists of possible word endings. However, such a list is clearly language-dependent and even regional differences of the same language must be considered (for example, compare British spelling ‘standardise’ with American spelling ‘standardize’). Also, not all languages lend themselves to such treatment, although it has been demonstrated for most languages of the Indo-European family (which includes Latin-based and Germanic languages).
Deriving stemming algorithms is a difficult, time-consuming and error-prone activity. Therefore, for application building, I can only recommend using tools such as Snowball, with its suite of existing stemming algorithms for many languages.