Tame the Beast by Matching Similar Strings - Equivalence Methods
(Page 2 of 7 )
Equivalence methods compare two strings and return a value of true or false according to whether the method deems those two strings to be, in some sense, equivalent. In terms of user-interface design, your application can be more forgiving of user inputs if it accepts equivalent strings instead of only exact matches. A simple example of equivalence is to treat ‘Tweetle-Beetle Battle’ the same as ‘TWEETLE BEETLE BATTLE’ despite the differences in case, and the replacement of a hyphen with a space in the second string.
Word Stemming
Word stemming is a technique that reduces closely related words to a basic canonical form or ‘stem’. For example, the user inputs ‘swims’ and ‘swimming’ can be reduced to the basic stem ‘swim’ before performing an exact match against expected inputs. Stemming makes use of a suffix dictionary that contains lists of possible word endings. However, such a list is clearly language-dependent and even regional differences of the same language must be considered (for example, compare British spelling ‘standardise’ with American spelling ‘standardize’). Also, not all languages lend themselves to such treatment, although it has been demonstrated for most languages of the Indo-European family (which includes Latin-based and Germanic languages).
Deriving stemming algorithms is a difficult, time-consuming and error-prone activity. Therefore, for application building, I can only recommend using tools such as Snowball, with its suite of existing stemming algorithms for many languages.
Next: Synonyms and Regular Expressions >>
More Development Cycles Articles
More By Simon White