My interest in string similarity stems from a desire for good user interface design. Computers are seen by many as unfriendly, unforgiving beasts that respond unkindly to requests that are almost meaningful. In this article, I demonstrate how computers can be programmed to be more forgiving of their users’ mistakes, with no additional burden on the user such as learning a special query format. Moreover, the techniques described are very widely applicable and often easy to implement.
Tame the Beast by Matching Similar Strings - Synonyms and Regular Expressions (Page 3 of 7 )
In this approach, synonyms of expected inputs are stored explicitly. For example, with the string ‘television’, you might also store ‘TV’ and ‘televisions’; and with the string ‘license’ you might also store ‘licence’. As with word stemming, user inputs are converted to a canonical form before any further processing.
This mechanism can provide a forgiving user-interface, and is also language independent. Unfortunately, it does mean that many synonym strings must be prepared in advance to anticipate every possible user input. You could argue that the approach simply increases the number of expected inputs, rather than providing a better algorithm to find the strings that are of real interest. However, if you reconsider the problem of retrieving product descriptions from a database, you should see that there are other advantages. Firstly, as synonyms are resolved close to the user-interface, you can index your products in the system’s back end using a small, controlled keyword vocabulary. Secondly, the architecture provides a clean separation between these user-interface concerns and the integrity of the data.
Wildcards and Regular Expressions
When you search for a file on your hard disk, you might use a search pattern with a wildcard such as ‘*.txt’ (anything with the suffix ‘.txt’). In a similar way, applications that perform information retrieval can also employ pattern matching to improve the chances of finding the information of interest. One approach is to expose the full power of regular expressions in the user-interface, but the complex functionality and cryptic syntax usually confuses more than it helps. What we need is a way that harnesses the power of pattern matching without exposing it to the user. One idea is to prepend and append the user’s input with the wild card character, and then use regular expression matching instead of exact matching. This has the effect of searching for all strings that contain the user’s input. Another idea is to take each word (that is, space-separated token) of the input and apply the same wild card prepending and appending. In this case, the input ‘go fish’ would become ‘*go* *fish*’, which matches ‘gone fishing’ as well as ‘go fishing’.