Tame the Beast by Matching Similar Strings - Conclusions
Many information retrieval systems use a string-based query, often just a single string, to find the information of interest to their users. The ability to find relevant information from the user’s input is key to a successful system, and it can be significantly enhanced by employing an approximate string matching algorithm (which need not be invoked until it is known that no exact matches exist). Conversely, failure to find relevant information (particularly when the user knows it to be present) serves only to frustrate, and to perpetuate the myth of the cantankerous computer.
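As a rough illustration of invoking approximate matching only when no exact match exists, here is a minimal Python sketch. The function name find_matches and the 0.6 cutoff are assumptions made for this example, and the standard library's difflib matcher is used purely as a convenient stand-in for whichever approximate algorithm you choose.

```python
import difflib


def find_matches(query, entries, max_results=3):
    """Return exact matches if any exist; otherwise fall back to approximate ones."""
    # Cheap exact comparison first.
    exact = [entry for entry in entries if entry == query]
    if exact:
        return exact
    # No exact match: only now pay for approximate matching.
    # The cutoff of 0.6 is an arbitrary choice for this sketch.
    return difflib.get_close_matches(query, entries, n=max_results, cutoff=0.6)


print(find_matches("appel", ["apple", "banana"]))
```

The exact comparison costs little, so the more expensive approximate step runs only for queries that would otherwise return nothing.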
I described the algorithms in two classes: equivalence methods and similarity ranking methods. Equivalence methods return a Boolean result, whereas similarity ranking methods return a numeric similarity measure or distance metric. In information retrieval systems, it is possible to mix methods to produce a faster hybrid approach. A typical approach is a two-pass mechanism in which the database uses an equivalence method as a first-pass filter, and a ranked similarity method is then applied to the filtered entries in the second pass. Ranked similarity methods tend to be algorithmically more complex than equivalence methods, so they are usually implemented as custom code outside of the database.
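The two-pass mechanism might be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: Soundex stands in for the equivalence method, Levenshtein edit distance for the ranked similarity method, and a plain list stands in for the database. The function names soundex, edit_distance and two_pass_search are hypothetical names chosen for this example.

```python
def soundex(word):
    """American Soundex code: an equivalence key (equal codes count as a match)."""
    groups = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
              ("L", "4"), ("MN", "5"), ("R", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate letters with the same code
            prev = code
    return (result + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance: a ranked measure (smaller means more similar)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def two_pass_search(query, entries):
    """First pass: equivalence filter. Second pass: rank the survivors."""
    key = soundex(query)
    # In a real system the database would apply this filter, for example
    # against a precomputed, indexed Soundex column (an assumption here).
    filtered = [entry for entry in entries if soundex(entry) == key]
    return sorted(filtered,
                  key=lambda entry: edit_distance(query.lower(), entry.lower()))


names = ["Robert", "Rupert", "Rubin", "Roberta", "Albert"]
print(two_pass_search("Robret", names))  # ['Robert', 'Roberta', 'Rupert']
```

The design point is that the cheap filter discards most entries, so the more expensive distance calculation runs only on the few that survive. The trade-off is that a good match is lost if the first pass rejects it.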
When choosing an algorithm, several criteria will influence your choice. For example, what kinds of mismatch are you attempting to recover from? Are you trying to recover from typing errors? Or are you trying to find ‘sound-alike’ or ‘look-alike’ strings? Do the users of the system all speak the same language, or does the method need to be language independent? Do the results need to be ranked in order of similarity? How many strings will the algorithm have to compare, and how fast must it run? Is a two-pass mechanism appropriate or necessary?
Lastly, you might imagine that this area of computing has been so well explored that the best algorithms have already been found and are well known. However, it is still an active research area, and I wouldn’t be at all surprised if the teams at Google are working on novel approximate string-matching algorithms right now!