Home arrow Development Cycles arrow Page 7 - Tame the Beast by Matching Similar Strings

Tame the Beast by Matching Similar Strings

My interest in string similarity stems from a desire for good user interface design. Computers are seen by many as unfriendly, unforgiving beasts that respond unkindly to requests that are almost meaningful. In this article, I demonstrate how computers can be programmed to be more forgiving of their users’ mistakes, with no additional burden on the user such as learning a special query format. Moreover, the techniques described are very widely applicable and often easy to implement.

Author Info:
By: Simon White
Rating: 5 stars5 stars5 stars5 stars5 stars / 39
February 09, 2004
  1. · Tame the Beast by Matching Similar Strings
  2. · Equivalence Methods
  3. · Synonyms and Regular Expressions
  4. · The Soundex Algorithm
  5. · Similarity Ranking Methods
  6. · Editing and Hamming Distances
  7. · Conclusions

print this article

Tame the Beast by Matching Similar Strings - Conclusions
(Page 7 of 7 )

Many information retrieval systems use a string-based query, often just a single string, to find the information of interest to its user. The ability of the system to find relevant information based on the user’s input is key to a successful system. This ability can be significantly enhanced by employing an approximate string matching algorithm (which need not be invoked until it is known that no exact matches exist). Conversely, failure to find relevant information (particularly when the user knows it to be present) serves only to frustrate, and perpetuate the myth of the cantankerous computer.

I described the algorithms in two classes: equivalence methods and similarity ranking methods. Equivalence methods return a Boolean result, whereas the similarity ranking methods return a numeric similarity measure or distance metric. In information retrieval systems, it is possible to mix methods to produce a faster hybrid approach. A typical approach is to employ a two-pass mechanism in which an equivalence method is used by the database as a first pass filter, and a ranked similarity method is applied to the filtered entries for the second pass. Ranked similarity methods tend to be algorithmically more complex than equivalence methods, so are usually implemented as custom code outside of the database.

When choosing an algorithm to use, there are several criteria that will influence your choice of algorithm. For example, what kinds of mismatch are you attempting to recover from? Are you trying to recover from typing errors? Or are you trying to find ‘sound-alike’ or look-alike strings? Do the users of the system all speak the same language, or does the method need to be language independent? Do the results need to be ranked in order of similarity? How many strings will the algorithm have to compare, and how fast must it run? Is a two-pass mechanism appropriate or necessary?

Lastly, you might imagine that this area of computing has been so well explored that the best algorithms have already been found and are well-known. However, it is still a research area and I also wouldn’t be at all surprised if the teams at Google are working on novel approximate string-matching algorithms right now!

DISCLAIMER: The content provided in this article is not warranted or guaranteed by Developer Shed, Inc. The content provided is intended for entertainment and/or educational purposes in order to introduce to the reader key ideas, concepts, and/or product reviews. As such it is incumbent upon the reader to employ real-world tactics for security and implementation of best practices. We are not liable for any negative consequences that may result from implementing any information covered in our articles or tutorials. If this is a hardware review, it is not recommended to open and/or modify your hardware.

blog comments powered by Disqus

- Division of Large Numbers
- Branch and Bound Algorithm Technique
- Dynamic Programming Algorithm Technique
- Genetic Algorithm Techniques
- Greedy Strategy as an Algorithm Technique
- Divide and Conquer Algorithm Technique
- The Backtracking Algorithm Technique
- More Pattern Matching Algorithms: B-M
- Pattern Matching Algorithms Demystified: KMP
- Coding Standards
- A Peek into the Future: Transactional Memory
- Learning About the Graph Construct using Gam...
- Learning About the Graph Construct using Gam...
- Learning About the Graph Construct using Gam...
- How to Strike a Match

Watch our Tech Videos 
Dev Articles Forums 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us 
Weekly Newsletter
Developer Updates  
Free Website Content 
Contact Us 
Site Map 
Privacy Policy 

Developer Shed Affiliates


© 2003-2018 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap
Popular Web Development Topics
All Web Development Tutorials