Development Cycles
  Home arrow Development Cycles arrow Page 3 - Tame the Beast by Matching Similar Strings
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Mobile Linux 
App Generation ROI 
IBM® developerWorks 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
DEVELOPMENT CYCLES

Tame the Beast by Matching Similar Strings
By: Simon White
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 4 stars4 stars4 stars4 stars4 stars / 27
    2004-02-09

    Table of Contents:
  • Tame the Beast by Matching Similar Strings
  • Equivalence Methods
  • Synonyms and Regular Expressions
  • The Soundex Algorithm
  • Similarity Ranking Methods
  • Editing and Hamming Distances
  • Conclusions

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    Tame the Beast by Matching Similar Strings - Synonyms and Regular Expressions


    (Page 3 of 7 )

    Synonyms

    In this approach, synonyms of expected inputs are stored explicitly.  For example, with the string ‘television’, you might also store ‘TV’ and ‘televisions’; and with the string ‘license’ you might also store ‘licence’. As with word stemming, user inputs are converted to a canonical form before any further processing.

    This mechanism can provide a forgiving user-interface, and is also language independent. Unfortunately, it does mean that many synonym strings must be prepared in advance to anticipate every possible user input. You could argue that the approach simply increases the number of expected inputs, rather than providing a better algorithm to find the strings that are of real interest. However, if you reconsider the problem of retrieving product descriptions from a database, you should see that there are other advantages. Firstly, as synonyms are resolved close to the user-interface, you can index your products in the system’s back end using a small, controlled keyword vocabulary. Secondly, the architecture provides a clean separation between these user-interface concerns and the integrity of the data.

    Wildcards and Regular Expressions

    When you search for a file on your hard disk, you might use a search pattern with a wildcard such as ‘*.txt’ (anything with the suffix ‘.txt’). In a similar way, applications that perform information retrieval can also employ pattern matching to improve the chances of finding the information of interest. One approach is to expose the full power of regular expressions in the user-interface, but the complex functionality and cryptic syntax usually confuses more than it helps. What we need is a way that harnesses the power of pattern matching without exposing it to the user. One idea is to prepend and append the user’s input with the wild card character, and then use regular expression matching instead of exact matching. This has the effect of searching for all strings that contain the user’s input. Another idea is to take each word (that is, space-separated token) of the input and apply the same wild card prepending and appending. In this case, the input ‘go fish’ would become ‘*go* *fish*’, which matches ‘gone fishing’ as well as ‘go fishing’.

    More Development Cycles Articles
    More By Simon White


     

    DEVELOPMENT CYCLES ARTICLES

    - Division of Large Numbers
    - Branch and Bound Algorithm Technique
    - Dynamic Programming Algorithm Technique
    - Genetic Algorithm Techniques
    - Greedy Strategy as an Algorithm Technique
    - Divide and Conquer Algorithm Technique
    - The Backtracking Algorithm Technique
    - More Pattern Matching Algorithms: B-M
    - Pattern Matching Algorithms Demystified: KMP
    - Coding Standards
    - A Peek into the Future: Transactional Memory
    - Learning About the Graph Construct using Gam...
    - Learning About the Graph Construct using Gam...
    - Learning About the Graph Construct using Gam...
    - How to Strike a Match







    © 2003-2010 by Developer Shed. All rights reserved. DS Cluster 9 Hosted by Hostway
    For more Enterprise Application Development news, visit eWeek