Development Cycles
  Home arrow Development Cycles arrow Page 4 - Tame the Beast by Matching Similar Strings
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Dedicated Servers  
Moblin 
JMSL Numerical Library 
IBM® developerWorks 
Sun Developer Network 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
DEVELOPMENT CYCLES

Tame the Beast by Matching Similar Strings
By: Simon White
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 4 stars4 stars4 stars4 stars4 stars / 26
    2004-02-09

    Table of Contents:
  • Tame the Beast by Matching Similar Strings
  • Equivalence Methods
  • Synonyms and Regular Expressions
  • The Soundex Algorithm
  • Similarity Ranking Methods
  • Editing and Hamming Distances
  • Conclusions

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    Tame the Beast by Matching Similar Strings - The Soundex Algorithm


    (Page 4 of 7 )

    The Soundex algorithm is an attempt to match strings that sound alike. The idea is that you take the two strings of the comparison, map each of them to a new string that represents their phonetics, and then compare those strings for an exact match. The algorithm is only intended to work with English pronunciation, and there are plenty of counter-examples, even in English, where it doesn’t work. However, it is easy to implement and, even better, is already available as a pre-programmed function in the Oracle Database Management System. There’s also a good chance that you are able to find an implementation in your favorite programming language by a quick web search.

    The algorithm works as follows. When mapping the original strings to their phonetic strings, the first letter is always retained, and the rest of the string is processed in a left to right fashion. The subsequent letters of the string are compressed to a three digit code according to the scheme shown in Table 1. Since the first letter is always retained, the algorithm always generates a 4 digit string. The code ‘0’ is used as padding if there are not enough letters in the input string, and any excess letters are disregarded.

    LetterPhonetic Code
    B,F,P,V1
    C,G,J,K,Q,S,X,Z2
    D,T3
    L4
    M,N5
    R6
    A,E,I,O,U,Y,H,Wnot coded

    Table 1: Phonetic Codes in the Soundex Algorithm

    For example, the strings ‘LICENCE’, ‘LICENSE’ and ‘LICENSING’ all map to the same Soundex string, ‘L252’. Additionally,

    1. adjacent pairs of the same consonant are treated as one
    2. adjacent consonants from the same code group are treated as one
    3. a consonant immediately following an initial letter from the same code group is ignored
    4. consonants from the same code group separated by W or H are treated as one

    The Soundex algorithm is interesting because it addresses the pronunciation of words, rather than raw lexical similarity. Its main drawbacks are that it is language dependent, and there are many examples of similar strings that nevertheless produce different Soundex codes. And of course it only provides for comparisons of alphabetic characters - anything outside of the range ‘A’-‘Z’ will simply be ignored.

    The Soundex algorithm is also very old (it is documented in Donald Knuth’s “The Art of Computer Programming", from 1973, but attributed to 1918 and 1922 U.S. Patents by Margaret K. Odell and Robert C. Russell). A more recent attempt at the same problem, called MetaPhone, dates from 1990 and allegedly gives better results. There is a description of MetaPhone on the web, and you can also test the algorithm online against databases of names and place names.

    More Development Cycles Articles
    More By Simon White


     

    DEVELOPMENT CYCLES ARTICLES

    - More Pattern Matching Algorithms: B-M
    - Pattern Matching Algorithms Demystified: KMP
    - Coding Standards
    - A Peek into the Future: Transactional Memory
    - Learning About the Graph Construct using Gam...
    - Learning About the Graph Construct using Gam...
    - Learning About the Graph Construct using Gam...
    - How to Strike a Match
    - Entity Relationship Modeling
    - Tame the Beast by Matching Similar Strings
    - 5 Web Design Tips You Can't Live Without
    - Practising Best Practises in Your Software D...
    - The Art of Modelling: Part 1
    - Thoughts on the Craft of Programming: Abstra...
    - Hi 5: Part 4







    © 2003-2008 by Developer Shed. All rights reserved. DS Cluster 5 hosted by Hostway