Development Cycles
  Home arrow Development Cycles arrow Page 6 - Tame the Beast by Matching Similar Strings
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Moblin 
JMSL Numerical Library 
IBM® developerWorks 
Sun Developer Network 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
DEVELOPMENT CYCLES

Tame the Beast by Matching Similar Strings
By: Simon White
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 4 stars4 stars4 stars4 stars4 stars / 27
    2004-02-09

    Table of Contents:
  • Tame the Beast by Matching Similar Strings
  • Equivalence Methods
  • Synonyms and Regular Expressions
  • The Soundex Algorithm
  • Similarity Ranking Methods
  • Editing and Hamming Distances
  • Conclusions

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    Tame the Beast by Matching Similar Strings - Editing and Hamming Distances


    (Page 6 of 7 )

    Edit Distance

    This method focuses on the most common typing errors, namely character omissions, insertions, substitutions and reversals. The idea is to compute the minimum number of such operations that it would take to transform one string into another. This number gives an indication of the similarities of the strings. A value of 0 indicates that the two strings are identical

    The algorithm can be described more generally by associating a cost with each of the operations, and deriving the distance between two strings as the minimum cost that transforms one string into another. There are two widely recognized variations of the edit distance. The Levenshtein Edit Distance is the most common variation and allows insertion, deletion or substitution of a single character where the cost of each operation is 1. The Damerau Edit Distance is identical to the Levenshtein edit distance, except that it also allows the operation of transposing (swapping) two adjacent characters.

    Implementations of the Levenshtein edit distance can be found on the web.

    Hamming Distance

    Although I do not recommend the Hamming distance for the majority of string-based information retrieval tasks, I should mention it for completeness. The Hamming distance between two character strings is the number of positions in which the characters of the two strings are different. So, for example, the distance between ‘REPAIR’ and ‘REPOSE’ is 3, whereas the distance between ‘WORK’ and ‘REST’ is 4. (According to the literature, strings of different lengths have an infinite Hamming distance between them, so when comparing strings of different lengths, you may decide to ‘cheat’ by padding the shorter of the two strings on the right with extra spaces before comparing the strings.)

    The problem with this metric is that apparently very similar strings can be given a high Hamming distance. Consider that there are no two pairs of characters that match positionally in the following two strings:

    1. ‘THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG’
    2. ‘MY QUICK BROWN FOX JUMPED OVER THE LAZY DOGS’

    More Development Cycles Articles
    More By Simon White


     

    DEVELOPMENT CYCLES ARTICLES

    - Genetic Algorithm Techniques
    - Greedy Strategy as an Algorithm Technique
    - Divide and Conquer Algorithm Technique
    - The Backtracking Algorithm Technique
    - More Pattern Matching Algorithms: B-M
    - Pattern Matching Algorithms Demystified: KMP
    - Coding Standards
    - A Peek into the Future: Transactional Memory
    - Learning About the Graph Construct using Gam...
    - Learning About the Graph Construct using Gam...
    - Learning About the Graph Construct using Gam...
    - How to Strike a Match
    - Entity Relationship Modeling
    - Tame the Beast by Matching Similar Strings
    - 5 Web Design Tips You Can't Live Without






    © 2003-2008 by Developer Shed. All rights reserved. DS Cluster 2 hosted by Hostway
    Stay green...Green IT