Development Cycles
  Home arrow Development Cycles arrow Page 2 - How to Strike a Match
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Mobile Linux 
App Generation ROI 
IBM® developerWorks 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
DEVELOPMENT CYCLES

How to Strike a Match
By: Simon White
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 5 stars5 stars5 stars5 stars5 stars / 35
    2004-04-07

    Table of Contents:
  • How to Strike a Match
  • The New Metric
  • A Real World Example
  • A Java Implementation
  • Finishing the Java Implementation
  • Summary

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    How to Strike a Match - The New Metric


    (Page 2 of 6 )

    Given the drawbacks of the existing algorithms, I decided to invent a new string similarity metric that rewards both common substrings and a common ordering of those substrings. In addition, I wanted the algorithm to consider not only the single longest common substring, but other common substrings too.

    The requirements may sound difficult to satisfy, but my solution is deceptively simple:

    Find out how many adjacent character pairs are contained in both strings.

    The intention is that by considering adjacent characters, I take account not only of the characters, but also of the character ordering in the original string, since each character pair contains a little information about the original ordering.

    Let me explain the algorithm by comparing the two strings ‘France’ and ‘French’. First, I map them both to their upper case characters (making the algorithm insensitive to case differences), then split them up into their character pairs:

    FRANCE: {FR, RA, AN, NC, CE}
    FRENCH: {FR, RE, EN, NC, CH}

    Then I work out which character pairs are in both strings. In this case, the intersection is {FR, NC}. Now, I would like to express my finding as a numeric metric that reflects the size of the intersection relative to the sizes of the original strings. If pairs(x) is the function that generates the pairs of adjacent letters in a string, then my numeric metric of similarity is:

    String Matching Algorithm

    The similarity between two strings s1 and s2 is twice the number of character pairs that are common to both strings divided by the sum of the number of character pairs in the two strings. (The vertical bars in the formula mean ‘size of’.) Note that the formula rates completely dissimilar strings with a similarity value of 0, since the size of the letter-pair intersection in the numerator of the fraction will be zero. On the other hand, if you compare a (non-empty) string to itself, then the similarity is 1. For our comparison of ‘FRANCE’ and ‘FRENCH’, the metric is computed as follows: 

    String Matching Algorithm

    Given that the values of the metric always lie between 0 and 1, it is also very natural to express these values as percentages. For example, the similarity between ‘FRANCE’ and ‘FRENCH’ is 40%. From now on, I will express similarity values as percentages, rounded to the nearest whole number.

    Ranking Results

    Typically, we don’t just want to know how similar two strings are. We want to know which of a set of known strings are most similar to a particular string. For example, which of the strings ‘Heard’, ‘Healthy’, ‘Help’, ‘Herded’, ‘Sealed’ or ‘Sold’ is most similar to the string ‘Healed’?

    All we need do to answer the question is find the similarity between ‘Healed’ and each of the other words, and then rank the results in order of these values. The results for this example are presented in Table 1.

    WordSimilarity
    Sealed80%
    Healthy55%
    Heard44%
    Herded40%
    Help25%
    Sold0%

    Table 1: Find the word most similar to "Healed"

    More Development Cycles Articles
    More By Simon White


     

    DEVELOPMENT CYCLES ARTICLES

    - Division of Large Numbers
    - Branch and Bound Algorithm Technique
    - Dynamic Programming Algorithm Technique
    - Genetic Algorithm Techniques
    - Greedy Strategy as an Algorithm Technique
    - Divide and Conquer Algorithm Technique
    - The Backtracking Algorithm Technique
    - More Pattern Matching Algorithms: B-M
    - Pattern Matching Algorithms Demystified: KMP
    - Coding Standards
    - A Peek into the Future: Transactional Memory
    - Learning About the Graph Construct using Gam...
    - Learning About the Graph Construct using Gam...
    - Learning About the Graph Construct using Gam...
    - How to Strike a Match







    © 2003-2009 by Developer Shed. All rights reserved. DS Cluster 3 Hosted by Hostway
    Stay green...Green IT