Home arrow Development Cycles arrow Page 2 - How to Strike a Match
DEVELOPMENT CYCLES

How to Strike a Match


In a previous article, Tame the Beast by Matching Similar Strings, I presented a brief survey of approximate string matching algorithms, and argued their importance for information retrieval tasks. A classic example of information retrieval using similarity searching is entering a keyword into the search string box on Amazon’s web site in order to retrieve descriptions of products related to that keyword. Approximate string matching algorithms can be classified as equivalence algorithms and similarity ranking algorithms. In this article, I present a new similarity ranking algorithm, together with its associated string similarity metric. I also include Java source code, so you can easily incorporate the algorithm into your own applications.

Author Info:
By: Simon White
Rating: 5 stars5 stars5 stars5 stars5 stars / 39
April 07, 2004
TABLE OF CONTENTS:
  1. · How to Strike a Match
  2. · The New Metric
  3. · A Real World Example
  4. · A Java Implementation
  5. · Finishing the Java Implementation
  6. · Summary

print this article
SEARCH DEVARTICLES

How to Strike a Match - The New Metric
(Page 2 of 6 )

Given the drawbacks of the existing algorithms, I decided to invent a new string similarity metric that rewards both common substrings and a common ordering of those substrings. In addition, I wanted the algorithm to consider not only the single longest common substring, but other common substrings too.

The requirements may sound difficult to satisfy, but my solution is deceptively simple:

Find out how many adjacent character pairs are contained in both strings.

The intention is that by considering adjacent characters, I take account not only of the characters, but also of the character ordering in the original string, since each character pair contains a little information about the original ordering.

Let me explain the algorithm by comparing the two strings ‘France’ and ‘French’. First, I map them both to their upper case characters (making the algorithm insensitive to case differences), then split them up into their character pairs:

FRANCE: {FR, RA, AN, NC, CE}
FRENCH: {FR, RE, EN, NC, CH}

Then I work out which character pairs are in both strings. In this case, the intersection is {FR, NC}. Now, I would like to express my finding as a numeric metric that reflects the size of the intersection relative to the sizes of the original strings. If pairs(x) is the function that generates the pairs of adjacent letters in a string, then my numeric metric of similarity is:

String Matching Algorithm

The similarity between two strings s1 and s2 is twice the number of character pairs that are common to both strings divided by the sum of the number of character pairs in the two strings. (The vertical bars in the formula mean ‘size of’.) Note that the formula rates completely dissimilar strings with a similarity value of 0, since the size of the letter-pair intersection in the numerator of the fraction will be zero. On the other hand, if you compare a (non-empty) string to itself, then the similarity is 1. For our comparison of ‘FRANCE’ and ‘FRENCH’, the metric is computed as follows: 

String Matching Algorithm

Given that the values of the metric always lie between 0 and 1, it is also very natural to express these values as percentages. For example, the similarity between ‘FRANCE’ and ‘FRENCH’ is 40%. From now on, I will express similarity values as percentages, rounded to the nearest whole number.

Ranking Results

Typically, we don’t just want to know how similar two strings are. We want to know which of a set of known strings are most similar to a particular string. For example, which of the strings ‘Heard’, ‘Healthy’, ‘Help’, ‘Herded’, ‘Sealed’ or ‘Sold’ is most similar to the string ‘Healed’?

All we need do to answer the question is find the similarity between ‘Healed’ and each of the other words, and then rank the results in order of these values. The results for this example are presented in Table 1.

WordSimilarity
Sealed80%
Healthy55%
Heard44%
Herded40%
Help25%
Sold0%

Table 1: Find the word most similar to "Healed"


blog comments powered by Disqus
DEVELOPMENT CYCLES ARTICLES

- Division of Large Numbers
- Branch and Bound Algorithm Technique
- Dynamic Programming Algorithm Technique
- Genetic Algorithm Techniques
- Greedy Strategy as an Algorithm Technique
- Divide and Conquer Algorithm Technique
- The Backtracking Algorithm Technique
- More Pattern Matching Algorithms: B-M
- Pattern Matching Algorithms Demystified: KMP
- Coding Standards
- A Peek into the Future: Transactional Memory
- Learning About the Graph Construct using Gam...
- Learning About the Graph Construct using Gam...
- Learning About the Graph Construct using Gam...
- How to Strike a Match

Watch our Tech Videos 
Dev Articles Forums 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
Contact Us 
Site Map 
Privacy Policy 
Support 

Developer Shed Affiliates

 




© 2003-2017 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap
Popular Web Development Topics
All Web Development Tutorials