
Lord Of The Strings Part 1

I recently enjoyed the latest in the Lord of the Rings trilogy of movies at the cinema. I was intrigued by Tolkien’s invented languages (such as Elvish and Dwarvish) and was curious to know where the languages came from, or more precisely, which real language was the biggest influence on Tolkien for his inventions. As I have been thinking about issues of string similarity recently (see Matching Strings and Algorithms), I wondered whether I could extend my ideas of string similarity to language similarity. In other words, could I discover to which real language Tolkien’s artificial language is most similar?

Author Info:
By: Simon White
March 15, 2004
  1. Lord Of The Strings Part 1
  2. Word Lists and Databases
  3. Word Lists
  4. Loading the Text Files to the DB
  5. Running the Query


Lord Of The Strings Part 1 - Word Lists and Databases

Although I had an existing implementation of the string similarity metric and a good idea of the basic approach, this was a truly investigative project. I didn’t know what the outcome would be, and I knew there would be problems to solve along the way. But then, that’s what makes it so pleasing when you do get a result. In this article, I explain the first part of my investigation: how I obtained the word lists that enabled me to do the analysis, how I cleaned them up, and how I represented them in a database.

Acquiring and Cleaning the Word Lists

A quick Google search suggested that the data sources I needed were available: I found suitable word lists at phreak.org and cotse.com, including a list of Tolkien’s invented words.

After downloading a number of these word lists, I found that several needed cleaning before I could use them. I wanted each file to be a list of words, formatted as one word per line, and this was not the case with several of the downloaded files. For these basic file manipulation and formatting tasks, I found the speed and flexibility of the Unix-style bash shell invaluable. The tasks were as follows:

  1. Using a text editor, I deleted any explanatory text or comments from the tops of the files.
  2. Each line of the Hungarian word list held a word followed by a number. I stripped the numbers from this file using the simple awk script ‘{print $1}’, which prints only the first whitespace-separated field of each line.
  3. I sorted the files, and then used a text editor to remove non-alphabetic words, such as ‘>=’.
  4. I combined different word lists for the same language. For example, there were multiple word lists for English, so I simply appended one list onto the end of the other:
    cat englex-dict.txt >> english.txt
  5. I removed duplicates from all of the word lists. It is easy to do this programmatically in the bash shell. For example:
    cat english.txt | sort | uniq > new-english.txt
  6. I made sure that all the files used the same line termination sequence. (Text files created under Windows use two characters, Carriage Return and Line Feed, to signify the end of a line, whereas Unix uses just a Line Feed.) As I was using the bash shell, it was easiest to convert all the files to Unix file format using:
    dos2unix *.txt
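
Several of the steps above can be chained into a single pipeline. The following is a minimal sketch rather than the exact commands used for this article: the function name clean_wordlist is hypothetical, tr stands in for dos2unix, and the filter assumes that every real word contains at least one ASCII letter, which may not hold for every language or file encoding.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a raw word list on stdin and writes a
# cleaned list (one word per line, sorted, no duplicates) to stdout.
clean_wordlist() {
    tr -d '\r' |                                  # drop carriage returns from Windows line endings
    awk 'NF && $1 ~ /[A-Za-z]/ { print $1 }' |    # keep the first field if it contains an ASCII letter
    LC_ALL=C sort -u                              # sort by byte value and remove duplicates
}
```

It could be applied to a file with something like clean_wordlist < hungarian-raw.txt > hungarian.txt, where both file names are placeholders. Explanatory text at the top of a file still has to be removed by hand, since the pipeline cannot tell a comment from a list of words.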
