Lord Of The Strings Part 1 - Word Lists and Databases
(Page 2 of 5 )
Although I had an existing implementation of the string similarity metric and a good idea of the basic approach, this was a truly investigative project. I didn’t know what the outcome was going to be, and I knew there would be some problems to solve along the way. But then, that’s what makes it so pleasing when you do get a result. In this article, I explain the first part of my investigation — how I obtained the word lists that enabled me to do the analysis, how I processed them to clean them up, and how I represented the word lists in a database.
Acquiring and Cleaning the Word Lists
A quick Google search led me to believe that I should be able to get the data sources I needed to do the investigation – I found suitable word lists at phreak.org and cotse.com, including a list of Tolkien’s invented words.
After downloading a number of these word lists, I found that they needed some cleaning before I could use them. I wanted each file to be a list of words, formatted as one word per line, and several of the downloaded files did not follow that format. For these basic file manipulation and formatting tasks, I found the speed and flexibility of the Unix-style bash shell invaluable. The tasks were as follows:
- Using a text editor, I deleted any explanatory text or comments from the tops of the files.
- I found that each line of the Hungarian word list held a word followed by a number. I stripped the numbers from this file using the simple awk script ‘{print $1}’, which prints only the first field of each line.
- I sorted the files, and then used a text editor to remove non-alphabetic words, such as ‘>=’.
- I combined different word lists for the same language. For example, there were multiple word lists for English, so I simply appended one list onto the end of the other:
    cat englex-dict.txt >> english.txt
- I removed duplicates from all of the word lists. It is easy to do this programmatically in the bash shell. For example:
    cat english.txt | sort | uniq > new-english.txt
- I made sure that all the files used the same line termination sequence. (Text files created under Windows use two characters, Carriage Return and Line Feed, to signify the end of a line, whereas Unix uses just a Line Feed.)
As I was using the bash shell, it was easiest to convert all the files to Unix file format using:
    dos2unix *.txt
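The cleaning steps listed above could also be chained into a single script. What follows is only a minimal sketch under stated assumptions, not the exact commands used for this investigation: the file names hungarian-raw.txt and hungarian.txt and all of the sample contents are made up for illustration, tr -d '\r' stands in for dos2unix in case that tool is not installed, and sort -u is used as a shorter equivalent of sort | uniq.

```shell
#!/usr/bin/env bash
set -eu

# Work in a scratch directory so the sketch never touches real word lists.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample inputs (NOT the real downloaded files).
# The "Hungarian" sample has a number after each word, Windows line
# endings, a non-alphabetic entry and a duplicate.
printf 'alma 12\r\nkutya 7\r\n>= 3\r\nalma 5\r\n' > hungarian-raw.txt
printf 'apple\nzebra\n' > english.txt
printf 'zebra\nbanana\napple\n' > englex-dict.txt

# 1. tr -d '\r'     : delete carriage returns, leaving bare line feeds
#                     (assumption: stands in for dos2unix; it removes
#                     every CR, not only those before a line feed).
# 2. awk '{print $1}': keep only the first field, dropping the numbers.
# 3. grep -E        : keep lines made only of letters, dropping '>='.
#                     What counts as a letter depends on the locale.
# 4. sort -u        : sort and remove duplicate lines in one step.
tr -d '\r' < hungarian-raw.txt \
  | awk '{print $1}' \
  | grep -E '^[[:alpha:]]+$' \
  | sort -u > hungarian.txt

# Combine two lists for the same language and de-duplicate the result.
cat english.txt englex-dict.txt | sort -u > new-english.txt

# Show the cleaned lists.
cat hungarian.txt new-english.txt
```

Writing each result to a new file, rather than overwriting the input, keeps the original downloads intact if one of the filters turns out to be too aggressive. The alphabetic filter in particular deserves a manual check on lists that contain accented characters, since its behaviour changes with the locale in use.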
More By Simon White