I recently enjoyed the latest in the Lord of the Rings trilogy of movies at the cinema. I was intrigued by Tolkien’s invented languages (such as Elvish and Dwarvish) and was curious to know where the languages came from, or more precisely, which real language was the biggest influence on Tolkien for his inventions. As I have been thinking about issues of string similarity recently (see Matching Strings and Algorithms), I wondered whether I could extend my ideas of string similarity to language similarity. In other words, could I discover to which real language Tolkien’s artificial language is most similar?
Lord Of The Strings Part 1 - Word Lists and Databases
Although I had an existing implementation of the string similarity metric and a good idea of the basic approach, this was a truly investigative project. I didn’t know what the outcome was going to be, and I knew there would be some problems to solve along the way. But then, that’s what makes it so pleasing when you do get a result. In this article, I explain the first part of my investigation: how I obtained the word lists that enabled me to do the analysis, how I processed them to clean them up, and how I represented the word lists in a database.
Acquiring and Cleaning the Word Lists
A quick Google search led me to believe that I should be able to get the data sources I needed to do the investigation. I found suitable word lists at phreak.org and cotse.com, including a list of Tolkien’s invented words.
After downloading a number of these word lists, I found that they needed some cleaning before I could use them. I wanted each file to be a plain list of words, formatted as one word per line, and several of the downloaded files were not in that form. For these basic file manipulation and formatting tasks, I found the speed and flexibility of the Unix-style bash shell invaluable. The tasks were as follows:
Using a text editor, I deleted any explanatory text or comments from the tops of the files.
I found that each line of the Hungarian word list contained a word followed by a number. I stripped the numbers from this file using the simple awk script ‘{print $1}’, which prints only the first whitespace-separated field of each line.
I sorted the files, and then used a text editor to remove non-alphabetic words, such as ‘>=’.
I combined different word lists for the same language. For example, there were multiple word lists for English, so I simply appended one list onto the end of the other: cat englex-dict.txt >> english.txt
I removed duplicates from all of the word lists. It is easy to do this programmatically in the bash shell. For example: cat english.txt | sort | uniq > new-english.txt
I made sure that all the files used the same line termination sequence. (Text files created under Windows use two characters, Carriage Return followed by Line Feed, to signify the end of a line, whereas Unix uses a single Line Feed.)
As I was using the bash shell, it was easiest to convert all the files to Unix file format using: dos2unix *.txt
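The scriptable steps above can be combined into a single pipeline. The following is only a minimal sketch, not the procedure I actually ran: the function name clean_wordlist is hypothetical, tr stands in for dos2unix so that the file is not modified in place, and the grep filter is an assumed automation of the non-alphabetic clean-up that I did by hand in a text editor. Explanatory text at the top of a file still has to be deleted manually first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: clean one raw word list and write the result to stdout.
# Usage: clean_wordlist RAW_FILE > CLEAN_FILE
clean_wordlist() {
    # 1. tr deletes carriage returns, turning DOS line endings into Unix ones.
    # 2. awk keeps only the first whitespace-separated field (drops trailing numbers).
    # 3. grep keeps lines containing at least one letter, which drops entries
    #    such as '>=' as well as blank lines.
    # 4. sort -u sorts and removes duplicates in one step; LC_ALL=C forces a
    #    plain byte-order sort so the result does not depend on the locale.
    tr -d '\r' < "$1" \
        | awk '{print $1}' \
        | grep '[[:alpha:]]' \
        | LC_ALL=C sort -u
}
```

For example, clean_wordlist hungarian-raw.txt > hungarian.txt would produce a sorted, duplicate-free, one-word-per-line file (both file names are illustrative). Note that the grep filter only removes lines with no letters at all, so entries containing apostrophes or hyphens are kept, and that which characters count as letters depends on the locale and the file's encoding.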