I recently enjoyed the latest in the Lord of the Rings trilogy of movies at the cinema. I was intrigued by Tolkien’s invented languages (such as Elvish and Dwarvish) and was curious to know where the languages came from, or more precisely, which real language was the biggest influence on Tolkien for his inventions. As I have been thinking about issues of string similarity recently (see Matching Strings and Algorithms), I wondered whether I could extend my ideas of string similarity to language similarity. In other words, could I discover to which real language Tolkien’s artificial language is most similar?
Lord Of The Strings Part 1 - Word Lists
At this point, I had 15 word lists of different sizes, and a total of over 1.3 million words (the wc command shows the number of lines, words and characters in each file):
Storing the Word Lists in a Database
Now that the word lists had been cleaned, my next aim was to access them from a computer program. Although I could have written a program to access the word lists directly as files, I felt a database would offer considerable flexibility to query the data and analyze the results. I was also worried about the volume of data, and reasoned that the database would help in accessing and managing the word lists efficiently. I didn’t look around much when choosing a database to store the word lists: MySQL was the natural choice because it is fast, flexible and, above all, free. And besides, it was already installed on my computer!
I knew I would need only a single table to store all the word lists in the database. Each row of the table could hold one word together with the language to which it belongs. However, to devise the schema precisely, I needed to find out how many characters to allow per word. A quick bash shell command against the text files told me the lengths of the words in the word lists:
$ cat *.txt | awk '{print length($0)}' | sort -n | uniq
The command first runs an awk script over the text files to print the length of each line, then sorts the lengths numerically, and finally removes duplicates from the output. Using this command, I found that the longest word in the input was 57 characters, so I decided to make the database column that holds the words 60 characters long. The table for storing the words is created as follows:
CREATE TABLE words (
    word    varchar(60),
    lang    enum("DANISH", "DUTCH", "ENGLISH", "FINNISH", "FRENCH", "GERMAN",
                 "HUNGARIAN", "JAPANESE", "LATIN", "NORWEGIAN", "POLISH",
                 "SPANISH", "SWAHILI", "SWEDISH", "TOLKIEN"),
    word_id int(10) NOT NULL auto_increment,
    primary key (word_id),
    index lang_i (lang),
    index word_i (word)
);
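To make the single-table idea concrete, here is a minimal Python sketch of how the word lists might be turned into rows for this table. It is an illustration only, not the loader used for this article. It assumes one word per line, UTF-8 files, and that each file is named after its language (for example english.txt), none of which is stated above. The helper names read_rows and insert_rows are hypothetical.

```python
from pathlib import Path

MAX_WORD_LENGTH = 60  # matches the varchar(60) column in the words table


def read_rows(directory):
    """Yield (word, lang) pairs from every .txt word list in `directory`.

    Assumptions (not from the article): one word per line, UTF-8 encoding,
    and a file name that matches the language, e.g. english.txt.
    """
    for path in sorted(Path(directory).glob("*.txt")):
        lang = path.stem.upper()  # english.txt -> ENGLISH
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                word = line.strip()
                if not word:
                    continue  # skip blank lines
                if len(word) > MAX_WORD_LENGTH:
                    raise ValueError(
                        f"{word!r} in {path.name} exceeds "
                        f"{MAX_WORD_LENGTH} characters"
                    )
                yield word, lang


def insert_rows(cursor, rows, batch_size=1000):
    """Insert rows in batches through a DB-API cursor; return the row count.

    The %s placeholders follow the style used by common MySQL drivers;
    other drivers may expect a different placeholder.
    """
    sql = "INSERT INTO words (word, lang) VALUES (%s, %s)"
    batch, total = [], 0
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            cursor.executemany(sql, batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final, partially filled batch
        cursor.executemany(sql, batch)
        total += len(batch)
    return total
```

With a real connection, the call would look something like insert_rows(connection.cursor(), read_rows("wordlists")) followed by a commit, where "wordlists" is a placeholder directory name. Batching keeps memory use modest with over a million words, and the length check catches any word that would otherwise not fit the 60-character column.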