I recently enjoyed the latest in the Lord of the Rings trilogy of movies at the cinema. I was intrigued by Tolkienís invented languages (such as Elvish and Dwarvish) and was curious to know where the languages came from, or more precisely, which real language was the biggest influence on Tolkien for his inventions. As I have been thinking about issues of string similarity recently (see Matching Strings and Algorithms), I wondered whether I could extend my ideas of string similarity to language similarity. In other words, could I discover to which real language Tolkienís artificial language is most similar?
Lord Of The Strings Part 1 - Loading the Text Files to the DB (Page 4 of 5 )
In addition to the word and its language, there is an identifier (word_id), which is the primary key for the table. Note that I used an enum type for the lang column, since we know we are only going to use a limited set of languages. By using an enum, only one byte need be used to store the value of that column Ė far less data than if Iíd used a varchar. I also added indexes for the lang and word columns to improve execution times for queries that constrain these columns. Now for each language, I loaded the words from the text file into the database table. The following two SQL commands load the word list for Danish; the other languages were loaded similarly.
load data infile
'C:\temp\danish.txt' into table words(word); update words set lang='danish' where lang is null;
Once the word lists were in the database, I carried out one further data cleansing action. I removed any words that were less than three characters long:
> delete from words where length(word)<3; Query OK, 2112 rows affected (4.70 sec)
Checking the Data At this point it is reassuring to run a query to get an overview of the data that we have stored. First, letís check how many words are in the database. As we have one word per row of the database table, thatís the same as counting the number of rows in the table. The following query counts the number of rows, but also stores that value in a variable, called @total.
> select @total:=count(*) as wordcount from words; +-----------+ | wordcount | +-----------+ | 1343410 | +-----------+ 1 row in set (0.00 sec)
Now letís look at the breakdown of the word lists into languages:
> select lang as language, count(word) as wordcount from words group by lang; +-----------+-----------+ | language | wordcount | +-----------+-----------+ | DANISH | 25291 | | DUTCH | 178341 | | ENGLISH | 56355 | | FINNISH | 287231 | | FRENCH | 138168 | | GERMAN | 159989 | | HUNGARIAN | 17818 | | JAPANESE | 115291 | | LATIN | 77049 | | NORWEGIAN | 61679 | | POLISH | 109343 | | SPANISH | 85965 | | SWAHILI | 18363 | | SWEDISH | 12057 | | TOLKIEN | 470 | +-----------+-----------+ 15 rows in set (3.55 sec)