When I saw the latest in the Lord of the Rings trilogy of movies a short while ago, I wondered how Tolkien had invented the artificial languages of Middle-earth. In my previous article, I told of my desire to discover which real language had been the biggest influence on Tolkien's invented ones. As a software developer, I wanted to discover this information algorithmically. My idea was to use my own string similarity algorithm to compare each word from a list of Tolkien words to words from 14 real languages. For each Tolkien word, I would find and record the language containing the word that is (lexically) most similar. The set of most-similar words and the languages from which they came would provide new insights into the influences on Tolkien.
Lord Of The Strings Part 2 - The Size of the Problem
In fact, I opted to write the results back to the database, but not for the reason given above. My real concern was the size of the problem. There were 470 Tolkien words in the database, each of which could potentially be compared against 1.3 million other strings. That is roughly 611 million comparisons in the worst case. Even for a single word, I thought it would surely take a while to compute the most similar string. My idea was to use the persistence of the database to help address the combinatorics of the problem. If I designed the program so that it wrote each result to the database as soon as it was computed, and only chose words to analyze that were not already in the results list, then the computer could be shut down and restarted and the program would still pick up where it left off.
In other words, very little processing time would be lost through a system shutdown, even if the analysis was not complete. This is very desirable behavior, so I created another database table, called 'matches', for storing the results of the similarity comparisons. The table needed to store the word (or its id), its best-matching word, the language of the best-matching word, and the similarity score. I used the following SQL command to create it:
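CREATE TABLE matches (
  word       varchar(60),        -- the word being matched
  word_id    int(10) NOT NULL,   -- id of the word being matched
  best       varchar(60),        -- most similar word found
  lang       enum('DANISH', 'DUTCH', 'ENGLISH', 'FINNISH', 'FRENCH',
                  'GERMAN', 'HUNGARIAN', 'JAPANESE', 'LATIN', 'NORWEGIAN',
                  'POLISH', 'SPANISH', 'SWAHILI', 'SWEDISH', 'TOLKIEN'),  -- language of the best match
  similarity float,              -- similarity score of the best match
  PRIMARY KEY (word_id),
  INDEX word_i (word_id)         -- optional: the primary key already indexes word_id
);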
(Although not strictly necessary, I have included the word and best word in this table as well as in the words table. I realize this duplicates information and runs the risk of the data becoming inconsistent, but it keeps things simple in this article. The normalization and further refinement of the database schema is left as an exercise for the reader. Good luck!)
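For readers who want a starting point for that exercise, here is one possible normalized layout, again sketched in Python with SQLite. It is an assumption, not the schema used in this series: the `words` columns (`id`, `word`, `lang`) and the `best_id` column are hypothetical names. The results table keeps only ids and the score, and a join recovers the strings and the language.

```python
import sqlite3


def build(conn):
    # Hypothetical normalized schema: matches stores ids only.
    conn.executescript("""
        CREATE TABLE words (
            id   INTEGER PRIMARY KEY,
            word TEXT NOT NULL,
            lang TEXT NOT NULL
        );
        CREATE TABLE matches (
            word_id    INTEGER PRIMARY KEY REFERENCES words(id),
            best_id    INTEGER NOT NULL REFERENCES words(id),
            similarity REAL NOT NULL
        );
    """)


def report(conn):
    # Join words twice: once for the matched word, once for its best match.
    return conn.execute("""
        SELECT t.word, b.word, b.lang, m.similarity
        FROM matches m
        JOIN words t ON t.id = m.word_id
        JOIN words b ON b.id = m.best_id
        ORDER BY t.word
    """).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build(conn)
    # Made-up rows and score, for illustration only.
    conn.executemany("INSERT INTO words VALUES (?, ?, ?)",
                     [(1, "mellon", "TOLKIEN"), (2, "melon", "ENGLISH")])
    conn.execute("INSERT INTO matches VALUES (1, 2, 0.9)")
    for row in report(conn):
        print(row)
```

Because each word and its language live in exactly one row of `words`, a correction there is picked up by every report without touching `matches`.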