Lord Of The Strings Part 1 - Word Lists
(Page 3 of 5 )
At this point, I had 15 word lists of different sizes, and a total of over 1.3 million words (the wc command shows the number of lines, words and characters in each file):
$ wc *.txt
25485 25485 259551 danish.txt
178429 178430 1998881 dutch.txt
56553 56553 509773 english.txt
287698 287698 3500749 finnish.txt
138257 138257 1524757 french.txt
160086 160086 2060734 german.txt
18028 18028 172943 hungarian.txt
115506 115506 934652 japanese.txt
77107 77107 850131 latin.txt
61843 61843 589234 norwegian.txt
109861 109861 1022137 polish.txt
86061 86061 850532 spanish.txt
18417 18417 181973 swahili.txt
12146 12146 105192 swedish.txt
470 470 3768 tolkien.txt
1345947 1345948 14565007 total
Storing the Word Lists in a Database
Now that the word lists had been cleaned, my next aim was to access them from a computer program. Although I could have written a program to access the word lists directly as files, I felt a database would offer considerable flexibility to query the data and analyze the results. I was also worried about the volume of data, and reasoned that the database would help in accessing and managing the word lists efficiently. I didn’t look around much when choosing a database to store the word lists — MySQL was the natural choice because it is fast, flexible and above all, free. And besides, it was already installed on my computer!
I knew I would need only a single table to store all the word lists in the database. Each row of the table could hold one word together with the language to which it belongs. However, to devise the schema precisely, I needed to find out how many characters to allow per word. A quick bash shell command against the text files told me the lengths of the words in the word lists:
$ cat *.txt|awk '{print length($0)}'|sort –n|uniq
The command first runs an awk script over the text files to get the lengths of the lines, then performs a numeric sort, and finally removes duplicate lines in the output. Using this command, I found that the longest word in the input was 57 characters, so decided to make the database column to hold the words 60 characters long.
The table for storing the words is created as follows:
CREATE TABLE words
(
word varchar(60),
lang enum("DANISH", "DUTCH", "ENGLISH", "FINNISH", "FRENCH", "GERMAN", "HUNGARIAN", "JAPANESE", "LATIN", "NORWEGIAN", "POLISH", "SPANISH", "SWAHILI", "SWEDISH", "TOLKIEN"),
word_id int(10) NOT NULL auto_increment,
primary key (word_id),
index lang_i (lang),
index word_i (word)
);
Next: Loading the Text Files to the DB >>
More MySQL Articles
More By Simon White