Lord Of The Strings Part 1 - Loading the Text Files to the DB
(Page 4 of 5 )
In addition to the word and its language, there is an identifier (word_id), which is the primary key for the table. Note that I used an enum type for the lang column, since we know we are only going to use a limited set of languages. By using an enum, only one byte need be used to store the value of that column – far less data than if I’d used a varchar. I also added indexes for the lang and word columns to improve execution times for queries that constrain these columns.
Now for each language, I loaded the words from the text file into the database table. The following two SQL commands load the word list for Danish; the other languages were loaded similarly.
load data infile
'C:\temp\danish.txt' into table words(word);
update words set lang='danish' where lang is null;
Once the word lists were in the database, I carried out one further data cleansing action. I removed any words that were less than three characters long:
mysql
> delete from words where length(word)<3;
Query OK, 2112 rows affected (4.70 sec)
Checking the Data
At this point it is reassuring to run a query to get an overview of the data that we have stored. First, let’s check how many words are in the database. As we have one word per row of the database table, that’s the same as counting the number of rows in the table. The following query counts the number of rows, but also stores that value in a variable, called @total.
mysql
> select @total:=count(*) as wordcount from words;
+-----------+
| wordcount |
+-----------+
| 1343410 |
+-----------+
1 row in set (0.00 sec)
Now let’s look at the breakdown of the word lists into languages:
mysql
> select lang as language, count(word) as wordcount
from words group by lang;
+-----------+-----------+
| language | wordcount |
+-----------+-----------+
| DANISH | 25291 |
| DUTCH | 178341 |
| ENGLISH | 56355 |
| FINNISH | 287231 |
| FRENCH | 138168 |
| GERMAN | 159989 |
| HUNGARIAN | 17818 |
| JAPANESE | 115291 |
| LATIN | 77049 |
| NORWEGIAN | 61679 |
| POLISH | 109343 |
| SPANISH | 85965 |
| SWAHILI | 18363 |
| SWEDISH | 12057 |
| TOLKIEN | 470 |
+-----------+-----------+
15 rows in set (3.55 sec)
Next: Running the Query >>
More MySQL Articles
More By Simon White