Developing a Dynamic Document Search in PHP Part 1/2 - The Article
(Page 2 of 5 )
The database consists of three tables: a content Table, keyword Table, and a link table. The content table holds an article’s title and abstract. The keyword table holds keyword. The keyword field is indexed. Lastly, the link table holds the keyword id and content id.
The SQL statements for creating these three tables are shown below:
Content Table:
CREATE TABLE content (
contid mediumint(9) NOT NULL auto_increment,
title text,
abstract longtext,
PRIMARY KEY (contid) ) TYPE=MyISAM;
Keyword Table:
CREATE TABLE keytable (
keyid mediumint NOT NULL auto_increment,
keyword varchar(100) default NULL,
PRIMARY KEY (keyid),
KEY keyword (keyword) ) TYPE=MyISAM;
Link Table:
CREATE TABLE link (
keyid mediumint NOT NULL,
contid mediumint NOT NULL)
TYPE=MyISAM
Preparing the Database
An input interface with a HTML form is created to enter the title and content. After filling in these forms and hitting enter, the title and the abstract are stored in the content table. The generated new content ID is stored in a variable temporarily. The next step is the ‘Upload Engine’ that parses each word in the abstract and processes the whole text. It removes common words like is, was, and, if, so, else, then etc. Each word is then stored in a wordmap array. Each word will only have one entry in the wordmap array.
For every word in the wordmap array, a keyword table is parsed and a match tries to be found. If there is a match, the generated key ID and content ID generated earlier are stored in the link table. If not, the new keyword is inserted in the keyword table. And thus we finished preparing our database.
The code snippet given below explains every step of the program.
Searching the keyword table for every word is a long process. This also reduces the efficiency of the program. To implement this, all of the keywords in the keyword table are stored in an associative array called $allWords. An associative array is one which works on a B-Tree algorithm and is very useful for performing searches. Here is the function:
Function LoadCurrenrWords(){
global $allWords;
$result = mysql_query( "select keyid, keyword from keytable" ) or die( "Error in executing mysql query" );
while ( $row = mysql_fetch_array($result) ) {
$allWords[$row[‘keyword’] = $row[‘keyid’];
}
}
Common Words
$COMMON_WORDS is an associative array that stores an array of words, which are commonly used in the English Language. These words have to be removed while parsing the file.
$COMMON_WORDS=array(“a”=>1, “as”=>1);
You can add as many common words as you like. See source code for full list of common words.
ExtractWords() Function
This function filters words by allowing only alphabetic characters. To implement this, I used a technique called STATE MACHINE that filters the characters.
Alphabetic characters are taken as STATE1 and other characters (Numeric and Special Characters) as STATE0. Initially the machine will be in the STATE0. While parsing letters, if it encounters alphabetic characters, the machine switches to STATE1. Otherwise, it will remain in the same state. As a result we get a word with only alphabetic characters.

function ExtractWords($text){
$STATE0 = 0; //Numeric / Other Characters
$STATE1= 1; //Alpha Characters
$state = $ STATE0;
$wordList = array();
$curWord = "";
for ( $i = 0; $i < strlen($text); ++$i ) {
$ch = $text{$i};
$isAlpha = ctype_alpha( $ch );
if ( $state == $STATE0) {
if ( $isAlpha ) {
$curWord = $ch;
$state = $STATE1;
}
}
else if ( $state == $STATE1) {
if ( $isAlpha ) {
$curWord .= $ch;
}
else {
$wordList[] = strtolower( $curWord );
$state = $ STATE0;
}
}
}
if ( $state == $ STATE1) {
$wordList[] = strtolower( $curWord );
}
return $wordList;
}
As a result we get a list of words stored in an array returned to the calling function.
FilterCommonAndDuplicateWords() Function
This function is called after the ExtractWords() function. It parses filtered words and removes common words like ‘a’,’is’,’was’,’and’, etc. Other words are taken as valid words, their duplicates are removed, and are then stored in an associative array $wordMap. This array is returned to the called function.
function FilterCommonAndDuplicateWords( $wordList ) {
global $COMMON_WORDS;
global $MAX_WORD_LENGTH;
$wordMap = array();
foreach ( $wordList as $word ) {
$len = strlen( $word );
if ( ($len > 1) && ($len < $MAX_WORD_LENGTH) ) {
if ( !$wordMap[$word] ) {
if ( !$COMMON_WORDS[$word] ) {
$wordMap[$word] = 1;
}
}
}
}
Process Form function()
This is the core part of the upload program. After finishing filtering, removing common words and duplicate words, this function is called. Firstly, this function inserts the title and abstract in the content table. The newly generated content Id is stored in the $contented variable. It then updates the keyword and link tables.
For every word in the $wordMap array, if the word is already exists in the keywords table, it inserts the key ID and the content ID into the link table. Conversely, if the word is not found, it inserts the new word into then keyword table. The newly generated key ID is stored in the $keyed variable. It then updates the link table by inserting the key ID and content ID variables into the link table.
function ProcessForm($title ,$body){
global $allWords;
$tempWordList = ExtractWords( $body );
$wordList = FilterCommonAndDuplicateWords($tempWordList);
// insert into content
mysql_query( sprintf( "INSERT INTO content (title, abstract) VALUES ('%s', '%s')",
mysql_escape_string($title), mysql_escape_string($body) ) );
//store the newly generated content id in $contentId
$contentId = mysql_insert_id();
// insert all the new words and links
while(list($word,$val)=each($wordList)) {
$keyId = "";
if ( !$allWords[$word] ) {
mysql_query( sprintf( "INSERT INTO keytable ( keyword ) VALUES ( '%s' )",
mysql_escape_string($word) ) );
$keyId = mysql_insert_id();
$allWords[$word] = $keyId;
}
else {
$keyId = $allWords[$word];
}
// insert the link
mysql_query( sprintf( "INSERT INTO link (keyid, contid) VALUES ( %d, %d )", $keyId, $contentId ) );
}
//End of Processing Form.
}
?>
Next: The Article (contd.) >>
More MySQL Articles
More By Murali Dharan