Home arrow HTML arrow Building A Search Engine
HTML

Building A Search Engine


Have you ever wanted a search engine on your site but didn't know how to do it? If so, read this article and see for yourself how Daniel Solin creates a search engine using PHP.

Author Info:
By: Daniel Solin
Rating: 4 stars4 stars4 stars4 stars4 stars / 111
March 24, 2003
TABLE OF CONTENTS:
  1. · Building A Search Engine
  2. · Summary

print this article
SEARCH DEVARTICLES

Building A Search Engine
(Page 1 of 2 )

A little while ago, I was working on a intranet-site for a mid-size company. As the site grew in both size and popularity, the assigner requested me to extend the site with a search-feature. And, since one of the general criterias on the intranet was that all logic code should be written in-house, using one of the open source engines was not an option.

After about a day's work, the engine was quite complete, and the result actually turned out better than expected. By using a few simple techniques, and with the help of PHP and MySQL, these kind of projects are really not a problem anymore. In this article, I will present to you a cut-down version of the search engine that was developed. I hope this will get you started, and give you a few good ideas how to develop a search-engine of your own. With the example used in this article in hand, you could easily develop an engine that suits your particular needs, with the exact features that you would like to have.

Database Design And Logic

The database for the search engine consist of three table: page, word and occurrence. page holds all web pages that has been indexed, and word holds all words that has been found on the indexed pages in page. The rows in table occurrence consists of references to rows in page and word. Each row representing one occurrence of one particular word on one particular page. The SQL for creating these tables are shown below.

CREATE TABLE page (
page_id int(10) unsigned NOT NULL auto_increment,
page_url varchar(200) NOT NULL default '',
PRIMARY KEY (page_id)
) TYPE=MyISAM;

CREATE TABLE word (
word_id int(10) unsigned NOT NULL auto_increment,
word_word varchar(50) NOT NULL default '',
PRIMARY KEY (word_id)
) TYPE=MyISAM;

CREATE TABLE occurrence (
occurrence_id int(10) unsigned NOT NULL auto_increment,
word_id int(10) unsigned NOT NULL default '0',
page_id int(10) unsigned NOT NULL default '0',
PRIMARY KEY (occurrence_id)
) TYPE=MyISAM;

As you see, while page and word hold actual data, occurrence act only as a reference table. By joining occurrence with page and word, one can find out on which page(s) a certain word is mentioned, as well as how many times it is mentioned. We'll get back to that a little later, however - first we need to populate the database so that we have some content to work with.

Populating The Database

Okay, the database is created and we're ready to feed it with some content. For this, we'll create a PHP-script that takes a user-specified URL, reads the document representing the URL, and creates records in the database based on the words it extracts from the document. Take a look at the listing below.

<?
/*
* populate.php
*
* Script for populating the search-database with words,
* pages and word-occurences.
*/
/* Connect to the database: */
mysql_pconnect("localhost","root","secret")
    or die("ERROR: Could not connect to database!");
mysql_select_db("test");
/* Define the URL that sould be processed: */
$url = $_GET['url'];
if( !$url )
{
    die( "You need to define a URL to process." );
}
else if( substr($url,0,7) != "http://" )
{
    $url = "http://$url";
}
/* Does this URL already have a record in the page-table? */
$result = mysql_query("SELECT page_id FROM page WHERE page_url = \"$url\"");
$row = mysql_fetch_array($result);
if( $row['page_id'] )
{
    /* If yes, use the old page_id: */
    $page_id = $row['page_id'];
}
else
{
    /* If not, create one: */
    mysql_query("INSERT INTO page (page_url) VALUES (\"$url\")");
    $page_id = mysql_insert_id();
}
/* Start parsing through the text, and build an index in the database: */
$fd = fopen($url,"r");
while( $buf = fgets($fd,1024) )
{
    /* Remove whitespace from beginning and end of string: */
    $buf = trim($buf);
  
 /* Try to remove all HTML-tags: */
    $buf = strip_tags($buf);
    $buf = ereg_replace('/&\w;/', '', $buf);
    /* Extract all words matching the regexp from the current line: */
    preg_match_all("/(\b[\w+]+\b)/",$buf,$words);
    /* Loop through all words/occurrences and insert them into the database: */
    for( $i = 0; $words[$i]; $i++ )
    {
        for( $j = 0; $words[$i][$j]; $j++ )
        {
            /* Does the current word already have a record in the word-table? */
            $cur_word = strtolower($words[$i][$j]);
            $result = mysql_query("SELECT word_id FROM word WHERE word_word = '$cur_word'");
            $row = mysql_fetch_array($result);
            if( $row['word_id'] )
            {
                /* If yes, use the old word_id: */
                $word_id = $row['word_id'];
            }
            else
            {
                /* If not, create one: */
                mysql_query("INSERT INTO word (word_word) VALUES (\"$cur_word\")");
                $word_id = mysql_insert_id();
            }
            
/* And finally, register the occurrence of the word: */
            mysql_query("INSERT INTO occurrence (word_id,page_id) VALUES ($word_id,$page_id)");
            print "Indexing: $cur_word<br>";
        }
    }
}
fclose($fd);
?>

Basically, this script connects to the database, registers the URL (the page) in the database (if it's not already there), starts to retrieve data, uses the preg_match_all()-function to extract the words from the page, and then creates a record in the occurrence-table and/or the word-table for the currently processed word. So, for example, if the script finds the word 'linux' on http://www.onlamp.com, it will execute the following INSERT-statements:

INSERT INTO page (page_url) VALUES ("http://www.onlamp.com");
INSERT INTO word (word_word) VALUES ("linux");
INSERT INTO occurrence (word_id,page_id) VALUES ($word_id,$page_id);

However, this is only true if http://www.onlamp.com has not been indexed yet, and that this occurence of 'linux' is the first one. If 'linux' occurs once more further down on the page, the two first statements will not get executed, and the 'old' page_id and word_id will be used again.

Let's now index a few pages. The seven sites that makes up the O'Reilly Network is probably a good idea. So, call populate.php with your browser using the site URLs as the only argument, one at a time:

http://localhost/populate.php?url=http://www.macdevcenter.com
http://localhost/populate.php?url=http://www.onjava.com
http://localhost/populate.php?url=http://www.onlamp.com
http://localhost/populate.php?url=http://www.openp2p.com
http://localhost/populate.php?url=http://www.osdir.co
http://localhost/populate.php?url=http://www.perl.com
http://localhost/populate.php?url=http://www.xml.com

A quick investigation of the tables now should result in something like this:

mysql> SELECT * FROM page;

 page_id  page_url 
1
2
3
4
5
6
7
http://www.macdevcenter.com
http://www.onjava.com
http://www.onlamp.com
http://www.openp2p.com 
http://www.osdir.com
http://www.perl.com
http://www.xml.com

7 rows in set (0.00 sec)

mysql> SELECT COUNT(*) FROM word;

 COUNT(*)
 2423

1 row in set (0.00 sec)

mysql> SELECT COUNT(*) FROM occurrence;

 COUNT(*)
 20628

1 row in set (0.00 sec)

There you go - the database now have indexed seven pages, resulting in 2423 unique words and 20628 occurrences. We're ready for the next-level - the search-interface.

The Search Interface

Of course, users of the search-engine will not be able to work with the MySQL database directly. Therefore, we'll create another PHP-script that makes it possible to query the database through a HTML-form. This will work just as any other search-engine: the user enters a word in a text-box, hits Enter, and the interface presents a result-page with links to the pages which contains the word that was searched for. In this example, the order in which the pages are presented is settled by the number of times the keyword appears in each document. The search.php-script are listed below

<?
/*
* search.php
*
* Script for searching a datbase populated with keywords by the
* populate.php-script.
*/
print "<html><head><title>My Search Engine</title></head><body>\n";
if( $_POST['keyword'] )
{
    /* Connect to the database: */
    mysql_pconnect("localhost","root","secret")
        or die("ERROR: Could not connect to database!");
    mysql_select_db("test");
    /* Get timestamp before executing the query: */
    $start_time = getmicrotime();
    /* Execute the query that performs the actual search in the DB: */
    $result = mysql_query(" SELECT
                                p.page_url AS url,
                                COUNT(*) AS occurrences
                            FROM
                                page p,
                                word w,
                                occurrence o
                            WHERE
                                p.page_id = o.page_id AND
                                w.word_id = o.word_id AND
                                w.word_word = \"".$_POST['keyword']."\"
                            GROUP BY
                                p.page_id
                            ORDER BY
                                occurrences DESC
                            LIMIT ".$_POST['results'] );
    /* Get timestamp when the query is finished: */
    $end_time = getmicrotime();
    /* Present the search-results: */
    print "<h2>Search results for '".$_POST['keyword']."':</h2>\n";
    for( $i = 1; $row = mysql_fetch_array($result); $i++ )
    {
        print "$i. <a href='".$row['url']."'>".$row['url']."</a>\n";
        print "(occurrences: ".$row['occurrences'].")<br><br>\n";
    }
    /* Present how long it took the execute the query: */
    print "query executed in ".(substr($end_time-$start_time,0,5))." seconds.";
}
else
{
    /* If no keyword is defined, present the search-page instead: */
    print "<form method='post'>Keyword: <input type='text' size='20' name='keyword'>\n";
    print "Results: <select name='results'><option value='5'>5</option>\n";
    print "<option value='10'>10</option><option value='15'>15</option>\n";
    print "<option value='20'>20</option></select>\n";
    print "<input type='submit' value='Search'></form>\n";
}
print "</body></html>\n";
/* Simple function for retrieving the currenct timestamp in microseconds: */
function getmicrotime()
{
    list($usec, $sec) = explode(" ",microtime());
    return ((float)$usec + (float)$sec);
}
?>

There are two possible ways this script can get executed: with or without the keyword-argument. If keyword is defined, the script tries to find that particular word in the database, and presents the result on-screen. Additionally, the script also shows how long it took for the database to come up with the result. If keyword is not defined, the search-page is presented instead. Enter the URL to this script in your browser, and you should see something like Figure 1.

Let's search on the keyword 'linux' then. You should now see something like Figure 2.

As expected, onlamp.com is presented first on the result-page. As stated, this is simply because the keyword 'linux' appears more frequently on this site than on the others. A search for 'java' would probably get onjava.com on the top, and 'xml' would most likely generate most hits on xml.com. Also note that we selected to limit the number of results to five on the search-page, which makes only the five most interesting pages to show up in the result.

Speeding Up The Database

As you also see on the bottom of the result-page, the query took 0.393 seconds to execute. While this may not seem like an incredible long time, it would make the searches unreasonable slow as the database grows. Fortunately, there's a very simple trick you can do to speed this up. Create an index on the word_word-column:

CREATE INDEX word_word_ix ON word (word_word);

This will create an index on the column word_word in the word-table. Since this column is used every time a search is made with our search-engine, this will probably make our engine much faster. Let's try it out - search for the keyword 'linux' again, to see if we gained any performance. See Figure 3.

Nice! 0.028 seconds - a speed-increase of 0.365 seconds, or 1400%. If this engine would handle an average of 1000 queries per hour, this would mean a save in query execution-time of about 144 minutes per day.


blog comments powered by Disqus
HTML ARTICLES

- Does HTML5 Need a Main Element?
- Revisiting the HTML5 vs. Native Debate
- HTML5: Not for Phone Apps?
- HTML5 or Native?
- Job Hunting? Freelancer.com Lists This Quart...
- HTML5 in the News
- Report: HTML5 Mobile Performance Lags
- The Top HTML5 Audio Players
- Top HTML5 Video Tutorials
- HTML5: Reasons to Learn and Use It
- More of the Top Tutorials for HTML5 Forms
- MobileAppWizard Releases HTML5 App Builder
- HTML5 Boilerplate: Working with jQuery and M...
- HTML5 Boilerplate Introduction
- New API Platform for HTML5

Watch our Tech Videos 
Dev Articles Forums 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
Contact Us 
Site Map 
Privacy Policy 
Support 

Developer Shed Affiliates

 




© 2003-2017 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap
Popular Web Development Topics
All Web Development Tutorials