Have you ever wanted to copy a Website for offline browsing or backup purposes? There’s a powerful UNIX tool called wget that can do this, and much more. I’ll review a simple example of using this tool, and discuss some advanced features that are huge timesavers. GNU wget is a free utility which runs under UNIX and Windows. In a nutshell, this program can go out and effectively mirror a Website for local browsing or backup purposes. While it has more powerful features, this article will focus on the basics of the tool.
Website Mirroring With wget - Using wget (Page 2 of 3)
Wget is a command-line utility, so UNIX users must run it from a shell, and Windows users need to open a “command prompt” window. The examples below assume you are using Windows, but the commands apply just as easily to other platforms.
Before we start, a word of caution: keep in mind that when mirroring a Website or FTP site, you are consuming significant resources, both in terms of bandwidth and server processing. Please be considerate when using wget, and make sure you have permission before setting up recurring or intensive mirroring of a large site. Wget does have some features to mitigate the impact on remote servers, namely the -w option discussed in the examples below. Please use your best judgment!
Example 1 – Mirror a Website for Offline Browsing.
This example will mirror a Website, with all images, etc., to your local machine. Keep in mind that any dynamic content will become “static” in the local copy. In this example, I’ll force any non-HTML extensions (.cgi, .asp, etc.) to be written as HTML files, which facilitates local browsing. Note also that this example does not retrieve the source for any scripts or server-side code; the second example will illustrate how to do that.
Ok, let’s try it out. Assuming wget is in your path (if not, you’ll have to cd into the C:\Program Files\wget directory), issue the following commands:
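A command along these lines, built from the options discussed below, will do the job. Note that www.yourdomain.com and the C:\wget_file\example1 destination folder are placeholders; substitute the site you want to mirror and your preferred local folder:

```shell
wget --mirror -w 2 -p --html-extension --convert-links -P C:\wget_file\example1 http://www.yourdomain.com
```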
That’s it! In a few seconds (or minutes, depending on the size of the site and the speed of your connection), you’ll have the site downloaded into a folder named after the site, www.yourdomain.com. Simply open a web browser, choose File -> Open, and browse to the index.html (or appropriate starting page) in the tree that wget just created, in our case C:\wget_file\example1\www.yourdomain.com. This tree is suitable for offline browsing (try it when visiting a client who does not have high-speed network connectivity) or even for burning to CD for posterity.
Now, a brief explanation of the options used:
--mirror: Tells wget to mirror the site. It will recursively follow all links on the site and download all necessary files. On subsequent runs it fetches only files that have changed since the last mirror, which is handy in that it saves download time.
-w: Tells wget to “wait” or pause between requests, in this case for 2 seconds. This is not necessary, but is the considerate thing to do. It reduces the frequency of requests to the server, thus keeping the load down. If you are in a hurry to get the mirror done, you may eliminate this option.
-p: Causes wget to get all required elements for the page to load correctly. Apparently, the mirror option does not always guarantee that all images and peripheral files will be downloaded, so I add this for good measure.
--html-extension: Causes downloaded pages that lack an .html extension to be saved with one. This gives CGI-, ASP-, or PHP-generated pages consistent .html names on disk.
--convert-links: All links are converted so they will work when you browse locally. Otherwise, relative (or absolute) links would not necessarily load the right pages, and style sheets could break as well.
-P (prefix folder): The resulting tree will be placed in this folder. This is handy for keeping different copies of the same site, or keeping a “browsable” copy separate from a mirrored copy.
Note: These files should not be uploaded back to your server – they have been modified for local viewing, and will most likely break your Website if you blindly upload them!
Example 2 – Copy Your Site for Backup Purposes
This example will create a local copy of your site that is suitable for backup purposes. If you already use a Website management tool such as Dreamweaver, this method may retrieve files that tool does not have, such as log files, CGI scripts, and other data files created on the server.
Note: If your site uses a database server, such as MySQL or SQL Server, this method will not back up the actual data. Backing up databases is beyond the scope of this article.
Using the same assumptions from the first example, we’ll create a mirror via FTP. You will need to know your FTP username and password, and the FTP host to access your site files. These are likely the same as you use to update your site via Dreamweaver or GoLive, etc. Here’s the full command (all on one line):
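A sketch of that FTP mirroring command, on one line. The username, password, and host are placeholders for your own FTP credentials, and C:\wget_file\example2 is an assumed destination folder:

```shell
wget --mirror -w 2 -P C:\wget_file\example2 ftp://username:password@ftp.yourdomain.com/
```

Wget accepts credentials embedded in the URL as shown; be aware that the password will be visible in your command history and process list, so use this form only on a machine you trust.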
Once this is done, take a look at the files pulled down. Note, you will likely not be able to browse this as you could with Example 1, since none of the links or files were converted. The options for wget work the same as in Example 1.
To keep this mirror in sync, simply run it by hand every so often, or set up a cron (UNIX) or at (Windows) job to run it on a regular basis.
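As an illustration of the scheduled approach, a crontab entry on UNIX that re-runs the FTP mirror nightly at 2:00 AM might look like this (the destination path, credentials, and host are placeholders):

```shell
# min hour day month weekday  command
0 2 * * * wget --mirror -w 2 -P /home/user/site-backup ftp://username:password@ftp.yourdomain.com/
```

Add it with `crontab -e`. On Windows, the `at` command (or Scheduled Tasks) can run the same wget command on a similar schedule.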
Example 3 – Download a File
While web browsers and FTP programs can handle downloads, wget comes in handy on occasion: it supports resuming interrupted downloads, and the command-line aspect makes it easy to script. Here is a simple example (create the “example3” folder as before):
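A minimal sketch of a single-file download; the URL and file name are placeholders for whatever you want to fetch:

```shell
wget -P C:\wget_file\example3 http://www.yourdomain.com/files/bigfile.zip
```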
To continue an interrupted download, add the “-c” option before the -P. Note that this will only work if the remote server supports resuming; if it does, it can save quite a bit of download time, especially over a slow connection.
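Continuing the placeholder download above, the resumed command would look like this:

```shell
wget -c -P C:\wget_file\example3 http://www.yourdomain.com/files/bigfile.zip
```

If the server supports it, wget picks up the partial file in example3 and requests only the remaining bytes; otherwise it starts over from the beginning.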