Downloading a complete web site in Java

Even if reusability is one of the most (re)used words, we developers often like to write code that has been written many times before, just to see how it is done. That's why I wrote WebCatch, a small program that downloads complete web sites straight to your hard drive for offline browsing.

The first time I needed this kind of program, I started by searching the Internet for what had already been written. I quickly found roughly a hundred applications doing just that, some of them with a lot of features, some of them very basic. Then I decided to write such a tool myself, just to see how difficult it was.

URLConnection makes life easy

And apparently, it was not difficult at all. Java provides developers with a whole set of useful network classes, including everything you need to handle URLs. URL stands for Uniform Resource Locator and can be seen as the address of something on the Internet. A complete description of URLs is beyond the scope of this article; the documentation of the java.net.URL class gives you a good introduction.

Together with the numerous java.io.* classes, which provide facilities for handling streams of data, web connections become really easy.

Java provides you with two classes to handle URLs and high-level connections to a web server: java.net.URL and java.net.URLConnection.

The first one maps URLs and the second (guess what?) is an abstract class that maps connections to URLs.

In general, using URLs is a three-step process: first create the instance of the URL class itself, then connect to that URL in order to physically establish the network connection, and finally read from or write to that URL.

URL.openConnection() is used to open the network connection and returns an instance of a URLConnection-derived class. URLConnection.getInputStream() is used to obtain an input stream from which the data can be read until the end of file (EOF). For our convenience, the URL class also provides a URL.openStream() method which directly returns an opened input stream. I do not use it here, for educational purposes.

Figure 1 shows all this in action.

Figure 1: Basic connection to a web server
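A minimal sketch along those lines (the URL here is just a placeholder) could look like this:

```java
import java.io.*;
import java.net.*;

public class BasicConnection {
    public static void main(String[] args) throws IOException {
        // 1. Create the URL instance (placeholder address)
        URL url = new URL("http://www.example.com/");

        // 2. Open the network connection
        URLConnection connection = url.openConnection();

        // 3. Read the content line by line until end of file and print it
        BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}
```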

Et voila! In fewer than ten lines of code, we connect to a URL and print its content on the screen. The next step is to save the URL's content to a local file. JDK 1.1 provides us with a convenient class for writing character-based streams: java.io.FileWriter. The constructor of this class takes a String representing a filename on the local file system, and the method FileWriter.write(String str) writes a string of characters to that file.

Here is the code to save the URL's content to a file.

Figure 2: Writing to a file
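Again, a minimal sketch of the idea, assuming a placeholder URL and a hypothetical local file name index.html:

```java
import java.io.*;
import java.net.*;

public class SavePage {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.example.com/");    // placeholder URL

        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openConnection().getInputStream()));
        FileWriter out = new FileWriter("index.html");   // hypothetical local file name

        String line;
        while ((line = in.readLine()) != null) {
            out.write(line + "\n");   // write(String) does not add the line separator itself
        }
        out.close();
        in.close();
    }
}
```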

Exploring links on a page

Unfortunately, life is never that easy. Each line of HTML received by this code may contain links to image files, to other documents and so on… I wanted my program to bring back an entire web site, with all its HTML pages and graphics. Therefore, the application needs to parse each line of HTML it receives to find links to other pages, explore those links, and so forth.

This is a recursive algorithm that could go on forever. The code from Figure 1 is part of the following method.
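A sketch of what such a method might look like, using a hypothetical name exploreURL and a hypothetical findLinks() helper (the real parsing is discussed in the next section):

```java
import java.io.*;
import java.net.*;
import java.util.*;

public class Explorer {
    // Recursively explore a page and every document it links to,
    // until the maximum depth is reached.
    void exploreURL(URL url, int iDepth, int iMaxDepth) throws IOException {
        if (iDepth > iMaxDepth) {
            return;                           // deep enough: back up the call stack
        }
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openConnection().getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            // save the line locally, then look for links on it
            for (URL link : findLinks(line, url)) {
                exploreURL(link, iDepth + 1, iMaxDepth);
            }
        }
        in.close();
    }

    // Hypothetical helper: extract the URLs referenced on one line of HTML.
    List<URL> findLinks(String line, URL base) {
        return new ArrayList<URL>();          // see "Parsing HTML" below
    }
}
```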

Here, iDepth represents the level of exploration that the application has reached so far and iMaxDepth the maximum level allowed (given by the user as a command-line parameter).

Each time a link is found on a page, the method is called recursively, as long as iDepth <= iMaxDepth. Once we reach the level where iDepth = iMaxDepth, the program no longer explores the URLs it finds and goes back up the call stack to finish parsing the HTML from the previous levels.

Parsing HTML

The process above might be familiar to some of you, but how can HTML be parsed, and what should we look for?

The second question is the easier one to answer, so let's start with it. The program needs to find any links to images, documents and frames. Table 1 below describes the HTML tags the program needs to search for.

HTML tag                  Description
<A HREF="">               A hyperlink to another document
<FRAME SRC="">            A link to a document inside a frame
<IMG SRC="">              A link to an image file
<BODY BACKGROUND="">      A link to a background image

Table 1: HTML tags indicating a link to follow or an element to retrieve.

Links to other documents must be followed recursively, as explained before. Links to images, however, do not need to be explored; the program just needs to retrieve the image. It uses the same technique as for documents, except that it does not use a BufferedReader (designed for character-based streams) but a simple InputStream, well suited for pure binary streams.

As images may be big files, the program spawns a new thread to retrieve each image asynchronously. This technique allows the program to continue exploring the web site while the images are downloaded separately. A complete discussion of threads is beyond the scope of this article.
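As a rough sketch of both points, assuming a hypothetical ImageFetcher class and target file, an image could be copied byte by byte in its own thread like this:

```java
import java.io.*;
import java.net.*;

// Hypothetical sketch: fetch one image in its own thread, copying raw bytes
// with an InputStream instead of a character-oriented BufferedReader.
class ImageFetcher implements Runnable {
    private final URL url;
    private final File target;

    ImageFetcher(URL url, File target) {
        this.url = url;
        this.target = target;
    }

    public void run() {
        try {
            InputStream in = url.openConnection().getInputStream();
            OutputStream out = new FileOutputStream(target);
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);       // copy the binary data as-is
            }
            out.close();
            in.close();
        } catch (IOException e) {
            System.err.println("Could not retrieve " + url + ": " + e);
        }
    }
}

// Usage: new Thread(new ImageFetcher(imageUrl, new File("logo.gif"))).start();
```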

Now that we know what to look for, how do we look for it? Once again, Java provides us with a useful class: java.io.StreamTokenizer. This class provides methods to break any character stream into tokens, and each time the program finds the right sequence of tokens, it extracts the URL that follows. I will let you examine the JDK documentation and the source code to see exactly how it works.
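To give an idea of the approach, here is a small sketch (not the exact parser used in WebCatch) that uses StreamTokenizer to print every HREF, SRC or BACKGROUND attribute value it encounters:

```java
import java.io.*;

public class LinkScanner {
    // Scan an HTML character stream and print every quoted attribute value
    // that follows an HREF, SRC or BACKGROUND attribute name.
    public static void scan(Reader in) throws IOException {
        StreamTokenizer st = new StreamTokenizer(in);
        st.lowerCaseMode(true);    // report tag and attribute names in lower case
        st.ordinaryChar('/');      // '/' must not start a comment: URLs are full of them

        boolean expectUrl = false;
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD &&
                    (st.sval.equals("href") || st.sval.equals("src")
                     || st.sval.equals("background"))) {
                expectUrl = true;                 // the next quoted string is the URL
            } else if (expectUrl && (st.ttype == '"' || st.ttype == '\'')) {
                System.out.println(st.sval);      // the text between the quotes
                expectUrl = false;
            } else if (st.ttype == '>') {
                expectUrl = false;                // end of the tag
            }
        }
    }

    public static void main(String[] args) throws IOException {
        scan(new StringReader("<A HREF=\"index.html\"><IMG SRC=\"logo.gif\">"));
    }
}
```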

Three more difficulties

Ok, now we are almost there. The program retrieves entire HTML documents, parses the text, explores documents linked to the first one and retrieves images as well.

There are still a few problems left: filename collisions, relative URLs and already-visited URLs.

Filenames

Files on a web site may have the same name under different directories, so the program needs to re-create the directory structure of the original web site to avoid name collisions.

Therefore, the program uses the path part of the URL to create, on the local drive, the directory structure of the visited web site.
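A sketch of the idea, assuming a hypothetical localFileFor() helper and a local root directory chosen by the user:

```java
import java.io.*;
import java.net.*;

public class LocalPath {
    // Map a URL to a file under the local root directory, re-creating the
    // directory structure taken from the path part of the URL.
    static File localFileFor(URL url, File localRoot) {
        String path = url.getFile();                  // e.g. "/docs/images/logo.gif"
        File local = new File(localRoot, path.replace('/', File.separatorChar));
        local.getParentFile().mkdirs();               // create the directories if needed
        return local;
    }
}
```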

Relative URLs

Links inside an HTML document are like a box of chocolates: you never know what you're gonna get! There are three kinds of links inside a document, summarized in Table 2.

Each of those must be handled differently when parsing HTML.

Type of link                        Description
http://host_name/path/filename      Complete URL reference
pathname/filename                   Relative reference, starting from the current directory
/pathname/filename                  Relative reference, starting at the root of the site being visited

Table 2: Different types of references in HTML

When parsing a complete URL reference, the new URL is simply the one found in the HTML. When parsing a relative reference, it should be appended to the path of the current URL. And when parsing a reference relative to the root of the site, it should be appended to the host name part of the original URL.
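As a side note, the two-argument java.net.URL constructor can resolve a reference against the document it was found in; a small sketch with placeholder addresses shows the three behaviours from Table 2:

```java
import java.net.*;

public class ResolveDemo {
    public static void main(String[] args) throws MalformedURLException {
        URL current = new URL("http://host_name/path/page.html");   // placeholder base document

        // Complete reference: used as-is
        System.out.println(new URL(current, "http://other_host/doc.html"));

        // Relative to the current directory
        System.out.println(new URL(current, "images/logo.gif"));
        // -> http://host_name/path/images/logo.gif

        // Relative to the root of the site
        System.out.println(new URL(current, "/index.html"));
        // -> http://host_name/index.html
    }
}
```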

To help us handle URLs, the java.net.URL class provides the methods getProtocol(), getHost() and getFile(), which return respectively a URL's protocol, host name and file name.

But the getFile() method returns the complete file reference, including the path. To handle our three different kinds of links easily, we need a getFile() method that returns just the file name and a getPath() method that returns just the path. My first idea was to subclass the URL class and override the necessary methods. Bad luck! java.net.URL is declared as a final class and therefore cannot be subclassed. In a certain way, that makes sense.

So I created a SmartURL class, which is merely a wrapper around the java.net.URL functionality I needed and which adds some new methods such as SmartURL.getPath() and SmartURL.getFile(), returning respectively only the path part and only the file name part of a URL.
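A wrapper in that spirit might look like the following sketch (the actual SmartURL class in WebCatch may differ):

```java
import java.net.*;

// Sketch of a wrapper around java.net.URL that splits getFile() into a
// path part and a file name part.
public class SmartURL {
    private final URL url;

    public SmartURL(String spec) throws MalformedURLException {
        this.url = new URL(spec);
    }

    public String getProtocol() { return url.getProtocol(); }
    public String getHost()     { return url.getHost(); }

    // Only the path part, e.g. "/docs/images/" for "/docs/images/logo.gif"
    public String getPath() {
        String file = url.getFile();
        int slash = file.lastIndexOf('/');
        return (slash >= 0) ? file.substring(0, slash + 1) : "/";
    }

    // Only the file name part, e.g. "logo.gif"
    public String getFile() {
        String file = url.getFile();
        int slash = file.lastIndexOf('/');
        return (slash >= 0) ? file.substring(slash + 1) : file;
    }
}
```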

Finally, the links in the HTML documents need to be replaced by links to the files on the local file system. When a document is retrieved, all its links still point to the original web site. In order to make the retrieved pages browsable on your local hard drive, all those references need to be replaced by new ones.
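In its simplest form this is a string substitution on each line of HTML, sketched here with a hypothetical rewriteLink() helper:

```java
public class LinkRewriter {
    // Replace one reference found in a line of HTML by the path of its local
    // copy, so that the saved page can be browsed offline.
    static String rewriteLink(String htmlLine, String originalRef, String localPath) {
        int pos = htmlLine.indexOf(originalRef);
        if (pos < 0) {
            return htmlLine;                          // reference not on this line
        }
        return htmlLine.substring(0, pos)
             + localPath
             + htmlLine.substring(pos + originalRef.length());
    }
}
```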

Visited URLs

The last problem I ran into is that HTML documents may contain many references to the same document or image file. It is a real waste of resources and time to visit those documents and images several times, so the program needs to keep a history of the URLs it has already visited. I use the java.util.Hashtable class to store the URLs the program visits and to quickly look up whether a given URL has already been visited or not. I will let you dive into the source code to see how it works.
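The lookup boils down to something like this sketch, assuming a hypothetical markVisited() helper:

```java
import java.util.Hashtable;

public class VisitedSet {
    // Remember every URL already retrieved so that it is fetched only once.
    private final Hashtable<String, Boolean> visited = new Hashtable<String, Boolean>();

    // Returns true the first time a URL is seen, false afterwards.
    boolean markVisited(String url) {
        if (visited.containsKey(url)) {
            return false;                 // already retrieved, skip it
        }
        visited.put(url, Boolean.TRUE);
        return true;
    }
}
```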

Conclusions

WebCatch connects to a URL and parses the entire document looking for links. It explores any link it finds, up to n levels deep. It also retrieves images (using a separate thread for performance reasons) and replaces URLs so that the links keep working on your local hard drive.

The most fun part of the application is the web connection. Java provides developers with a whole set of high-level classes making this job really easy. The most time-consuming task was the handling of the URLs, directories, link replacement and so forth. Java also comes to our help in this area with classes like Hashtable, StreamTokenizer and so on. The whole application is rather small: 6 pages of code and only 13