Web scraping with Java

A few days ago I ran into this problem while trying to extract data from a database. It contained company information: addresses, names, emails, number of employees and such.

Naturally, I was looking for an easy way to get all the information out of the database as quickly and as cheaply as possible (and hopefully legally). But apparently the operators of this database make their money by selling the data in huge chunks to companies and other interested parties. Since buying was not an option (about €20,000 for 100,000 entries), I had to find a different approach.

Fortunately, they allow you to query their database through an HTML form where you can specify the kind of data you want. You can search for companies in a specific industry, with a certain prefix in their name and, of course, located at a specific address or postal code.

The solution was simple. To get all the data out of the database without paying for it, I had to send queries through their form over and over and fetch the results. This could easily be done with a web scraper.
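To make "send queries over and over" a bit more concrete, here is a minimal sketch of one way to plan those queries. It assumes you walk through the database by company-name prefix (one of the search options the form offers) and only covers the letters a to z. The class and method names are made up for illustration, and actually submitting each prefix through the form is left out.

```java
import java.util.ArrayList;
import java.util.List;

public class QueryPlanSketch {

    // Builds every lowercase prefix of the given length,
    // e.g. "aa", "ab", ..., "zz" for length 2.
    static List<String> namePrefixes(int length) {
        List<String> prefixes = new ArrayList<>();
        prefixes.add("");
        for (int i = 0; i < length; i++) {
            List<String> next = new ArrayList<>();
            for (String prefix : prefixes) {
                for (char c = 'a'; c <= 'z'; c++) {
                    next.add(prefix + c);
                }
            }
            prefixes = next;
        }
        return prefixes;
    }

    public static void main(String[] args) {
        // Each prefix would become one form submission.
        List<String> prefixes = namePrefixes(2);
        System.out.println(prefixes.size() + " queries, from "
                + prefixes.get(0) + " to " + prefixes.get(prefixes.size() - 1));
    }
}
```

Names starting with digits, umlauts or other characters would need a larger alphabet, and a longer prefix is only worth it if a single query returns more results than the form is willing to show.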

Since I wanted to use Java for this, I had to find a good library. The website looked a little bit like this (ASP.NET):

I simplified the problem a lot (anything else would just be unreadable…). But as you can see, we have some serious JavaScript work going on here. Basically, you click on “Address” and the div with the address input fields opens. While this is very easy for us humans to understand and do, a web scraper has to be able to understand and execute JavaScript in order to even see the input fields.

If you do a little searching on Google, you will find two libraries that are made for web scraping with Java:

Jaunt:

If you are looking for an extremely fast and lightweight library, you want to go with Jaunt. It is very easy to get (it’s free). Just download it, add the .jar to your project (for example in Eclipse) and you are good to go. You can find an easy example here and the documentation is here.

The only drawback is its complete lack of JavaScript support, which made it useless for my use case.

HtmlUnit:

After some searching I came across this little library. Their website looks like it was developed around 1995, but it does its job just fine. It is a “GUI-Less browser for Java programs” and does exactly that. It goes to the website you specify and interprets it the way a normal web browser would.

The advantage of this is clear: you are able to see and execute JavaScript code and content, and the website won’t identify you as a scraper via your user agent (which you can set explicitly, by the way, alongside other parameters, so it looks like a totally normal “user”).
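As a rough sketch of what picking a user agent can look like: HtmlUnit ships predefined browser profiles, and you pass one to the WebClient constructor. This assumes HtmlUnit 2.x, where the classes live in com.gargoylesoftware.htmlunit (from 3.x on the package is org.htmlunit). The class name and the Accept-Language value are just examples.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;

public class BrowserProfileSketch {

    // Returns the user agent string HtmlUnit sends when it emulates Chrome.
    static String chromeUserAgent() {
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // Extra header sent with every request; the value is only an example.
            webClient.addRequestHeader("Accept-Language", "en-US,en;q=0.9");
            return webClient.getBrowserVersion().getUserAgent();
        }
    }

    public static void main(String[] args) {
        System.out.println(chromeUserAgent());
    }
}
```

No request is sent here; the sketch only shows which identity the client would present.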

In my case I didn’t care about detection or countermeasures against it. I was mainly focused on getting the data as quickly and easily as possible.

Unfortunately, the documentation is pretty bad and you won’t find an easy copy-and-paste working example. Don’t worry though, I am here to help you out. This piece of code will open a connection to a website, get the content and display the title.

Of course, you have to download the jar file and add it to your project first, and then import all the necessary classes.
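A minimal sketch of such a program could look like the following. It assumes HtmlUnit 2.x (package com.gargoylesoftware.htmlunit; from 3.x on it is org.htmlunit), and the URL is just a placeholder, not the database from this post.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class TitleFetcher {

    public static void main(String[] args) throws Exception {
        // Placeholder URL; replace it with the page you want to load.
        String url = "https://example.com/";

        // WebClient is the headless browser; try-with-resources closes it again.
        try (WebClient webClient = new WebClient()) {
            // Many real pages contain scripts that throw errors;
            // keep going instead of aborting on the first one.
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            // Loads the page and runs its JavaScript, like a browser would.
            HtmlPage page = webClient.getPage(url);

            System.out.println(page.getTitleText());
        }
    }
}
```

From the HtmlPage you can then look up elements, click links and fill in forms, which is what you need for a page like the one above.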

So how do you decide which library to use?

Well, you can use HtmlUnit for everything, since it supports JavaScript and therefore works on almost every page (not on Flash sites, but no one likes those anyway…). But if you want to extract data from a very simple, static HTML/XML page, you are better off with Jaunt, since it is easier to use and lighter in terms of both size and the code you need to get results.
