A few days ago I ran into a problem while trying to extract data from a database. It contained company information: addresses, names, emails, number of employees and so on.
Naturally, I was looking for an easy way to get all the information out of the database as quickly and cheaply as possible (and, hopefully, legally). But apparently the operators of this database make their money by selling the data in huge chunks to companies and other interested parties. Since buying was not an option (about €20,000 for 100,000 entries), I had to find a different approach.
Fortunately, they allow you to query the database through an HTML form where you can specify the kind of data you want. You can search for companies in a specific industry, companies whose name starts with a certain prefix and, of course, companies located at a specific address or postal code.
The solution was simple. To get all the data out of the database without paying for it, I had to send queries through the form repeatedly and fetch the results. This can easily be done with a web scraper.
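A minimal sketch of that query loop, using only the JDK. Everything in it is an assumption made for illustration: the five-digit postal code format, the pause between requests, and the names `QueryLoopExample`, `postalCodes` and `fetchAll`. The actual request is passed in as a function, so it can be swapped for whatever scraping library you end up using.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

class QueryLoopExample {

    // Generates zero-padded postal codes in the range [from, to].
    // Assumption: postal codes are five digits long.
    static List<String> postalCodes(int from, int to) {
        List<String> codes = new ArrayList<>();
        for (int i = from; i <= to; i++) {
            codes.add(String.format(Locale.ROOT, "%05d", i));
        }
        return codes;
    }

    // Runs one query per postal code and pauses between requests.
    // The fetch function stands in for the real form submission.
    static List<String> fetchAll(List<String> codes,
                                 Function<String, String> fetch,
                                 long pauseMillis) {
        List<String> results = new ArrayList<>();
        for (String code : codes) {
            results.add(fetch.apply(code));
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                // Stop early but keep the interrupt flag set for the caller.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Placeholder fetcher; a real one would submit the search form.
        Function<String, String> fakeFetch = code -> "results for " + code;
        for (String result : fetchAll(postalCodes(10115, 10117), fakeFetch, 100)) {
            System.out.println(result);
        }
    }
}
```

Splitting the search space by postal code keeps each result set small; searching by name prefix would work the same way with a different generator.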
Since I wanted to use Java for this, I had to find a good library. The form on the website looked a little bit like this (ASP.NET):
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__LASTFOCUS" id="__LASTFOCUS" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/..." />
<input type="text" name="postal-code"/>
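ASP.NET Web Forms pages generally expect hidden fields such as `__VIEWSTATE` to be sent back together with the visible inputs when the form is posted. A browser-like library takes care of this for you, but the idea can be sketched with the JDK alone. The class and method names below are made up for illustration, the regular expression only handles markup laid out like the snippet above (a real HTML parser is more robust), and the postal code `12345` is a placeholder value.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HiddenFieldExample {

    // Assumption: attributes appear in the order type, name, ..., value,
    // as in the form snippet above.
    private static final Pattern HIDDEN_INPUT = Pattern.compile(
            "<input\\s+type=\"hidden\"\\s+name=\"([^\"]*)\"[^>]*?value=\"([^\"]*)\"");

    // Collects the name/value pairs of all hidden inputs, in document order.
    static Map<String, String> hiddenFields(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = HIDDEN_INPUT.matcher(html);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    // URL-encodes a single key or value as UTF-8.
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Builds an application/x-www-form-urlencoded request body.
    static String formBody(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        String html =
                "<input type=\"hidden\" name=\"__EVENTTARGET\" id=\"__EVENTTARGET\" value=\"\" />\n"
              + "<input type=\"hidden\" name=\"__VIEWSTATE\" id=\"__VIEWSTATE\" value=\"/...\" />\n"
              + "<input type=\"text\" name=\"postal-code\"/>";
        Map<String, String> fields = hiddenFields(html);
        fields.put("postal-code", "12345"); // the visible search field
        System.out.println(formBody(fields));
    }
}
```

The resulting string is what would go into the body of the POST request for one search.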
A quick Google search turns up two libraries made for web scraping with Java: Jaunt and HtmlUnit.
If you are looking for an extremely fast and lightweight library, you want to go with Jaunt. It is free and very easy to get: just download it, add the .jar to your project (for example in Eclipse) and you are good to go. You can find an easy example here and the documentation is here.
After some searching I came across this little library, HtmlUnit. Its website looks like it was developed around 1995, but it does its job just fine. It is a “GUI-Less browser for Java programs” and does exactly that: it loads the website you specify and interprets it the way a normal web browser would.
In my case I didn’t care about being detected or about countermeasures against scraping. I was mainly focused on getting the data as quickly and easily as possible.
Unfortunately, the documentation is pretty bad and you won’t find a working example that you can simply copy and paste. Don’t worry though, I am here to help you out. This piece of code opens a connection to a website, fetches the content and displays the title.
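// Simulate a Chrome browser without a GUI
WebClient client = new WebClient(BrowserVersion.CHROME);
// getPage needs a full URL including the protocol, otherwise it throws a MalformedURLException
HtmlPage page = client.getPage("http://www.sahsec.com");
String headline = page.getTitleText();
System.out.println(headline); // display the page title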
Of course, you first have to download the jar file, add it to your project and import all the necessary classes.
So how do you decide which library to use?