Hi,

I have to use Java to build a scraper that takes a product name and searches for it on an e-commerce website. I understand that I will need an external library to parse the HTML page for me, given the link.
But how do I make my code search for a particular product on the site? That is, it would have to type the product name into the search bar and hit search.

I am new to Java, any sort of help would be greatly appreciated.

- Keeda

Unfortunately there is no easy way of doing this. There are two contenders in Java land (assuming you want to actually interact with pages rather than just scrape the raw HTML and ignore the JS part): HtmlUnit and Selenium.

HtmlUnit works pretty well for simple, standards-compliant web pages. I have successfully used it to scrape simple sites like Reddit, but found it of little use on JS-heavy sites. I've also heard of nasty bugs when it hits quirky HTML markup. The lack of generics on methods that return collections means lots of casts, but that's a minor complaint.
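To give you a feel for it, here is a minimal HtmlUnit sketch that loads a page, types into the search box, and submits the form. The URL, form name, and field names are hypothetical placeholders; inspect the target site's HTML to find the real ones, and note that the exact `WebClient` API varies a bit between HtmlUnit versions.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class HtmlUnitSearch {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        // Disable JS if the site works without it; HtmlUnit's JS engine is fragile.
        client.getOptions().setJavaScriptEnabled(false);

        HtmlPage page = client.getPage("http://www.example-shop.com/"); // hypothetical URL
        HtmlForm form = page.getFormByName("search");                   // hypothetical form name
        HtmlTextInput box = form.getInputByName("q");                   // hypothetical field name
        box.type("product name");
        HtmlSubmitInput go = form.getInputByName("submit");             // hypothetical button name
        HtmlPage results = go.click();                                  // page returned by the search

        System.out.println(results.asText());
    }
}
```

You would add the HtmlUnit jar (and its dependencies) to your classpath; the rest is plain Java.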

Selenium, at least for me, was a pain to set up. AFAIK it is mainly aimed at browser-driven web application testing rather than scraping, though it can certainly be used for that. I have used it to automate form filling and clicks, and it worked out well, with shorter code than HtmlUnit. It spawns a real browser instance to run your script, which I'm not a fan of, but if you can live with that, good for you.
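The same search, sketched with Selenium WebDriver. Again, the URL and the field name are hypothetical; use your browser's dev tools to find the real ones on the site you're targeting.

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class SeleniumSearch {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();             // opens a real browser window
        try {
            driver.get("http://www.example-shop.com/");     // hypothetical URL
            WebElement box = driver.findElement(By.name("q")); // hypothetical field name
            box.sendKeys("product name");
            box.submit();                                   // submits the enclosing form
            System.out.println(driver.getTitle());          // title of the results page
        } finally {
            driver.quit();                                  // always close the browser
        }
    }
}
```

Since the driver runs a real browser, any JS on the page executes normally, which is why Selenium copes with JS-heavy sites where HtmlUnit falls over.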

If you are not afraid to ditch Java, I've heard good things about PhantomJS, which uses a headless WebKit browser to do its bidding. Plus you get to write your scraping scripts in JavaScript (not sure whether that's a plus for you).

