Hi, I'm new here. I'm working on my honours project at university, which is about data mining. Before I can do any data mining, I first need data.

I want to extract data from a site: www.tripadvisor.com
Should I write my web crawler in Python? I don't know Python, but I've seen people use it for this.

I don't need the extracted hyperlinks (although I need to extract them in the process); I only want the words (strings) within the pages. Can Python do that? Is there any help available for writing it?


Thank you very much.


Raymond

All 6 Replies

You have a lot of options. If you just want the raw HTML, you can use urllib2 (urllib.request in Python 3); if you'd rather have a module parse out the elements for you and hand back only the text, you'd be better off with BeautifulSoup. Search this forum to find plenty of examples of both.
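
As a rough sketch of the "only the text" option, the snippet below pulls the visible words out of an HTML string. It uses the standard library's html.parser as a stand-in so that it runs without installing anything; with BeautifulSoup the equivalent would be roughly BeautifulSoup(html, "html.parser").get_text(). The names TextExtractor and extract_text and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self.chunks = []       # pieces of visible text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page, not real site content.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Great hotel</h1>"
    "<p>Clean rooms, <a href='/x'>friendly</a> staff.</p></body></html>"
)
print(extract_text(sample))
```

For a live page, the HTML string would come from something like urllib.request.urlopen(url).read() in Python 3, decoded with the page's character encoding.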

Yes, it can grab specific links, but not without instructions: you will have to tell it where to crawl and why. I'm not sure what you mean by the link being encrypted, but to me it looks like it contains valuable information that you could use in your logic, such as "showuserreviews", "place", and "hongkong". One approach is to extract all the links, look for the ones containing these kinds of keywords, and use that to decide where to go next.
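
The keyword idea in this reply could be sketched as follows: collect every href on a page, then keep only the links whose URL contains one of the keywords. This is a minimal standard-library sketch, not a full crawler. The keywords are the ones quoted above; LinkCollector, interesting_links, and the sample links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Keywords taken from the reply above; adjust to whatever the real URLs contain.
KEYWORDS = ("showuserreviews", "place", "hongkong")


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def interesting_links(html, base_url, keywords=KEYWORDS):
    """Return the links in html whose URL contains any keyword (case-insensitive)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return [
        link for link in collector.links
        if any(word in link.lower() for word in keywords)
    ]


# Made-up links for illustration only.
sample = """
<a href="/ShowUserReviews-example-HongKong.html">Review</a>
<a href="/about.html">About</a>
<a href="/Hotels-place-example.html">Hotels</a>
"""
for link in interesting_links(sample, "https://www.tripadvisor.com"):
    print(link)
```

The links that survive the filter would then become the next pages to fetch, while the rest are dropped.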

Try Scrapy.

It's a very simple (though quite powerful) web crawling and screen scraping framework for Python. It's also pretty well documented, and has a growing community.

I've already been able to write it using BeautifulSoup.
Thank you all!
