
Web Scraping Help

Hello,
I am working on a project where I need to visit multiple news websites and collect articles pertaining to stock numbers. My idea so far is to download an RSS feed from somewhere like Google Finance, extract the links from it, follow each one, pull out just the article section, and store it in a database. The problem is that the sites are structured too differently for me to write a single extractor that handles them all. I would like to get a little more advanced with scraping; could someone recommend some Perl modules that would make this easier?

Thanks for the help!
--
Nick

stupidenator
Junior Poster
192 posts since Mar 2005
Reputation Points: 18
Solved Threads: 4
 

Search CPAN for RSS modules. I have no specific recommendations.
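
For the "extract the links from the feed" step, here is a minimal sketch using XML::RSS, which is one of the RSS modules a CPAN search turns up. It is an illustration rather than a recommendation from this thread: the feed_links helper and the sample feed are made up, and a real script would fetch the feed text first.

```perl
use strict;
use warnings;
use XML::RSS;    # CPAN module, not part of core Perl

# Hypothetical helper: return the <link> of every item in an RSS
# document that is passed in as a string.
sub feed_links {
    my ($xml) = @_;
    my $rss = XML::RSS->new;
    $rss->parse($xml);
    return grep { defined $_ } map { $_->{link} } @{ $rss->{items} };
}

# Made-up sample feed so the sketch runs without network access.
my $sample = <<'XML';
<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <description>Placeholder data</description>
    <item><title>First</title><link>http://example.com/a</link></item>
    <item><title>Second</title><link>http://example.com/b</link></item>
  </channel>
</rss>
XML

print "$_\n" for feed_links($sample);
```

The returned URLs are what you would then fetch one by one and hand to whatever extracts the article body.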

KevinADC
Posting Shark
921 posts since Mar 2006
Reputation Points: 246
Solved Threads: 67
 

WWW::Mechanize is the de facto module for scraping and related tasks. Beware, though: if the target site relies on JavaScript, Mechanize will not execute it.

Also see http://www.research.att.com/sw/tools/wsp/

FEAR::API on CPAN is another option.
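
A rough sketch of the WWW::Mechanize suggestion, assuming the module is installed from CPAN. The list_links helper is a made-up name for illustration, and the URL comes from the command line. The script fetches one page and lists its links as absolute URLs. Anything the page builds with JavaScript will be missing, since Mechanize only sees the HTML as served.

```perl
use strict;
use warnings;
use WWW::Mechanize;    # CPAN module, not part of core Perl

# Hypothetical helper: fetch a page and return [absolute URL, link text]
# pairs for every link found in the served HTML.
sub list_links {
    my ($url) = @_;

    # autocheck => 1 makes Mechanize die on a failed request.
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($url);

    # links() returns WWW::Mechanize::Link objects;
    # url_abs() resolves relative hrefs against the page URL.
    return map { [ $_->url_abs, $_->text // '' ] } $mech->links;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    printf "%s\t%s\n", @$_ for list_links( $ARGV[0] );
}
```

Run it as `perl links.pl http://example.com/` (the script name and address are placeholders). For the article text itself, `$mech->content` gives you the raw HTML to pass to an HTML parser.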

trudge
Junior Poster
178 posts since Sep 2007
Reputation Points: 18
Solved Threads: 20
 
