
Generalized Web Scraping

I am an undergraduate student in the Computer Science and Engineering department.

I can build a crawler in Perl for one particular website to fetch the useful information, in my case the job ads on that company's web page.

Now I want to build a generalized crawler in Perl that works for around 100 companies.

How can I do it? I need some ideas, code, or resources. Do I need to study the HTML of all 100 sites?

Regards,
Kunal

hopewemakeit
Newbie Poster
2 posts since May 2009
Reputation Points: 10
Solved Threads: 0
Skill Endorsements: 0

Look into HTML::Parse. It is event driven and tag driven: you write handler functions that are called when a tag opens or closes, and decide in them how to deal with it. Play around with it and see if you can parse HTML more effectively with it.

mitchems
Posting Whiz in Training
295 posts since Feb 2009
Reputation Points: 26
Solved Threads: 38
Skill Endorsements: 0

I apologize in advance for back-posting. I meant HTML::Parser. HTML::Parse is deprecated.
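
To illustrate the event-driven style described above, here is a minimal sketch using HTML::Parser's version 3 handler API. The sample page, the `class="job"` marker, and the handler names `on_start`, `on_text`, and `on_end` are assumptions made up for the example; a real company page will need its own tag and attribute checks.

```perl
use strict;
use warnings;
use HTML::Parser;   # CPAN module, from the HTML-Parser distribution

# Hypothetical sample page; a real crawler would fetch this over HTTP.
my $html = <<'HTML';
<html><body>
<h1>Careers</h1>
<ul>
  <li class="job">Perl Developer</li>
  <li class="job">QA Engineer</li>
  <li class="other">Contact us</li>
</ul>
</body></html>
HTML

my @jobs;          # collected job titles
my $in_job = 0;    # true while we are inside <li class="job">

# Called on every opening tag.
sub on_start {
    my ($tag, $attr) = @_;
    if ($tag eq 'li' && ($attr->{class} // '') eq 'job') {
        $in_job = 1;
        push @jobs, '';
    }
}

# Called for text between tags; text may arrive in several chunks,
# so it is appended to the current title.
sub on_text {
    my ($text) = @_;
    $jobs[-1] .= $text if $in_job;
}

# Called on every closing tag.
sub on_end {
    my ($tag) = @_;
    $in_job = 0 if $tag eq 'li';
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ \&on_start, 'tagname, attr' ],
    text_h  => [ \&on_text,  'dtext' ],
    end_h   => [ \&on_end,   'tagname' ],
);
$parser->parse($html);
$parser->eof;

print "$_\n" for @jobs;
```

With the sample page above, this should print the two titles marked `class="job"` and skip the "Contact us" item.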
mitchems

Hi,
thanks for the post.

I have made crawlers for one website, and each one really depends on that site's job portal and its HTML: which tag opens and closes, and retrieving the data I need from between them.

But I really can't figure this out: there are 100 web pages in front of me, all with different HTML tags, and I need to create a common scraper.

hopewemakeit
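
One possible way to approach the "different tags on every site" problem, sketched here as an assumption rather than anything proposed in the thread: keep a single generic HTML::Parser routine and move the site-specific knowledge into a small rule table. The host names, the rule fields (`tag`, `attr`, `value`), and the function name `extract_jobs` are all hypothetical. Each site's HTML still has to be inspected once to write its rule, but the parsing code itself is shared.

```perl
use strict;
use warnings;
use HTML::Parser;   # CPAN module, from the HTML-Parser distribution

# Hypothetical per-site rules: which tag and attribute value mark a job title.
my %rules = (
    'company-a.example' => { tag => 'li', attr => 'class', value => 'job' },
    'company-b.example' => { tag => 'h3', attr => 'class', value => 'position' },
);

# Generic extractor: the same handlers for every site, driven by that site's rule.
sub extract_jobs {
    my ($site, $html) = @_;
    my $rule = $rules{$site} or die "no rule for $site";

    my @jobs;
    my $inside = 0;   # true while inside an element matching the rule

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            return unless $tag eq $rule->{tag};
            return unless ($attr->{ $rule->{attr} } // '') eq $rule->{value};
            $inside = 1;
            push @jobs, '';
        }, 'tagname, attr' ],
        text_h => [ sub { $jobs[-1] .= $_[0] if $inside }, 'dtext' ],
        end_h  => [ sub { $inside = 0 if $_[0] eq $rule->{tag} }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;
    return @jobs;
}

# Usage with two made-up page fragments.
my @a = extract_jobs('company-a.example',
    '<ul><li class="job">Perl Developer</li><li>Contact us</li></ul>');
my @b = extract_jobs('company-b.example',
    '<h3 class="position">QA Engineer</h3><h3>About</h3>');
print "$_\n" for @a, @b;
```

This simple version does not handle a matching tag nested inside itself; pages structured that way would need a depth counter instead of the single `$inside` flag.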

© 2013 DaniWeb® LLC