Dec 3rd, 2008

Plan for parsing HTML

Hi! I just inherited a rather large legacy site here at work that has no database behind it. It's a large volume of HTML pages with the content written directly into each page. I need to extract the content and move it into a database or into XML files.

Each section of the HTML pages has a header tag and a standard title, so I'm thinking I should write a Perl script to parse the pages based on the header tags and insert the sections into MySQL.
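
To make the header-based split concrete, here is a minimal sketch of the idea. It is written in Python rather than Perl, purely for illustration, and it assumes that sections start at `<h1>` to `<h6>` tags and that text before the first header can be ignored. The names `SectionSplitter` and `split_sections` are made up for this example; the real pages may need extra handling (for instance, skipping `<script>` content).

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Tags that should act as word boundaries in the extracted text.
BLOCK_TAGS = {"p", "div", "li", "tr", "td", "br"}


class SectionSplitter(HTMLParser):
    """Collects [title, body] pairs, starting a new section at each header tag."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self._in_header = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self._in_header = True
            self.sections.append(["", ""])
        elif tag in BLOCK_TAGS and self.sections:
            # Keep adjacent paragraphs from running together.
            self.sections[-1][1] += " "

    def handle_endtag(self, tag):
        if tag in HEADER_TAGS:
            self._in_header = False

    def handle_data(self, data):
        if not self.sections:
            return  # text before the first header is ignored
        index = 0 if self._in_header else 1
        self.sections[-1][index] += data


def split_sections(html):
    """Return a list of (title, body_text) tuples with whitespace normalized."""
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    return [(title.strip(), " ".join(body.split()))
            for title, body in parser.sections]
```

Each `(title, body_text)` pair could then become one database row or one XML element, keyed by the standard title.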

Before I begin, I thought I'd check with you guys to see if you have had any similar experience or have any recommendations.

Thanks!

Tom Tolleson
Last edited by Tom Tolleson; Dec 3rd, 2008 at 12:02 pm. Reason: typo
Dec 4th, 2008

Re: Plan for parsing HTML

I've done the same, only with PHP. Using a regex I stripped out the actual content and put it into the DB. You may need to escape the content, but that depends on your insertion method and column type.
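
A rough sketch of that regex-plus-insert approach, in Python rather than PHP and with SQLite standing in for the real database, might look like the following. The `<div id="content">` wrapper, the `pages` table, and the function names are all assumptions for illustration; the pattern also assumes there is no nested `<div>` inside the content block. Using query placeholders means the driver handles quoting, which sidesteps most manual escaping.

```python
import re
import sqlite3

# Hypothetical pattern: assumes the content sits in <div id="content">...</div>
# and that no other <div> is nested inside it.
CONTENT_RE = re.compile(r'<div id="content">(.*?)</div>',
                        re.DOTALL | re.IGNORECASE)


def open_database(path=":memory:"):
    """Open a SQLite database and create the (assumed) pages table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (path TEXT PRIMARY KEY, content TEXT)"
    )
    return conn


def extract_content(html):
    """Return the inner HTML of the content block, or None if it is missing."""
    match = CONTENT_RE.search(html)
    return match.group(1).strip() if match else None


def store_page(conn, path, html):
    """Insert one page's content; returns False when no content block is found."""
    content = extract_content(html)
    if content is None:
        return False
    # Placeholders pass the value separately from the SQL text,
    # so quotes and semicolons in the content need no manual escaping.
    conn.execute("INSERT INTO pages (path, content) VALUES (?, ?)",
                 (path, content))
    conn.commit()
    return True
```

Pages where `store_page` returns `False` are worth logging, since they are the ones that do not match the expected layout and need a look by hand.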
pritaeas
Dec 8th, 2008

Re: Plan for parsing HTML

I did it with Notepad.

I used find and replace to replace each tag with either nothing or the separators needed to import the data into the database.
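
The same find-and-replace idea can be scripted when there are too many pages to do by hand. This is only a sketch in Python, assuming (hypothetically) that the data sits in table rows: boundaries between cells become a separator, every other tag becomes nothing, and the result is written out as CSV for import. The function names are invented for the example.

```python
import csv
import io
import re
from html import unescape

# A control character that is unlikely to appear in page text.
SEPARATOR = "\x1f"


def row_to_fields(row_html):
    """Split one table row into plain-text fields."""
    # Replace each cell boundary (</td><td ...> or </th><th ...>) with the separator.
    text = re.sub(r"</t[dh]>\s*<t[dh][^>]*>", SEPARATOR, row_html,
                  flags=re.IGNORECASE)
    # Replace every remaining tag with nothing.
    text = re.sub(r"<[^>]+>", "", text)
    # Decode entities such as &amp; after the tags are gone.
    return [unescape(field).strip() for field in text.split(SEPARATOR)]


def rows_to_csv(rows_html):
    """Return CSV text with one line per input row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows_html:
        writer.writerow(row_to_fields(row))
    return buffer.getvalue()
```

Letting the `csv` module do the quoting matters when a field itself contains a comma, which a plain find and replace would silently turn into an extra column.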
MidiMagic


© 2011 DaniWeb® LLC