
Plan for parsing HTML

Hi! I just inherited a rather large legacy site here at work that has no database behind it. It's a large set of static HTML pages with the content written directly into each page. I need to extract the content and move it into a database or XML files.

Each section of the HTML pages has a header tag and a standard title, so I'm thinking I should write a Perl script to parse the pages based on the header tags and insert the sections into MySQL.
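For illustration, here is a minimal sketch of that plan. It is in Python rather than Perl, and it uses the standard-library `sqlite3` module as a stand-in for MySQL. The `sections` table, its columns, and the function names are assumptions made up for this example, not anything from the actual site.

```python
import sqlite3
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Collect (title, body) pairs, starting a new section at each h1-h6 tag."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sections = []      # finished (title, body) tuples
        self._in_heading = False
        self._title = []        # text pieces of the heading being read
        self._body = []         # text pieces of the current section body
        self._current = None    # title of the section being collected

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._flush()       # a new heading closes the previous section
            self._in_heading = True
            self._title = []

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False
            self._current = " ".join("".join(self._title).split())
            self._body = []

    def handle_data(self, data):
        if self._in_heading:
            self._title.append(data)
        elif self._current is not None:
            self._body.append(data)

    def _flush(self):
        if self._current is not None:
            body = " ".join("".join(self._body).split())
            self.sections.append((self._current, body))
            self._current = None

    def close(self):
        super().close()
        self._flush()           # store the last section on the page


def extract_sections(page_html):
    parser = SectionParser()
    parser.feed(page_html)
    parser.close()
    return parser.sections


def store_sections(conn, page_name, sections):
    # Hypothetical schema: one row per section of each page.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sections (page TEXT, title TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO sections VALUES (?, ?, ?)",
        [(page_name, title, body) for title, body in sections],
    )
    conn.commit()
```

This keeps only the text of each section, so inline markup is dropped, and text before the first heading is ignored. Any script or style text inside a section would end up in the body, so real pages may need extra filtering.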

Before I begin, I thought I'd check with you guys to see if you have had any similar experience and recommendations.

Thanks!

Tom Tolleson


I've done the same, only with PHP. I used a regex to extract the actual content from each page and inserted it into the DB. You may need to escape the content, but that depends on your insertion method and column type.
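A rough sketch of the regex approach, in Python rather than PHP, could look like the following. It assumes sections are introduced by `<h2>` tags, and it uses `sqlite3` with a made-up `sections` table as a stand-in for the real database. Passing values through query placeholders lets the driver handle quoting, so no manual escaping is needed for the insert itself.

```python
import html
import re
import sqlite3

# Assumption: each section starts with an <h2> heading and runs until the
# next <h2>, the closing </body> tag, or the end of the input.
SECTION_RE = re.compile(
    r"<h2[^>]*>(.*?)</h2>(.*?)(?=<h2[^>]*>|</body>|\Z)",
    re.DOTALL | re.IGNORECASE,
)
TAG_RE = re.compile(r"<[^>]+>")


def clean(fragment):
    """Drop tags, decode entities such as &amp;, and collapse whitespace."""
    text = html.unescape(TAG_RE.sub(" ", fragment))
    return " ".join(text.split())


def regex_sections(page_html):
    return [(clean(title), clean(body))
            for title, body in SECTION_RE.findall(page_html)]


def insert_sections(conn, sections):
    # Hypothetical table; the ? placeholders take care of quoting.
    conn.execute("CREATE TABLE IF NOT EXISTS sections (title TEXT, body TEXT)")
    conn.executemany("INSERT INTO sections VALUES (?, ?)", sections)
    conn.commit()
```

Regexes are fragile against irregular markup, so this only works well when the pages really do follow one consistent template. Most MySQL drivers for Python use `%s` placeholders instead of `?`.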

pritaeas

I did it with Notepad.

I used find and replace to replace each tag with either nothing or the separators needed to import the data into the database.
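The same find-and-replace idea can be scripted so it runs over many files. The sketch below is only one possible mapping: the replacement table and the choice of tab and newline as separators are assumptions for this example, and a real site would need its own list of tags.

```python
import csv
import io

# Order matters: flatten existing whitespace first, then turn tags into
# separators (newline = new record, tab = field boundary) or into nothing.
REPLACEMENTS = [
    ("\r", " "),
    ("\n", " "),
    ("\t", " "),
    ("<h2>", "\n"),
    ("</h2>", "\t"),
    ("<p>", ""),
    ("</p>", " "),
]


def tags_to_tsv(page_html):
    """Apply plain string replacements and return tab-separated lines."""
    text = page_html
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    lines = (line.strip(" ") for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def tags_to_rows(page_html):
    """Parse the tab-separated text into [title, body] rows for import."""
    reader = csv.reader(
        io.StringIO(tags_to_tsv(page_html)),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    return [[field.strip() for field in row] for row in reader]
```

The tab-separated output can then be loaded with whatever bulk import tool the database provides. Tags with attributes, such as `<p class="x">`, are not matched by these literal strings and would need their own entries.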

MidiMagic
