Hi! I just inherited a rather large legacy site here at work that has no database behind it. It's a large volume of HTML pages with the content written directly into each page. I need to extract the content and bring it into a database or into XML files.

Each section of the HTML pages has a header tag and a standard title, so I'm thinking I should write a Perl script to parse the pages based on the header tags and insert the content into MySQL.
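Roughly the shape of what I have in mind (sketched in Python just to show the idea; the real script would be in Perl). It assumes every section title lives in an <h2> tag, the pages sit under htdocs/, and a simple "sections" table is enough; SQLite is used as a stand-in so the sketch runs without a MySQL driver, but the DB-API calls look the same:

    # Walk every page, split it on <h2> headers, and store (page, title, body) rows.
    import glob
    import sqlite3
    from html.parser import HTMLParser

    class SectionParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.sections = []       # finished (title, body) pairs
            self.in_h2 = False
            self.title = None
            self.body = []

        def handle_starttag(self, tag, attrs):
            if tag == "h2":
                self.flush_section() # a new header ends the previous section
                self.in_h2 = True

        def handle_endtag(self, tag):
            if tag == "h2":
                self.in_h2 = False

        def handle_data(self, data):
            if self.in_h2:
                self.title = (self.title or "") + data
            elif self.title is not None:
                self.body.append(data)

        def flush_section(self):
            if self.title is not None:
                self.sections.append((self.title.strip(), "".join(self.body).strip()))
            self.title, self.body = None, []

    conn = sqlite3.connect("site.db")
    conn.execute("CREATE TABLE IF NOT EXISTS sections (page TEXT, title TEXT, body TEXT)")
    for path in glob.glob("htdocs/**/*.html", recursive=True):
        parser = SectionParser()
        parser.feed(open(path, encoding="utf-8", errors="replace").read())
        parser.flush_section()       # keep the last section on the page
        for title, body in parser.sections:
            conn.execute("INSERT INTO sections (page, title, body) VALUES (?, ?, ?)",
                         (path, title, body))
    conn.commit()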

Before I begin, I thought I'd check with you guys to see if you've had any similar experience or have recommendations.

Thanks!

Tom Tolleson

All 2 Replies

I've done the same, only with PHP. Using a regex, I stripped out the actual content and put it into the DB. You may need to escape the content, but that depends on your insertion method and column type.
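Something along these lines, roughly (Python rather than PHP here, and untested; it assumes a sections table already exists). The point is that a parameterized query lets the driver handle the escaping, whatever turns up in the content:

    import re
    import sqlite3  # stand-in for a MySQL connection; the DB-API calls look the same

    html = open("page.html", encoding="utf-8", errors="replace").read()
    db = sqlite3.connect("site.db")

    # Grab "<h2>title</h2> ... body ..." pairs; each body runs to the next <h2> or end of file.
    for title, body in re.findall(r"<h2[^>]*>(.*?)</h2>(.*?)(?=<h2|\Z)", html, re.S | re.I):
        body = re.sub(r"<[^>]+>", " ", body)  # crude tag strip
        # Placeholders let the driver handle quoting/escaping of the content.
        db.execute("INSERT INTO sections (title, body) VALUES (?, ?)",
                   (title.strip(), body.strip()))
    db.commit()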

I did it with Notepad.

I used find and replace to replace each tag with either nothing or the separators needed to import the data into the database.
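For example, replacing every </h2> with a tab and every end-of-section tag with a newline leaves you with a tab-separated file that MySQL's LOAD DATA INFILE (with FIELDS TERMINATED BY '\t') can pull straight in.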
