How can I convert (move) individual HTML webpages in bulk into a MySQL forum?
What I have is just thousands of HTML pages, each consisting of a question and its answers. How can I move them into a MySQL forum?
This is what my question/answer HTML pages look like: drive.google.com/file/d/0B4RFn3PmjNy9YTNkbE5mejBfV1U/view?usp=sharing

So what exactly do I have to do? I just want to move the questions and answers; I don't want user login details or anything else. I have thousands of pages like this.

The question title is between the <title> and </title> tags (it also appears between the <h1> and </h1> tags).

The question description is inside <div id="summaryDescription">.

All the answer descriptions are inside <div class="postContent"> elements.

Only these three items are needed (I don't want the user information; it can be null or just attributed to an account named "anonymous").
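To make that concrete, here is a stripped-down sketch of one of my pages (the tag names are the real ones from my pages; the text is just placeholder):

<html>
<head>
  <title>How do I fix X?</title>
</head>
<body>
  <h1>How do I fix X?</h1>
  <div id="summaryDescription">Full question text here...</div>
  <div class="postContent">First answer text...</div>
  <div class="postContent">Second answer text...</div>
</body>
</html>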

Now, how can I create the PHP script? (I tried reading simplehtmldom.sourceforge.net/manual.htm but I can't understand any of it.)

How can I build the SQL queries for all those files? And then how do I send them to the server via the shell?

As for parsing, what do I have to do? Where do I run this? Are these correct?

<title><?php echo $data['title'] ?></title>,
echo file_get_div('summaryDescription')->plaintext;
echo file_get_div('postContent')->plaintext;

Please guide me. Thanks in advance.

All 2 Replies

While some folks don't like link-only answers, my thought here is that you'll have to code this up yourself.

But how do you get your web pages into files that you can process later with your parser and then load into the database?

I'm thinking cURL. As in https://en.wikipedia.org/wiki/CURL

So the pseudocode would be: get all the pages into files with a script that cURLs the content.
Next you parse it all with your own design, holding the content in some text file ready for the final step.
Finally you write an app to read all those results and file them into your database.
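A minimal sketch of that first step, assuming you already have a list of page URLs (the URL and paths here are made up):

<?php
// fetch-pages.php -- download each page to a local .html file
$urls = ['http://example.com/question-1.html']; // hypothetical list of page URLs

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $html = curl_exec($ch);
    if ($html !== false) {
        file_put_contents("pages/page-$i.html", $html); // one file per page, ready for parsing
    }
    curl_close($ch);
}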

diafol:

simplehtmldom should be able to eat this for breakfast.

1) install simplehtmldom
2) loop over the files to read (e.g. SPL FilesystemIterator or SPL DirectoryIterator, or glob, or even scandir)
3) in the loop, use simplehtmldom to pull the data into an array
4) insert the array elements into a DB record (a rough sketch of steps 2-4 follows)
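Something like this, assuming hypothetical tables questions(title, description) and answers(question_id, content) stand in for whatever your forum schema actually uses:

<?php
include 'simple_html_dom.php'; // step 1: the simplehtmldom library

$pdo = new PDO('mysql:host=localhost;dbname=forum;charset=utf8', 'user', 'pass');

// step 2: loop over the .html files in a directory
foreach (new FilesystemIterator('/path/to/pages') as $file) {
    if ($file->getExtension() !== 'html') continue;

    // step 3: use simplehtmldom to pull the three pieces of data out of the page
    $html = file_get_html($file->getPathname());
    $title       = $html->find('title', 0)->plaintext;
    $description = $html->find('div#summaryDescription', 0)->plaintext;
    $answers = [];
    foreach ($html->find('div.postContent') as $post) {
        $answers[] = $post->plaintext;
    }
    $html->clear(); // free memory before the next file

    // step 4: insert into the DB, answers linked back to their question
    $stmt = $pdo->prepare('INSERT INTO questions (title, description) VALUES (?, ?)');
    $stmt->execute([$title, $description]);
    $qid = $pdo->lastInsertId();

    $stmt = $pdo->prepare('INSERT INTO answers (question_id, content) VALUES (?, ?)');
    foreach ($answers as $answer) {
        $stmt->execute([$qid, $answer]);
    }
}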

Depending on the number and size of files, you may hit the PHP timeout limit. A workaround would be to use a PHP CLI script - however, you may not have access to this if you are on a shared host, but you could still run your script as a cron job.
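For example, a crontab entry like this would run the import script nightly at 2am (the paths are placeholders):

0 2 * * * /usr/bin/php /path/to/import.php >> /path/to/import.log 2>&1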

Once your data is in the DB, you just pull it into a template as required.

//EDIT

Hold on. Just re-read: these pages aren't on your server? In that case you'll probably have to write all the filenames into an array and loop over that, since you won't be able to iterate a directory on a remote server. Or, if there is an index page with a list of files, you can use simplehtmldom to scrape the file names into an array for you (sketch below).
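A sketch of that index-page approach, assuming the index simply links to each question page (the URL is made up):

<?php
include 'simple_html_dom.php';

// hypothetical index page that lists links to all the question pages
$index = file_get_html('http://example.com/questions/index.html');

$urls = [];
foreach ($index->find('a') as $link) {
    $urls[] = $link->href; // collect every link on the index page
}
$index->clear();

// $urls can now feed the cURL download loop shown in the earlier reply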

I am assuming that these pages are yours and that you are not simply stealing somebody else's work. If yes - go for it. If no - there's a special place in hell for people like that :)
