Hi all,

As a learning project I'm rewriting a spider I have in PHP in C++.
One part of the spider is a multi-threaded downloader; all this downloader does is read URLs from a text file, download each page, and save it.
I would like the downloader to strip out all the URLs in a page it downloads so that I can store those in my queue and do the real parsing of the pages later.
Now my question is: should I use Tidy to clean the HTML and then use an XML parser to get what I want?
Or should I, for this simple case (I only need URLs from a page), just use regular expressions?

As a side note, I'm aiming for speed and for being as library- and OS-independent as possible.
This may all sound like a very easy question, but I've always heard everybody say that one should NOT use regexes on HTML; in this case, though, Tidy + XML parsing seems like a lot of overhead to me.


Or not even use a regex at all.

The key here is to make a decent design to begin with; then you should be able to swap different implementations in and out later without having to rewrite the whole thing.
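One way to read "make a decent design": hide the extraction behind a small interface so the downloader never knows which technique is underneath. A sketch under that assumption (all names are made up), including a regex-free implementation that just scans for `href="` with plain string search:

```cpp
#include <string>
#include <vector>

// Hypothetical interface the downloader could depend on, so the
// extraction strategy can be swapped later without touching the rest.
struct LinkExtractor {
    virtual ~LinkExtractor() = default;
    virtual std::vector<std::string> extract(const std::string& html) const = 0;
};

// One regex-free implementation: walk the document with std::string::find.
// Only handles double-quoted href attributes; a sketch, not a parser.
struct ScanExtractor : LinkExtractor {
    std::vector<std::string> extract(const std::string& html) const override {
        std::vector<std::string> urls;
        std::string::size_type pos = 0;
        while ((pos = html.find("href=\"", pos)) != std::string::npos) {
            pos += 6;                        // skip past href="
            auto end = html.find('"', pos);  // find the closing quote
            if (end == std::string::npos) break;
            urls.push_back(html.substr(pos, end - pos));
            pos = end + 1;
        }
        return urls;
    }
};
```

Later, a Tidy+XML-backed class could implement the same `LinkExtractor` interface, and the rest of the spider wouldn't change.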

Use whatever you're most comfortable with, just so you can get something working.

Overly optimising the C++ code could be a waste of time if your bottleneck remains network bandwidth.

I would love not to use a regex at all, but besides regex or Tidy I don't know how to achieve what I want. Actually, I'm NOT comfortable with either and have very little parsing knowledge/experience, so I would love to learn about other options; if you have any links or anything, please let me know.

With most spiders speed is not really that important due to bandwidth limitations, and a "slow" spider is even preferable; that's why mine has a max-per-minute/hour/day limiter. But since it will run on an intranet where bandwidth is not a problem, I do want it to be as fast as possible.
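For what it's worth, a max-per-interval cap like the one described can be a simple fixed-window counter. A single-threaded sketch of that idea (my own class and names, not the poster's code; the multi-threaded downloader would need a mutex around `acquire`):

```cpp
#include <chrono>
#include <thread>

// Fixed-window rate limiter: allow at most N calls per interval,
// block until the window rolls over once the budget is spent.
class RateLimiter {
    using clock = std::chrono::steady_clock;
    int max_per_interval_;
    std::chrono::milliseconds interval_;
    int count_ = 0;
    clock::time_point window_start_ = clock::now();
public:
    RateLimiter(int max_per_interval, std::chrono::milliseconds interval)
        : max_per_interval_(max_per_interval), interval_(interval) {}

    // Blocks until another download is allowed.
    void acquire() {
        auto now = clock::now();
        if (now - window_start_ >= interval_) {  // window expired: reset
            window_start_ = now;
            count_ = 0;
        }
        if (count_ >= max_per_interval_) {       // budget spent: wait it out
            std::this_thread::sleep_until(window_start_ + interval_);
            window_start_ = clock::now();
            count_ = 0;
        }
        ++count_;
    }
};
```

Each downloader thread would call `acquire()` before fetching a URL; a per-minute, per-hour, and per-day cap would just be three limiters checked in sequence.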
