As a learning project, I'm porting a spider I have from PHP to C++.
One part of the spider is a multi-threaded downloader. All this downloader does is read URLs from a text file, download each page, and save it.
I would like the downloader to extract all the URLs from each page it downloads, so that I can store them in my queue and do the real parsing of the pages later.
Now my question is: should I use Tidy to clean the HTML and then use an XML parser to get what I want?
Or, for this simple case (I only need the URLs from a page), should I just use regular expressions?
As a side note, I'm aiming for speed and for as much library and OS independence as possible.
This may all sound like a very easy question, but I have always heard that one should NOT use regex on HTML. In this case, though, Tidy + XML parsing seems like a lot of overhead to me.
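
To make the regex option concrete, here is a minimal sketch of roughly what I have in mind. It assumes C++11 and only the standard library's std::regex; the function name extract_hrefs is just a placeholder of mine, not something from an existing library. It only looks at quoted href attributes and is deliberately naive, so it will miss unquoted attributes and may pick up links inside comments or scripts.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect the values of href="..." / href='...' attributes.
// Naive on purpose: no handling of unquoted attributes, comments, or <script> blocks.
std::vector<std::string> extract_hrefs(const std::string& html) {
    // Group 1 captures the opening quote character, group 2 the text up to
    // the next occurrence of that same quote (non-greedy match).
    static const std::regex href_re(R"(href\s*=\s*(["'])(.*?)\1)",
                                    std::regex::icase);

    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), href_re), end;
         it != end; ++it) {
        urls.push_back((*it)[2].str());
    }
    return urls;
}
```

The returned strings are whatever was between the quotes, so relative links would still have to be resolved against the page URL before going into the queue.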