Is there a C++ library, preferably part of the STL, that can parse regular expressions? Or is there an open-source regular expression library that I could just include as a header instead of having to link against?

I've never been good with regular expressions though. What I'm trying to do is parse out links from HTML source. This is what I've been working on. I haven't tested it though, does this look reasonable? Also, is there a specific way to make it return all matches or is that implementation dependent?

<[:blank:*,^*]*a[:blank:*,^*]+href[:blank:*,^*]*=[:blank:*,^*]*".+">

Edit: I don't know how to prevent it from displaying things as smileys, so I put it in a code block.

Edited 6 Years Ago by Falmarri: n/a

Comments
Wow! Someone that is actually concerned about his post! Good job on CODE Tags!

I used boost in a project where I needed regex. It is distributed as a library though so you need to link to it.

More info here:
http://www.boost.org/doc/libs/release/libs/regex/

With this library, you can put ( ) around the part that you want to return as a match. So e.g.:

Message:
"The price is: 40.00"
And your regex is:
".* ([0-9]{1,}\.[0-9]{1,})"

The boost library will store the part matched inside the ( ), here the price 40.00, as a sub-match for you.
The example is just for illustration.. if there is a mistake in it, please don't cry a river :P
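To make the ( ) idea concrete, here is a minimal sketch of pulling the price out of that message. It assumes a C++11 compiler and uses std::regex from the standard <regex> header rather than boost::regex, so there is nothing extra to link against; the boost interface is very similar. The function name extract_price is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text captured by the ( ) group,
// or an empty string if the message does not match the pattern.
std::string extract_price(const std::string& message) {
    // Same pattern as in the post above; the parentheses mark sub-match 1.
    static const std::regex price_re(R"re(.* ([0-9]{1,}\.[0-9]{1,}))re");

    std::smatch match;
    // regex_match requires the whole message to match the pattern.
    if (std::regex_match(message, match, price_re)) {
        return match[1].str();  // match[0] is the whole match, match[1] the first group
    }
    return "";
}
```

Calling extract_price("The price is: 40.00") gives back "40.00". With boost the same code should work after swapping the std:: names for their boost:: counterparts, but I have only sketched the standard-library version here.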

Thanks. I guess I'll go with boost. I'm writing this for a class that isn't a programming class, so I was hoping to keep the linking to a minimum. I already have to link to curlpp, which isn't exactly standard.

I've been researching boost::regex, but it's all extremely confusing. I can't seem to find the class/method to return all matches within the ( ). So far I have:

const boost::regex re("<[:blank:*,^*]*a[:blank:*,^*]+href[:blank:*,^*]*=[:blank:*,^*]*\"(.+)\">", boost::regex_constants::icase);

Does this look like it will work? Am I missing something? As I said, I'm not very good with regular expressions.
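On returning all matches: a regex iterator walks through every match in the input, and element [1] of each match is whatever the ( ) captured. Below is a rough sketch, assuming C++11 and std::regex instead of boost::regex (boost has an equivalent boost::sregex_iterator). Note that the pattern is not the one from the post: [:blank:*,^*] is replaced with \s*, and the greedy (.+) with ([^"]*) so one match cannot run on into the next link. The name extract_links is hypothetical, and the sketch only handles double-quoted href values.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the value of every double-quoted href
// attribute that appears inside an <a ...> tag.
std::vector<std::string> extract_links(const std::string& html) {
    // (?:[^>]*\s)? skips any attributes that come before href;
    // ([^"]*) captures everything up to the closing quote.
    static const std::regex link_re(
        R"re(<\s*a\s+(?:[^>]*\s)?href\s*=\s*"([^"]*)")re",
        std::regex::ECMAScript | std::regex::icase);

    std::vector<std::string> links;
    // A default-constructed sregex_iterator acts as the end-of-matches marker.
    for (std::sregex_iterator it(html.begin(), html.end(), link_re), end;
         it != end; ++it) {
        links.push_back((*it)[1].str());
    }
    return links;
}
```

This is fine for a quick scrape of well-behaved pages, but it will miss single-quoted or unquoted href values and can be fooled by tags inside comments or scripts.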

Also, I forgot to take into account links that aren't plain URLs, for example ones ending in a file extension. I'll handle that after I see if I'm even close with my regular expression.

And this doesn't take into account explicit ports, but I think I'm going to forget about that for now.

Upon further research, apparently it was a mistake to try to do this with regular expressions. I should have done my googling before jumping into writing a regex. Apparently I should be looking for an HTML -> XML converter/XML parser.

Ok I've had a lot of trouble finding an HTML parser. I'd think there would be more. Is there anything that anyone here knows that will extract links out of html? Preferably something without a runtime library?
