Hey all,

I usually don't like asking for help on these sites, but I have been kicking myself for a while now trying to get Boost.Regex working in a program (I have never had any issues with it before). I believe the issue is with my regex syntax, but I have tried numerous variations and have yet to have a successful run. The call looks like:

scrape(game, "<some-tag>(.*?)</some-tag>", 1);

Fairly simple. The scrape function is a public member function of a handler class I made, implemented like this:

std::string my_class::scrape(const std::string &base, const std::string &match, const int &set)
{
     boost::regex re(match);
     boost::smatch matches;

     if(boost::regex_search(base, matches, re))
     {
          std::string value(matches[set].first, matches[set].second);
          return value;
     }

     return "";
}
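One way to narrow this down is to run the same logic against a small, known string. The sketch below is a hypothetical free-function version of scrape(), not the original class member. It assumes std::regex as a stand-in for boost::regex (the calls used here have the same shape in both libraries), and the sample input in check_scrape() is made up. It also guards the group index and treats a malformed pattern as "no match".

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical free-function version of scrape(); std::regex stands in for boost::regex.
std::string scrape(const std::string &base, const std::string &match, std::size_t set)
{
    try
    {
        const std::regex re(match);
        std::smatch matches;

        // Only read the group if the search succeeded and the index exists.
        if (std::regex_search(base, matches, re) && set < matches.size())
            return matches[set].str();
    }
    catch (const std::regex_error &)
    {
        // A malformed pattern throws; this sketch treats it like "no match".
    }

    return "";
}

// Self-check against made-up input.
void check_scrape()
{
    const std::string game = "<root><some-tag>42</some-tag></root>";
    assert(scrape(game, "<some-tag>(.*?)</some-tag>", 1) == "42");
    assert(scrape(game, "<other-tag>(.*?)</other-tag>", 1).empty());
    assert(scrape(game, "<some-tag>(.*?)</some-tag>", 5).empty());
}
```

If a check like this passes but the real input still returns an empty string, the pattern syntax is probably not the culprit, and the actual contents of the input string are worth printing and inspecting.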

Like I said, I think the issue has something to do with '<', '>', or some other syntax in the regex.

Any help would be greatly appreciated. Thank you

All 4 Replies

Do you have to escape the angled brackets? So use \< and \> instead of < and >, respectively.
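For what it's worth, '<' and '>' are ordinary characters in Perl-style regex syntax, so they should not need escaping. In Boost.Regex's default syntax, \< and \> are word-boundary assertions (start and end of a word), so escaping the brackets changes the meaning of the pattern. Inside a C++ string literal the backslash would also have to be doubled ("\\<"). The sketch below only demonstrates the first point. It assumes std::regex as a stand-in for boost::regex, and the helper name found() and the sample string are made up for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if pattern matches anywhere in text.
// std::regex stands in for boost::regex here.
bool found(const std::string &text, const std::string &pattern)
{
    return std::regex_search(text, std::regex(pattern));
}

// Self-check against made-up input.
void check_brackets()
{
    const std::string xml = "<some-tag>value</some-tag>";
    assert(found(xml, "<some-tag>"));    // unescaped brackets match literally
    assert(found(xml, "</some-tag>"));   // so does the closing tag
    assert(!found(xml, "<missing>"));    // a tag that is not present is not found
}
```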

From a class I wrote a while back, I used: <td(.*?)</td>
which is a non-greedy regex, so each match stops at the first closing tag instead of running on to the last one. Anyway, this is the stuff I was speaking of. I used it to create a stock ticker that grabbed stocks from a website and displayed their values, etc.
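To make the greedy versus non-greedy difference concrete, here is a minimal sketch. It assumes std::regex as a stand-in for boost::regex, uses a simplified <td> pattern with the closing '>' included, and the helper name first_group() and the sample row are made up.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns capture group 1 of the first match, or "" if none.
// std::regex stands in for boost::regex here.
std::string first_group(const std::string &text, const std::string &pattern)
{
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern)) && m.size() > 1)
        return m[1].str();
    return "";
}

// Self-check against made-up input.
void check_greediness()
{
    const std::string row = "<td>a</td><td>b</td>";

    // Greedy: (.*) runs on to the LAST </td> in the input.
    assert(first_group(row, "<td>(.*)</td>") == "a</td><td>b");

    // Non-greedy: (.*?) stops at the FIRST </td>.
    assert(first_group(row, "<td>(.*?)</td>") == "a");
}
```

Note that non-greedy matching only shortens each individual match. Getting every match still requires searching repeatedly.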

preg_match_all is actually from PHP, but I sort of translated it to C++. strpos just converts the regex matches to a position in the file. substr gets us the info between the tags.

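#include <boost/regex.hpp>
#include <exception>
#include <iostream>
#include <string>

using namespace std;

// Walks through Source match by match. Each match overwrites ID,
// so ID ends up holding the LAST match of expression (or is left unchanged if there is none).
void preg_match_all(const string &Source, const boost::regex &expression, string &ID)
{
       try
       {
           std::string::const_iterator start = Source.begin();
           std::string::const_iterator end = Source.end();
           boost::smatch what;
           boost::match_flag_type flags = boost::match_default;

           while(boost::regex_search(start, end, what, expression, flags))
           {
                //Destination = boost::regex_replace(Source, expression, "");
                ID = what[0];
                start = what[0].second;   //Continue searching after this match.
           }
       }
       catch(exception &e)
       {
           cout<<"Exception Caught.. Function: preg_match_all. "<<e.what()<<"\n\n";
       }
}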


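//Finds the first SizeOf_Regex characters of Regex (a plain string, not a pattern) in Data,
//starting at Offset, and returns that position shifted by additional.
//Returns string::npos if nothing is found.
static size_t strpos(const string &Data, const string &Regex, int Offset, int SizeOf_Regex, int additional)
{
    size_t Found = Data.find(Regex.c_str(), Offset, SizeOf_Regex);
    if(Found == string::npos)
        return string::npos;   //Not found: adding the shift here would wrap around.

    return Found + additional;
}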

string  DataHolding = "Some HTML File";  //This will be our source.

int main()
{
   size_t Start, End;
   boost::regex SnExpression("<[a-z]+ class=\"wsod_smallSubHeading\"", boost::regex::icase);
   boost::regex SxExpression("<h1 class=\"wsod_fLeft(.*)\" style=\"margin-top:6px;\">", boost::regex::icase);

   string StockID, StockX;
   preg_match_all(DataHolding, SnExpression, StockID);   //Match the first part of the tags..
   preg_match_all(DataHolding, SxExpression, StockX);    //Match the end tag..

   Start = strpos(DataHolding, StockX, 0, StockX.size(), StockX.size());  //Get The position of the END of the start tag.
   End = strpos(DataHolding, StockID, Start, StockID.size(), -1);         //Get the position of the BEGINNING of the end tag..
   string Final = DataHolding.substr(Start, End-Start);     //From the Start Pos, Copy Everything Until the End Pos to a string..
}
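PHP's preg_match_all returns every match, whereas the loop above keeps only the last one. A minimal sketch of collecting all of them follows. It assumes std::regex and std::sregex_iterator as stand-ins (Boost provides a boost::sregex_iterator with the same shape), and the function name match_all(), the group parameter, and the sample HTML are made up for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the given capture group from every match in source.
// std::regex / std::sregex_iterator stand in for their Boost counterparts.
std::vector<std::string> match_all(const std::string &source,
                                   const std::regex &expression,
                                   std::size_t group = 1)
{
    std::vector<std::string> results;

    // A default-constructed iterator marks the end of the match sequence.
    for (std::sregex_iterator it(source.begin(), source.end(), expression), end;
         it != end; ++it)
    {
        if (group < it->size())
            results.push_back((*it)[group].str());
    }

    return results;
}

// Self-check against made-up input.
void check_match_all()
{
    const std::string html = "<td>AAA</td><td>BBB</td>";
    const std::vector<std::string> cells = match_all(html, std::regex("<td>(.*?)</td>"));

    assert(cells.size() == 2);
    assert(cells[0] == "AAA");
    assert(cells[1] == "BBB");
}
```

With the captured text returned directly, the follow-up find and substr steps are not needed for simple cases like this.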

Hey all, thanks for responding. Sorry for the delay, but I haven't been able to log in for the past week or two.

ravenous - Yes, I tried backslash-escaping the angle brackets, to no avail.

deanmsands - Thanks for the link. I will definitely take a look, but I was hoping to leverage an API I already have for scraping data off web pages. This specific example happens to be XML. I also don't really need to spend my time deserializing the entire message, as I am only targeting a few values.

triumphost - Thanks for the snippet. I will try a similar approach today to see if it works.
