RegEx on char array

Question

Suzie999 245 Coding Hobbyist

12 Years Ago

Hi I'm hoping to get some starting points here in a subject that scares me, RegExp.

I've looked at a couple of tutorials but they all seem to be based on std::string.
I'd like to use std::tr1::regex but I don't know where to start or even if it's possible
without conversion, and whether a conversion would be too expensive to the speed of
my code.

Here is an example of a string I find in another larger string

"%xt%var%-1%var%var%var%var%var%var%var%var%var%var%"

Where the vars are the values I want to retrieve. There may or may not be multiple occurrences of
the substring from which I want my values, but it always starts with "%xt%" and the "-1" is
also always present, but the substring may be shorter or longer, or it might be the whole string
as opposed to a sub string, but always ends with ascii char 00.

oops did I mention it's an unsigned char array..

Currently I am parsing this with multiple calls to various hand-rolled iteration functions,
and it's driving me bonkers.

I'll stop there in case it is not even possible.

I'm sorry about having no code to submit, but I simply have zero clue when it comes to regexp.

c++ regex

4 Contributors
11 Replies
2K Views
4 Days Discussion Span
Latest Post 12 Years Ago by Suzie999

All 11 Replies

vijayan121 1,152 Posting Virtuoso

12 Years Ago

I'd like to use std::tr1::regex

You have int _tmain(int argc, _TCHAR* argv[]); so presumably you are using the Microsoft compiler. If it is a recent version (2010 or later), use std::regex instead.

I was expecting all values between % and %

The regular expression %(.*?)% consumes the terminating %, so regex_search() won't find a starting % when looking for the next match. Since you do not need captures of sub-matches, using a regex_iterator would perhaps be simpler.

#include <iostream>
#include <regex>
#include <iterator> // std::begin, std::end, std::distance

int main ()
{
    const char str[] = "%xt%how%-1%now%brown%cow%" ;

    // % followed by any character other than % repeated 1 or more times
    std::regex re( "%[^%]+" ) ; 

    // std::end(str)-1 : -1 to omit the terminating null character from the range
    const std::cregex_iterator begin( std::begin(str), std::end(str)-1, re ), end ;
    std::cout << "#matches: " << std::distance(begin,end) << '\n' ;

    int n = 0 ;
    for( auto iter = begin ; iter != end ; ++iter )
        std::cout << "  " << ++n << ". " << iter->str() << "%\n" ;
}

Output:

#matches: 6
  1. %xt%
  2. %how%
  3. %-1%
  4. %now%
  5. %brown%
  6. %cow%
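
The same iterator can also run directly over an unsigned char buffer, so the data need not be copied into a std::string first. The following is a minimal sketch under that assumption; the helper name fields_from_bytes is hypothetical and not part of the thread, and the capture group is added only to drop the leading %.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): collect every field that follows
// a '%' in a raw unsigned char buffer of 'len' bytes, without first copying
// the buffer into a std::string.
std::vector<std::string> fields_from_bytes( const unsigned char* data, std::size_t len )
{
    // std::regex works on char, so view the same bytes through a const char*
    const char* first = reinterpret_cast<const char*>( data ) ;
    const char* last = first + len ;

    // '%' followed by one or more characters other than '%';
    // the group captures the field without its leading '%'
    static const std::regex re( "%([^%]+)" ) ;

    std::vector<std::string> fields ;
    for( std::cregex_iterator iter( first, last, re ), end ; iter != end ; ++iter )
        fields.push_back( (*iter)[1].str() ) ;

    return fields ;
}
```

For an array like the one in the question, a call such as fields_from_bytes( raw, sizeof(raw) - 1 ) leaves the terminating null character out of the range. The matched fields are still copied into std::string objects; only the buffer itself is left unconverted.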

Edited 12 Years Ago by vijayan121

Suzie999 commented: excellent help +2

vijayan121 1,152 Posting Virtuoso

12 Years Ago

I thought a major point of regex was that it negates the need to iterate through a string. I was expecting to have an array of matches after a regex was carried out.

You do not need to iterate through a string if all that you want is a full match.
If you want to access each submatch one by one, you need to iterate.

I was expecting to have an array of matches after a regex was carried out.

The std::match_results<> filled in by std::regex_match() will contain a sequence of submatches if and only if you have subexpression captures in the regex.

expected output
title1%0.0%4%796.0
title2%0.0%4%796.0
output
title1%0.0%4%796.0% %xt%title2%0.0%4%796.0%

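std::regex_match() succeeds only if the pattern matches the entire string, so even the non-greedy (.*?) has to stretch all the way to the final "% ". To pick out each occurrence separately, use std::regex_search() or a regex iterator.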
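
To illustrate with the pattern from the question: std::regex_match() has to account for the whole string, while std::regex_search() reports the first place the pattern occurs. This is only a sketch; the two helper names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helpers for illustration; the pattern is the one from the thread.

// Capture 1 when the pattern must cover the whole string (std::regex_match).
std::string capture_with_match( const std::string& text )
{
    static const std::regex pattern( "%xt%(.*?)% " ) ;
    std::smatch result ;
    return std::regex_match( text, result, pattern ) ? result[1].str() : std::string() ;
}

// Capture 1 of the first place the pattern occurs (std::regex_search).
std::string capture_with_search( const std::string& text )
{
    static const std::regex pattern( "%xt%(.*?)% " ) ;
    std::smatch result ;
    return std::regex_search( text, result, pattern ) ? result[1].str() : std::string() ;
}
```

With the test string from the question, the first helper returns everything between the first "%xt%" and the last "% ", and the second returns only the first record.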

Is it not possible to get an array of matches?

It is. For example:

#include <iostream>
#include <regex>
#include <algorithm> 
#include <vector>
#include <string>

std::vector< std::string> get_submatches( const std::string& str, 
                                          const std::regex& pattern )
{
    std::vector< std::string> submatches ;

    // iterate through the submatches (omitting the full match at position zero)
    std::sregex_token_iterator iter( str.begin(), str.end(), pattern, 1 ), end ;
    for(  ; iter != end ; ++iter ) submatches.push_back( iter->str() ) ;

    return submatches ;
}

std::vector< std::string > extract_words( const std::string& str, const std::regex& re )
{
    std::vector< std::string > result ;

    const std::sregex_iterator begin( str.begin(), str.end(), re ), end ;
    for( auto iter = begin ; iter != end ; ++iter )
        result.push_back( iter->str() ) ;

    return result ;
}

int main ()
{
    const std::string txt = "%xt%title 1%2.3%4%567.8% %xt%title 9%8.7%6%543.2% " ;
    std::cout << "text: " << txt << '\n' ;

    // pattern starts with %xt and ends with a % space sequence 
    const std::regex pattern( "%xt(.*?)% " ) ;
    // get the submatches (.*?)
    const auto submatches = get_submatches( txt, pattern ) ;

    for( std::size_t i=0 ; i < submatches.size() ; ++i ) // for each submatch
    {
        const std::string& sm = submatches[i] ; 
        std::cout << "submatch: " << sm << '\n' ;

        // in each submatch, extract each word between two % characters
        static const std::regex re( "[^%]+" ) ;
        const auto words = extract_words( sm, re ) ;
        for( std::size_t j=0 ; j < words.size() ; ++j ) // and print them
            std::cout << "  " << words[j] << '\n' ;

    }
}

Output:

text: %xt%title 1%2.3%4%567.8% %xt%title 9%8.7%6%543.2%
submatch: %title 1%2.3%4%567.8
  title 1
  2.3
  4
  567.8
submatch: %title 9%8.7%6%543.2
  title 9
  8.7
  6
  543.2
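
As an alternative to matching the words themselves, a submatch can be split on the % separator, which also discards the % characters. The sketch below uses std::sregex_token_iterator with submatch index -1, which selects the text between matches; the helper name split_on_percent is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split 'record' on '%' and drop any empty pieces
// (for example the one in front of a leading '%').
std::vector<std::string> split_on_percent( const std::string& record )
{
    static const std::regex separator( "%" ) ;
    std::vector<std::string> fields ;

    // submatch index -1 selects the text between matches of the separator
    std::sregex_token_iterator iter( record.begin(), record.end(), separator, -1 ), end ;
    for( ; iter != end ; ++iter )
        if( iter->length() > 0 ) fields.push_back( iter->str() ) ;

    return fields ;
}
```

Applied to the first submatch above, "%title 1%2.3%4%567.8", this gives the same four words as extract_words().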

Edited 12 Years Ago by vijayan121

Suzie999 commented: Superb help here absolutely delighted +0

vijayan121 1,152 Posting Virtuoso

12 Years Ago

have also tested the ascii values of the string and they are identical always beginning with 37 120 116(%xt) and ending with 37 0 (% ).

Ending with % null-char is not the same as ending with % space.

// pattern starts with %xt and ends with a % space sequence 
// const std::regex pattern( "%xt(.*?)% " ) ;

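// pattern starts with %xt and ends with a % null-char sequence
// ("\\0" is the regex escape for the null character; a literal "\000" inside the
// string would end the pattern early, because this constructor stops at the first null)
const std::regex pattern( "%xt(.*?)%\\0" ) ;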
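
As a fuller sketch of matching null-terminated records: the version below spells the terminator with the regex escape \0 (written "\\0" in C++ source), so the pattern string itself contains no null character. It assumes the default ECMAScript grammar and a text buffer whose length includes the null characters; the helper name nul_terminated_records is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return the body of every "%xt...%<NUL>" record in 'buffer'.
// 'buffer' must have been built with an explicit length so that it really
// contains the null characters.
std::vector<std::string> nul_terminated_records( const std::string& buffer )
{
    // "\\0" reaches the regex engine as the two characters \0, which the
    // default ECMAScript grammar reads as "match the null character"
    static const std::regex pattern( "%xt(.*?)%\\0" ) ;

    std::vector<std::string> records ;
    std::sregex_token_iterator iter( buffer.begin(), buffer.end(), pattern, 1 ), end ;
    for( ; iter != end ; ++iter ) records.push_back( iter->str() ) ;

    return records ;
}
```

A buffer built as std::string( raw, raw + sizeof(raw) ) keeps the terminating null of a char array, whereas std::string( raw ) would stop in front of the first null and the pattern would find nothing.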


Ancient Dragon 5,243 Achieved Level 70 Team Colleague Featured Poster · Answer 1 · 2012-09-15T02:09:35+00:00

I just googled for it and found this tutorial as the very first hit

Suzie999 245 Coding Hobbyist · Answer 2 · 2012-09-15T03:22:55+00:00

Yes, that was one of the few I read for std::strings.

Cheers, I'm making a little progress, but not much.

int _tmain(int argc, _TCHAR* argv[])
{
    cmatch result;
    unsigned char raw[] = "%xt%var%-1%var%var%var%var%var%var%var%var%var%var%";
    int raw_len = sizeof(raw);
    string std_str(raw, raw + raw_len);
    regex regexp("%(.*?)%");
    regex_search(std_str.c_str(), result, regexp);
    cout << sizeof(result) << endl;
    cout << result[0]  << endl << result[1]  << endl << result[2] << endl;
    cin.get();

    return 0;
}

Output though is

60 ;sizeof(result)
%xt% ;result[0]
xt ;result[1]
empty ;result[2]

I was expecting all values between % and % :(
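
The output above follows from how regex_search() works: one call reports only the first match, result[1] is the pattern's single capture group, and result[2] is empty because there is no second group. Repeating the search from the end of each match shows a further effect of this pattern: the closing % of one match is no longer available as the opening % of the next. A minimal sketch, with a hypothetical helper name search_all:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: repeat std::regex_search, resuming after each match,
// and collect capture group 1 of every match.
std::vector<std::string> search_all( const std::string& text, const std::regex& re )
{
    std::vector<std::string> captures ;
    std::smatch m ;
    auto first = text.cbegin() ;

    while( std::regex_search( first, text.cend(), m, re ) )
    {
        captures.push_back( m[1].str() ) ;
        if( m[0].first == m[0].second ) break ; // guard: an empty match would never advance
        first = m[0].second ;                   // the next search starts after this match
    }

    return captures ;
}
```

With "%(.*?)%" on "%xt%how%-1%now%brown%cow%" this collects only xt, -1 and brown, because every other field loses its opening %. A pattern that leaves the closing % unconsumed, such as "%([^%]+)", collects all six fields.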

Suzie999 245 Coding Hobbyist · Answer 3 · 2012-09-15T12:07:24+00:00

Thank you vijayan121, I appreciate your time.

I must admit though, I'm a little confused. I thought a major point
of regex was that it negates the need to iterate through a string.

I was expecting to have an array of matches after a regex was carried out.
It's mind boggling this regex :)

Is it not possible to get an array of matches?
I will need also to get rid of all the "%" as they will be surplus to requirements.

rubberman 1,355 Nearly a Posting Virtuoso Featured Poster · Answer 4 · 2012-09-15T12:53:31+00:00

Not directly relevant to solving your problem, but you might be interested in Henry Spencer, the author of one of the earliest widely used regex libraries: http://en.wikipedia.org/wiki/Henry_Spencer

Suzie999 245 Coding Hobbyist · Answer 5 · 2012-09-15T22:39:06+00:00

Here is my latest attempt. I cannot seem to figure these regexps out :(

#include "stdafx.h"
#include <iostream>
#include <regex>
using namespace std;
using namespace std::tr1;
using namespace std::regex_constants;

void parse_test_str(const std::string& test_str)
{

    //match_flag_type fl = extended;

    // regular expression
    const regex pattern("%xt%(.*?)% ");

    // object that will contain the sequence of sub-matches
    match_results<string::const_iterator> result;

    // match the test_str address with the regular expression
    bool valid = regex_match(test_str, result, pattern);

    cout << test_str << " \t: " << (valid ? "found" : "not found") 
        << endl << endl;

    // if the test_str matched the regex, then print the parts
    if(valid)

    {
        cout << "size = " << result.size() << endl;
        for (int i = 0; i < result.size(); i++ ){

            cout << result[i] << endl;
        }

    }
}

int main()
{
   parse_test_str("%xt%title1%0.0%4%796.0% %xt%title2%0.0%4%796.0% ");

   cin.get();

   return 0;
}

expected output

title1%0.0%4%796.0
title2%0.0%4%796.0

output

title1%0.0%4%796.0% %xt%title2%0.0%4%796.0%

I have a feeling it might have something to do with the regexp type constant, but I'm even more lost with those, and there does not seem to be much about it via search.

Suzie999 245 Coding Hobbyist · Answer 6 · 2012-09-16T09:23:09+00:00

vijayan121

I have to say, I cannot thank you enough for this help. It would have taken me weeks if not months to come up with something close to this, and it also explains a lot to me about vectors, which I am also clueless about.

I really appreciate your help and time.

thanks++

Now to get my head around what is going on in your code and adapt it.

Suzie999 245 Coding Hobbyist · Answer 7 · 2012-09-19T13:01:31+00:00

I have progressed a fair bit thanks, but I have a problem with real
world implementation/results.

I have used the example code above and it works just fine.

But when the code is in the wild I have a big problem.

The string in this line const std::string& txt = "%xt%title 1%2.3%4%567.8% %xt%title 9%8.7%6%543.2% " ; is retrieved on the fly in the program and seems to be fine until it reaches the part in the code where it is passed to

std::vector< std::string> get_submatches( const std::string& str, 
                                          const std::regex& pattern )
{
    // test that the string is intact and as expected.
    cout << str << endl; // this outputs the exact same string as though it
                         // were entered manually as in the example.

    //test the string length
    cout << str.length() << endl; // seems fine always correct length

    std::vector< std::string> submatches ;

    // iterate through the submatches (omitting the full match at position zero)
    std::sregex_token_iterator iter( str.begin(), str.end(), pattern, 1 ), end ;

    //test there were matches.
    cout << (iter == end) << endl; // here I find that in the wild there are
                                   // no matches ie output is 1 (iter==end)  
                                   // whereas it is 0 in the example

    for(  ; iter != end ; ++iter ) {

        submatches.push_back( iter->str() ) ;
    }

    return submatches ;
}

I have also tested the ascii values of the string and they are identical
always beginning with 37 120 116(%xt) and ending with 37 0 (% ).

Anyone have some ideas why this might happen?

Suzie999 245 Coding Hobbyist · Answer 8 · 2012-09-19T16:41:52+00:00

I'm a little bit annoyed with myself for not seeing that. After viewing
the ascii codes for exactly that purpose, I looked at 0 and saw a space.

Thank you very much once more, you have helped me a lot
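
A small debugging helper of the following kind makes a trailing null character (0) easy to tell apart from a space (32) when checking what a string really ends with. It is only a sketch, and the name byte_values is hypothetical.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical debugging helper: list the numeric value of every byte in
// 'text', separated by spaces.
std::string byte_values( const std::string& text )
{
    std::ostringstream out ;
    for( std::size_t i = 0 ; i < text.size() ; ++i )
    {
        if( i != 0 ) out << ' ' ;
        // go through unsigned char so bytes above 127 do not print as negative
        out << static_cast<int>( static_cast<unsigned char>( text[i] ) ) ;
    }
    return out.str() ;
}
```

For a string ending in % followed by a null character the list ends in "37 0", while one ending in % followed by a space ends in "37 32".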
