Hi,
I am new to regular expressions and am having a problem matching a specific keyword under certain conditions.

string regexstring = "\\bhtml\\b|[.]net\\b";
Regex rgx = new Regex(regexstring, RegexOptions.IgnoreCase);
MatchCollection matcol = null;
string st_data = "I am a .net developer, but also know about asp.net, vb.net ; I work on c#/asp.net platform. I also know dhtml, html4.0, html/xml etc etc.";
// I want the regex to capture all occurrences of html where html is a whole word
// (delimited by \b, a word boundary) or is surrounded by digits, like html4.0 above.
// Strip any HTML tags from the input, then decode HTML entities.
st_data = System.Web.HttpUtility.HtmlDecode(Regex.Replace(st_data, @"<(.|\n)*?>", string.Empty));
matcol = rgx.Matches(st_data);

foreach (Match mat in matcol)
{
    // I will get mat.Value here.
}

I tried several variations, but none of them worked.

I want to match html4.0, but I need only the html part of it, a kind of substring match.

I hope you understand my point.

Please help; any help is appreciated.
Thanks
shankbond

All 10 Replies

I have this so far: string regexstring = @"\bhtml(\d+(\.(\d*))?)?\b|[.]net\b"; It returns 6 matches: ".net", ".net", ".net", ".net", "html4.0", "html".
But I think you want "html4.0" to match as "html" only. Is that right?

Unfortunately yes :)

Got it

string regexstring = @"\bhtml(?=\d+(\.\d*)?\b)?|\.net\b";

No, I haven't. It matches htmllkasjdf as well, because the trailing ? makes the lookahead optional.

Try this one

string regexstring = @"\bhtml\b|\bhtml(?=\d+(\.\d*)?)|\.net\b";
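As a quick check, here is a minimal sketch that runs this pattern against the sample sentence from the question. The class and method names (HtmlKeywordDemo, FindKeywords) are made up for illustration and are not from the original post.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class HtmlKeywordDemo
{
    // Pattern from the reply above: a standalone "html", an "html" that is
    // directly followed by a digit, or ".net" ending at a word boundary.
    public const string Pattern = @"\bhtml\b|\bhtml(?=\d+(\.\d*)?)|\.net\b";

    // Returns the text of every match, in order of appearance.
    public static string[] FindKeywords(string text)
    {
        return Regex.Matches(text, Pattern, RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string data = "I am a .net developer, but also know about asp.net, vb.net ; "
                    + "I work on c#/asp.net platform. I also know dhtml, html4.0, html/xml etc etc.";

        // Prints: .net | .net | .net | .net | html | html
        Console.WriteLine(string.Join(" | ", FindKeywords(data)));
    }
}
```

The "html4.0" in the sentence now contributes only "html", and "dhtml" is skipped because there is no word boundary before its "h".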

Thanks, that worked.

Can you please explain the regex you used?
I don't understand the part in parentheses in \bhtml(...).

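The "(?=...)" is a grouping construct for a "zero-width positive lookahead assertion". In other words, it requires the text ahead to match but does not include that text in the match.
The other bit, "\d+(\.\d*)?", matches a number with or without a decimal part. Note that because the decimal part is optional and nothing anchors the end of the number, the lookahead only requires one digit after "html", so it accepts "html4you" as well as "html4.0"; add "\b" at the end of the lookahead if you want to exclude that.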
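To see what the lookahead accepts, here is a small sketch comparing the pattern as used in this thread with a stricter variant. The stricter variant (a "\b" added at the end of the lookahead) is my own assumption, not something posted above, and the names LookaheadDemo, Loose, Strict and MatchOrNone are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class LookaheadDemo
{
    // Lookahead as used in the thread: at least one digit must follow "html".
    public const string Loose = @"\bhtml(?=\d+(\.\d*)?)";

    // Assumed stricter variant: the number must also end at a word boundary.
    public const string Strict = @"\bhtml(?=\d+(\.\d*)?\b)";

    // Returns the matched text, or "(no match)" if the pattern does not match.
    public static string MatchOrNone(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Value : "(no match)";
    }

    public static void Main()
    {
        foreach (string input in new[] { "html4.0", "html4", "html4you", "htmlx" })
        {
            Console.WriteLine("{0,-9} loose: {1,-11} strict: {2}",
                input, MatchOrNone(Loose, input), MatchOrNone(Strict, input));
        }
    }
}
```

In both patterns the match value is only "html", because the lookahead consumes nothing. The difference is "html4you": the loose pattern matches it and the strict one does not.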

This link explains most of the constructs used in Regex: Regular Expression Language Elements
I always have to read it to find out what to use.

Many thanks for the reply again. I have been working on regexes for hours now and scratching my head,
but I still cannot understand the difference between
(?:........) non-capturing group
(?=.............) lookahead, non-capturing
(?<=.............) lookbehind, non-capturing

I tried (?:...) in place of the lookahead, but it didn't work.

Can you explain a little about it?

However, by tweaking the code you gave a little,
I created
(?<=\d+(\.\d*)?|\b)html(?=\d+(\.\d*)?|\b)

which matches html, 221html, html4.0, but not abchtml, etc.
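A minimal sketch to check that pattern against the inputs listed above. The names (LookbehindDemo, ContainsHtml) are made up for illustration. One caveat: the lookbehind here has variable length, which .NET supports, but some other regex engines only accept fixed-width lookbehinds.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class LookbehindDemo
{
    // "html" must be preceded by a number or a word boundary,
    // and followed by a number or a word boundary.
    public const string Pattern = @"(?<=\d+(\.\d*)?|\b)html(?=\d+(\.\d*)?|\b)";

    // True if the pattern matches anywhere in the input.
    public static bool ContainsHtml(string input)
    {
        return Regex.IsMatch(input, Pattern, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        foreach (string input in new[] { "html", "221html", "html4.0", "abchtml", "dhtml", "htmlabc" })
        {
            Console.WriteLine("{0,-8} -> {1}", input, ContainsHtml(input));
        }
    }
}
```

The first three inputs match and the last three do not, since a letter directly before or after "html" satisfies neither the digit alternative nor the word boundary.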

I think that "non-capturing group" means that the expression is matched, and its text is part of the match, but it does not create a new group in the Match.Groups collection.

The other two are described as "zero-width", which means the expression must match at that position but consumes no characters, so its text is not included in the match.
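Here is a small sketch that puts the constructs side by side on the input "html4", which may also show why swapping (?:...) in for the lookahead changes the result. The class and method names (GroupKindsDemo, MatchedText, GroupCount) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class GroupKindsDemo
{
    // Text of the first match, or an empty string if there is none.
    public static string MatchedText(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    // Number of groups in the first match; group 0 (the whole match) is always counted.
    public static int GroupCount(string pattern, string input)
    {
        return Regex.Match(input, pattern).Groups.Count;
    }

    public static void Main()
    {
        string input = "html4";
        string[] patterns =
        {
            @"html(\d+)",    // capturing group: digits are in the match and in Groups[1]
            @"html(?:\d+)",  // non-capturing group: digits are in the match, no extra group
            @"html(?=\d+)",  // lookahead: digits must follow but are not part of the match
            @"(?<=html)\d+"  // lookbehind: "html" must precede but is not part of the match
        };

        foreach (string pattern in patterns)
        {
            Console.WriteLine("{0,-14} value: {1,-6} groups: {2}",
                pattern, MatchedText(pattern, input), GroupCount(pattern, input));
        }
    }
}
```

The first two patterns both return "html4"; only the lookahead version returns "html", and the lookbehind version returns "4". So a non-capturing group still consumes the digits, while a lookaround only tests for them.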

I am no expert on RegEx and, whenever I have needed to use them, I too have spent hours fiddling with different options until I either found what I needed or gave up and found some other method.

I just found this web page that seems to explain RegEx in quite a bit of detail. Take a look.
The 30 Minute Regex Tutorial

Good luck with your project.

Thank you for the article, nick. Really interesting; I didn't know about these "must match but do not capture" operators. Pretty awesome.

Thanks a lot again nick.
The problem is solved now.
