Hi, I am using a regular expression to extract paragraphs from HTML code, but it gives me only one line (the first one), written in the first <p>. I want the whole article in my string. Here is my code.

Match m = Regex.Match(htmlstring, @"<p>\s(.+?)\s</p>");

where "htmlstring" holds all the HTML code as text.

All 4 Replies

Don't reinvent the wheel! Use Html Agility Pack

This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don't actually HAVE to understand XPath or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml offers, but for HTML documents (or streams).
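As a rough sketch of what that could look like for the question above, assuming the HtmlAgilityPack NuGet package is referenced: the class and method names (`HapParagraphs`, `ExtractParagraphs`) are made up for illustration, while `HtmlDocument.LoadHtml`, `SelectNodes`, `InnerText` and `HtmlEntity.DeEntitize` come from the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is installed

public static class HapParagraphs
{
    // Hypothetical helper: returns the text of every <p> element in the HTML.
    public static List<string> ExtractParagraphs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // "//p" is an XPath expression selecting all <p> elements at any depth.
        // SelectNodes returns null (not an empty list) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//p");
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode p in nodes)
        {
            // InnerText drops nested tags; DeEntitize turns &amp; etc. into characters.
            result.Add(HtmlEntity.DeEntitize(p.InnerText).Trim());
        }
        return result;
    }
}
```

To get the whole article as one string, something like `string.Join(Environment.NewLine, HapParagraphs.ExtractParagraphs(htmlstring))` should do.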

Can you not give the <p> tag an ID attribute and then use that ID to reference the text of the <p> tag?

@Chris I don't understand that. I have all the HTML code in a string, and now I just want to pull out the paragraphs written between the "p" tags. I am trying to fetch the whole article.

While I can't vouch for HAP, I do agree with __avd. In any case, by default the dot operator (.) matches any character except the line feed character (\n), so a paragraph that spans several lines is not matched. You can change the default behaviour by doing something like this:

// RegexOptions.Singleline makes '.' match '\n' as well
Regex regex = new Regex(@"<p>\s*(.+?)\s*</p>", RegexOptions.Singleline);
Match m = regex.Match(htmlstring);

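Just note that a lot of things can go wrong when doing this. The other option is to put * after the subexpression so that it repeats. If you want to capture multiple paragraphs, wrap the whole pattern in parentheses () and add * after it; in .NET, each repetition is then available through that group's Captures collection. This only works when the paragraphs follow one another directly, or when the pattern also allows for whatever sits between them.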
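For what it's worth, here is a minimal sketch of collecting every paragraph. Note that it uses Regex.Matches to walk over all matches rather than the repeated group described above, since that also copes with other markup between paragraphs. The names `ParagraphExtractor` and `ExtractParagraphs` are made up for illustration, and the pattern only handles bare <p> tags without attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphExtractor
{
    // Hypothetical helper: returns the text between every <p> and </p> pair.
    public static List<string> ExtractParagraphs(string html)
    {
        // Singleline lets '.' match '\n'; the lazy '.*?' stops at the nearest </p>
        // and, unlike '.+?', also copes with an empty <p></p>.
        var regex = new Regex(@"<p>\s*(.*?)\s*</p>",
                              RegexOptions.Singleline | RegexOptions.IgnoreCase);

        var paragraphs = new List<string>();
        foreach (Match m in regex.Matches(html))
        {
            // Groups[1] is the parenthesised part, without the surrounding tags.
            paragraphs.Add(m.Groups[1].Value);
        }
        return paragraphs;
    }

    public static void Main()
    {
        string htmlstring =
            "<p>First line\nsecond line</p>\n<div>skip</div>\n<p> Another one </p>";

        // Join the paragraphs into a single string holding the whole article.
        string article = string.Join(Environment.NewLine, ExtractParagraphs(htmlstring));
        Console.WriteLine(article);
    }
}
```

Regex.Match only ever returns the first match, which is why the original code produced a single paragraph; Regex.Matches returns all of them.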

F**ks sake, I am really starting to hate this new text editor. Writing this one short post was painful.
