Regular expression for HTML web source

Question

Acidburn 0 Posting Pro

13 Years Ago

Hi guys, I've been banging my head against this all day.

I've got a regular expression and I'm trying to extract a menu from the HTML page source so I can assert on it within a test. For the life of me, I can't get it to work correctly.

Here's the page source:

<a href="#"></a><a href="#"">Home</a>
 : 
<a href="#" style="text-decoration:none;">Hello</a>
 : 
<a href="#" style="text-decoration:none;">World</a>
 : 
Today
<a id="sitemap"></a>

Basically, I want to extract 'Home : Hello : World : Today'.

But this is in the middle of an HTML page, so I want to ignore everything else.

Here's my attempt at a regex
(<span\s*id="sitemap"\s.* )

But this doesn't appear to be working.

html-css regex


thines01 401 Postaholic Team Colleague Featured Poster · Answer 1 · 2011-07-16T21:47:36+00:00

The difficulty is that the text you're trying to capture is wrapped in markup that the regular expression can also match.

You might need to remove some unwanted elements before using a Regex.
Are you reading this as one string or multiple?
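If you do want to stay with Regex, here is a minimal sketch of the "remove the unwanted elements first" idea. It assumes the menu fragment has already been isolated into one string; `MenuText` and `StripTags` are hypothetical names made up for illustration, and the sample input is the fragment from the question as posted.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MenuText
{
    // Hypothetical helper: replaces every tag with a space, then tidies whitespace.
    public static string StripTags(string fragment)
    {
        // "<[^>]+>" matches one tag: '<', then everything up to the next '>'.
        string withoutTags = Regex.Replace(fragment, "<[^>]+>", " ");

        // Collapse runs of spaces and line breaks into single spaces.
        return Regex.Replace(withoutTags, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        // The fragment from the question, including the stray quote after href="#".
        string fragment =
            "<a href=\"#\"></a><a href=\"#\"\">Home</a>\n : \n" +
            "<a href=\"#\" style=\"text-decoration:none;\">Hello</a>\n : \n" +
            "<a href=\"#\" style=\"text-decoration:none;\">World</a>\n : \n" +
            "Today\n<a id=\"sitemap\"></a>";

        Console.WriteLine(StripTags(fragment)); // prints: Home : Hello : World : Today
    }
}
```

This is only safe on simple markup. A `>` inside an attribute value, an HTML comment, or a script block would confuse the tag pattern, so it is not a general HTML parser.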

Does it really need to be a regular expression?
(Assuming you can use LINQ.)
Treated as one big string, you could parse it with something like:

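// Requires: using System; using System.Linq;
// strRawHtml holds the page fragment as a single string.
Console.WriteLine(
            string.Join(" ",
               strRawHtml.Split("<>\r\n".ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
               .Select(s => s.Trim())          // so " : " does not add doubled spaces
               .Where(
                  s => s.Length > 0
                  && !s.Contains('/')          // closing tags such as "/a"
                  && !s.Contains('"')          // opening tags that carry attributes
                  && !s.Equals("span")         // bare span tags
               ).ToArray()));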
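
Since the goal is to assert on the result in a test, the split-and-filter idea can be wrapped in a helper and compared against the expected string. This is a rough sketch only: `MenuSplit` and `ExtractText` are hypothetical names, each piece is trimmed so the " : " separators don't pick up extra spaces, and a plain comparison stands in for whatever assert your test framework provides.

```csharp
using System;
using System.Linq;

public static class MenuSplit
{
    // Hypothetical helper: splits on tag delimiters and line breaks,
    // then keeps only the pieces that do not look like markup.
    public static string ExtractText(string rawHtml)
    {
        var pieces = rawHtml
            .Split(new[] { '<', '>', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(s => s.Trim())
            .Where(s => s.Length > 0
                        && !s.Contains('/')   // closing tags such as "/a"
                        && !s.Contains('"')   // opening tags that carry attributes
                        && s != "span");      // bare span tags
        return string.Join(" ", pieces);
    }

    public static void Main()
    {
        // The fragment from the question, as posted.
        string rawHtml =
            "<a href=\"#\"></a><a href=\"#\"\">Home</a>\n : \n" +
            "<a href=\"#\" style=\"text-decoration:none;\">Hello</a>\n : \n" +
            "<a href=\"#\" style=\"text-decoration:none;\">World</a>\n : \n" +
            "Today\n<a id=\"sitemap\"></a>";

        string expected = "Home : Hello : World : Today";
        string actual = ExtractText(rawHtml);

        // Swap this for your framework's equality assert.
        Console.WriteLine(actual == expected ? "match" : "mismatch: " + actual);
    }
}
```

Bear in mind that the filter also throws away any visible text containing a slash or a double quote, so it only suits simple menus like this one.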