Parse an HTML doc

Question

sarelnet 13 Newbie Poster

12 Years Ago

Hi,
I have an HTML page in one variable.
I need to build a method that will extract a tag's content (def extract_tag(self, tag_name)).
For example, given the webpage:

<div id="mw-page-base" class="noprint"></div>
<div id="mw-head-base" class="noprint"></div>
<!-- content -->
<div id="content" class="mw-body">
    <a id="top"></a>
    <div id="mw-js-message" style="display:none;"></div>
<!-- sitenotice -->
    <div id="siteNotice"><!-- centralNotice loads here --></div>
    <!-- /sitenotice -->
<!-- firstHeading -->
    <h1 id="firstHeading" class="firstHeading"><span dir="auto">Earth</span></h1>
</div>

and the tag named "content", the method should return:

    <a id="top"></a>
    <div id="mw-js-message" style="display:none;"></div>
<!-- sitenotice -->
    <div id="siteNotice"><!-- centralNotice loads here --></div>
    <!-- /sitenotice -->
<!-- firstHeading -->
    <h1 id="firstHeading" class="firstHeading"><span dir="auto">Earth</span></h1>

I want to do it using regex, but I'm not familiar with Python.
Naively, I think the pattern I'm looking for is "\<tag_name(.(\n)(\<div id.(\n)\<\/div>)).\<\/div>".
Note that another tag can be found inside the given tag.

Is this pattern good enough? How do I use re.compile and re.match?

Thanks,
Net

BTW - I know it can be done using bs, but I prefer not to.

html-css python regex

ryantroop 177 Practically a Master Poster · Answer 1 · 2012-08-22T15:23:58+00:00

So this was a fun exercise for me...

I think this should get you at least started...

import sys
import re
import os

try:
    #open our document and read it as a string
    with open(os.path.abspath(sys.argv[1]), 'r') as f:
        doc = f.read()
except IndexError:
    #if no document was given on the command line...
    print("There was an error opening your file.... File given: No file given")
except OSError as e:
    #if we can't open it, display the path to our document and see if it's
    #even there.
    path = os.path.abspath(sys.argv[1])
    print("There was an error opening your file.... File given: " + path, e)
else:
    #declare and compile our pattern as raw string
    #\s* allows optional spaces, so <!--content--> and <!-- content --> both match
    reg = re.compile(r'<!--\s*content\s*-->')
    #split the content between our pattern points
    parts = reg.split(doc)
    if len(parts) > 1:
        #if we get a matching sequence, print everything after the first marker
        #be wary of a TON of newlines
        print("Matches: ", parts[1:])
    else:
        #split returns the whole document as a single piece when nothing matches
        print("No match found")

The document I tested this on was a simple text document with the following inside:

asfasfasfafafsfs

aisbfaibfa;ifbaifafnsabfasifnajfnafanfas )) AFAFA)A)__ )A )__)_0
9as09fansfoaf a00-0--

<html> bajsfnasfA </html>

afnifnafmaf

a
sf
asf
af
asf
a
sfafafsaf

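There is probably a cleaner way to do this, but this was my solution. If you plan on inserting or replacing text, I think you want re.sub instead of re.search. re.match will fail every time here, because it only tries to match at the very beginning of the string; re.search is the one that looks for the pattern anywhere in it.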
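To make the difference between those calls concrete, here is a minimal sketch. The sample string and the [MARK] replacement are made up for illustration, and the pattern assumes the marker is an HTML comment reading "content" with optional spaces around the word.

```python
import re

# Made-up sample text: note the marker is NOT at the start of the string.
doc = "header\n<!-- content -->\nbody text\n<!-- content -->\nfooter"

# \s* tolerates optional spaces inside the comment marker.
marker = re.compile(r"<!--\s*content\s*-->")

# match() only tries at position 0, so it finds nothing here.
at_start = marker.match(doc)

# search() scans the whole string and returns the first hit.
first_hit = marker.search(doc)

# split() cuts the string at every hit; sub() replaces every hit.
pieces = marker.split(doc)
replaced = marker.sub("[MARK]", doc)

print(at_start)
print(first_hit.group(), first_hit.start())
print(pieces)
print(replaced)
```

Compiling once with re.compile and then calling the methods on the compiled object is equivalent to passing the pattern string to re.match, re.search, re.split and re.sub each time; it just avoids repeating the pattern.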

Hope that helps!

Ryan
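
On the nested-tag part of the question: a single regex generally cannot balance nested tags, so one possible workaround is to use regex only to find the tags and count the nesting depth by hand. This is a rough sketch, not a real HTML parser. The extract_tag signature with a tag_id and a tag_name argument is my assumption, it expects the id in double quotes, and it will be fooled by self-closing tags or by tags that appear inside comments or scripts.

```python
import re

def extract_tag(html, tag_id, tag_name="div"):
    """Return the inner HTML of the <tag_name> element whose id is tag_id, or None."""
    name = re.escape(tag_name)
    # Opening tag carrying the wanted id, e.g. <div id="content" class="mw-body">
    opener = re.compile(rf'<{name}\b[^>]*\bid="{re.escape(tag_id)}"[^>]*>')
    start = opener.search(html)
    if start is None:
        return None
    # Any opening or closing tag of the same name; group 1 is "/" for a closing tag.
    token = re.compile(rf"<(/?){name}\b[^>]*>")
    depth = 1
    for m in token.finditer(html, start.end()):
        depth += -1 if m.group(1) else 1
        if depth == 0:
            # Found the close that balances our opening tag.
            return html[start.end():m.start()]
    return None  # markup was not balanced

# Shortened version of the page from the question.
page = """<div id="mw-head-base" class="noprint"></div>
<div id="content" class="mw-body">
    <a id="top"></a>
    <div id="siteNotice"><!-- centralNotice loads here --></div>
    <h1 id="firstHeading" class="firstHeading"><span dir="auto">Earth</span></h1>
</div>
"""

print(extract_tag(page, "content"))
```

The depth counter starts at 1 for the tag we found, goes up on every nested opening tag of the same name and down on every closing one, and the content ends where it reaches 0.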