Hi,
I have an HTML page in one variable.
I need to build a method that will extract a tag's content (def extract_tag(self, tag_name)).
For example, given this webpage:

<div id="mw-page-base" class="noprint"></div>
<div id="mw-head-base" class="noprint"></div>
<!-- content -->
<div id="content" class="mw-body">
    <a id="top"></a>
    <div id="mw-js-message" style="display:none;"></div>
<!-- sitenotice -->
    <div id="siteNotice"><!-- centralNotice loads here --></div>
    <!-- /sitenotice -->
<!-- firstHeading -->
    <h1 id="firstHeading" class="firstHeading"><span dir="auto">Earth</span></h1>
</div>

and the tag named "content", the method should return:

    <a id="top"></a>
    <div id="mw-js-message" style="display:none;"></div>
<!-- sitenotice -->
    <div id="siteNotice"><!-- centralNotice loads here --></div>
    <!-- /sitenotice -->
<!-- firstHeading -->
    <h1 id="firstHeading" class="firstHeading"><span dir="auto">Earth</span></h1>

I want to do it using regex, but I'm not familiar with Python.
Naively, I think the pattern I'm looking for is "\<tag_name(.(\n)(\<div id.(\n)\<\/div>)).\<\/div>".
Note that another tag can be found inside the given tag.

Is this pattern good enough? How do I use re.compile and re.match?

Thanks,
Net

BTW - I know it can be done using bs, but I prefer not to.

So this was a fun exercise for me...

I think this should get you at least started...

import sys
import re
import os

try:
    # open our document and read it as a string
    with open(os.path.abspath(sys.argv[1]), 'r') as f:
        doc = f.read()
except (IndexError, OSError):
    # if we can't open it, display the path to our document and see if it's
    # even there.
    try:
        path = os.path.abspath(sys.argv[1])
    except IndexError:
        # if no document...
        path = 'No file given'
    # print the result
    print("There was an error opening your file.... File given: " + path)
    sys.exit(1)

# declare and compile our pattern as raw string
# (\s* also accepts the '<!-- content -->' spelling from the question)
reg = re.compile(r'<!--\s*content\s*-->')
# split the content between our pattern points
parts = reg.split(doc)
if len(parts) > 1:
    # if the marker was found, print everything after its first occurrence
    # be wary of a TON of newline characters
    print("Matches: ", parts[1:])
else:
    # split returns the whole document as one piece when the marker is absent
    print("No match found")

The document I tested this on was a simple text document with the following inside:

asfasfasfafafsfs

<!--content-->

aisbfaibfa;ifbaifafnsabfasifnajfnafanfas )) AFAFA)A)__ )A )__)_0
9as09fansfoaf a00-0--

<html> bajsfnasfA </html>

afnifnafmaf

<!--content-->

a
sf
asf
af
asf
a
sfafafsaf

There is probably a cleaner way to do this, but this was my solution. If you plan on inserting, I think you want re.sub instead of re.search. re.match will fail on a document like this one, because it only tries to match at the very beginning of the string.
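
To make the difference between re.match, re.search and re.sub concrete, here is a rough sketch on a tiny made-up string. The sample text and the between_markers helper are my own inventions for illustration, not part of the script above, and it assumes the marker appears exactly as written:

```python
import re

MARKER = "<!--content-->"
# Made-up sample document, only for illustration
doc = "intro\n<!--content-->\nbody text\n<!--content-->\ntail\n"

# re.match only tries position 0, so it finds nothing here
print(re.match(re.escape(MARKER), doc))            # None

# re.search scans the whole string for the first occurrence
print(re.search(re.escape(MARKER), doc).start())   # 6


def between_markers(text, marker=MARKER):
    """Hypothetical helper: text between the first two markers, or None."""
    pattern = re.escape(marker) + r"(.*?)" + re.escape(marker)
    # re.DOTALL lets '.' cross newlines; '.*?' stops at the next marker
    found = re.search(pattern, text, re.DOTALL)
    return found.group(1) if found else None


print(repr(between_markers(doc)))                  # '\nbody text\n'

# re.sub replaces every occurrence of the marker
print(re.sub(re.escape(MARKER), "[cut]", doc))
```

Note this only pulls out text between two identical comment markers. It does not try to balance nested tags, which is the harder part of the original question.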

Hope that helps!

Ryan
