Trying to find attributes in html tag thru python

Question

Hikki_Passion 0 Newbie Poster

12 Years Ago

Ok, so I am trying to write code that finds the attributes in an HTML tag by splitting the tag up into lists. Unfortunately, it's giving me a hard time because of the quotation marks.

def find_attribute_value(html_tag, att):
    '''Return the value of attribute att (a str) in the str html_tag.  
    Return None if att doesn't occur in html_tag.
    '''
    
    words = html_tag.split(" ")
#this is me trying to split up the initial html tag.

    omgs= str (words)
# then trying to convert it into a string because it wouldn't let me split it up further otherwise :(

    list2 = omgssss.split()    
# then I split it up again

    for word in list2:
        second_split = word.split("=")
#and again because I needed to separate the attributes from the equal sign.


#this is what I want to do with my code, but I can't implement it because my code looks like it was written by a 4 year old. I want to be able to recognize the attributed value and call the value. 
        if att == item in list:
            print list[att]
        print second_split

The example I was testing it on was this HTML tag:

find_attribute_value('<img align=top src="photos/horton.JPG" alt="Image of StG instructor (Diane Horton)">', "src")

I'm trying to find the att src, which should bring up "photos/horton.JPG", but obviously it doesn't.

woooee 814 Nearly a Posting Maven

12 Years Ago

For the example line you gave

def find_attribute_value(html_tag, att):
    ## is att in html?
    if att in html_tag:
        ## find the location of att
        idx = html_tag.find(att)
        ##split on the quotation marks for everything after att
        first_split = html_tag[idx:].split('"')
        print first_split[1]
    else:
        print "attribute %s Not Found" % (att)

def find_attribute_value2(html_tag, att):
    """ this is the way you were trying to do it
    """
    first_split = html_tag.split()
    for x in first_split:
        if att in x:
            second_split = x.split("=")
            fname=second_split[1].replace('"', "")
            print fname
            return

test_line = find_attribute_value('<img align=top src="photos/horton.JPG" alt="Image of StG instructor (Diane Horton)">', "src")

test_line = find_attribute_value2('<img align=top src="photos/horton.JPG" alt="Image of StG instructor (Diane Horton)">', "src")

Note that this will only find the first instance of "att". You can also split on "=" as a second split(), as long as you know there is always a space after the file name.

Hikki_Passion commented: YOU MADE IT WORK! :O Genious! +1

snippsat 661 Master Poster

12 Years Ago

Python now has strong parsers such as lxml and BeautifulSoup that make a job like this much easier.

>>> from BeautifulSoup import BeautifulSoup
>>> html = '''<img align=top src="photos/horton.JPG" alt="Image of StG instructor (Diane Horton)">'''
>>> soup = BeautifulSoup(html)
>>> tag = soup.find('img')
>>> tag['src']
u'photos/horton.JPG'

snippsat 661 Master Poster

12 Years Ago

Here is one with regex you can look at, although regex is also not an ideal way to deal with HTML.

import re

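def find_attribute_value(html, att):
    # Non-greedy match of a double-quoted value; re.escape guards special characters in att
    s = re.search(r'%s="(.*?)"' % re.escape(att), html)
    # Return None when the attribute is missing instead of raising AttributeError
    return s.group(1) if s else None

html = '''<img align=top src="photos/horton.JPG" alt="Image of StG instructor (Diane Horton)">'''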
print find_attribute_value(html, 'src')
#photos/horton.JPG

print find_attribute_value(html, 'alt')
#Image of StG instructor (Diane Horton)
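
The regex above only matches double-quoted values, so it cannot return align=top. A minimal sketch of a pattern that also accepts single-quoted and unquoted values follows. It is written for Python 3, the pattern is an illustration rather than anything posted in this thread, and it can still be fooled by text such as src=... appearing inside another attribute's quoted value, which is one reason a real parser is safer.

```python
import re

def find_attribute_value(html_tag, att):
    """Return the value of attribute att in html_tag, or None if absent."""
    # The lookbehind keeps "src" from matching inside a longer name such as
    # "data-src". The value is double-quoted, single-quoted, or a bare run
    # of characters up to the next whitespace or '>'.
    pattern = r'''(?<![\w-])%s\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))''' % re.escape(att)
    match = re.search(pattern, html_tag, re.IGNORECASE)
    if match is None:
        return None
    # Exactly one of the three groups took part in the match.
    return next(g for g in match.groups() if g is not None)

tag = '<img align=top src="photos/horton.JPG" alt="Image of StG instructor (Diane Horton)">'
print(find_attribute_value(tag, "align"))  # top
print(find_attribute_value(tag, "src"))    # photos/horton.JPG
```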

Hikki_Passion 0 Newbie Poster · Answer 1 · 2011-10-27T00:48:44+00:00

Thank you so much, your way is so elegant, not to mention it works!

However, there is one tiny problem that I still can't figure out. If the attribute's value is in quotation marks and contains spaces, is there a way to get the entire quoted value? Or would that not be possible with a small program?

An example would be alt="Image of StG instructor (Diane Horton)"
with att = alt,
where the program should spit out "Image of StG instructor (Diane Horton)"

The first function solves this problem because it splits on the quotation marks. But then you can't call it with att = align, because it gives
>>> photos/horton.JPG
instead of
>>> top

Thank you so much for your help already! And such a speedy reply<3

Hikki_Passion 0 Newbie Poster · Answer 2 · 2011-10-27T04:32:00+00:00

I know but this is just using basic python lol. Otherwise wouldn't life be easier! hahah, thanks anyways <33

Hikki_Passion 0 Newbie Poster · Answer 3 · 2011-10-27T08:21:37+00:00

Thanks, I just wanted to know if it was possible. Thank you so much for your answers guys!

woooee 814 Nearly a Posting Maven · Answer 4 · 2011-10-27T08:25:15+00:00

You did not say anything about also finding "align=top". To do that, check if the string starts with a quotation mark, in which case the first function works fine. If no quotation mark is found, then split on white space. You will have to code some of this yourself instead of giving us one task after another until the program is written for you. This forum is for helping those with code, so if you post code then we will help.
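
One way the check described above might look, as a rough Python 3 sketch using only string methods. It assumes the attribute is written exactly as name=value with no spaces around the equals sign, and that the name does not also appear inside another attribute's quoted value. Those are simplifying assumptions for illustration, not something the thread guarantees.

```python
def find_attribute_value(html_tag, att):
    """Return the value of att in html_tag, or None if att is not found."""
    # Search for " att=" so the name has to start right after a space
    idx = html_tag.find(" " + att + "=")
    if idx == -1:
        return None
    # Everything after the equals sign
    rest = html_tag[idx + len(att) + 2:]
    if rest.startswith('"'):
        # Quoted value: take the text between the first pair of quotation marks
        return rest.split('"')[1]
    # Unquoted value: it ends at the first whitespace or at the closing '>'
    parts = rest.split()
    return parts[0].rstrip(">") if parts else ""

tag = '<img align=top src="photos/horton.JPG" alt="Image of StG instructor (Diane Horton)">'
print(find_attribute_value(tag, "align"))  # top
print(find_attribute_value(tag, "alt"))    # Image of StG instructor (Diane Horton)
```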

james.lu.75491856 0 Junior Poster · Answer 5 · 2013-08-16T00:33:37+00:00

Use HTMLParser.HTMLParser

james.lu.75491856 0 Junior Poster · Answer 6 · 2013-08-16T12:13:43+00:00

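from HTMLParser import HTMLParser  # Python 2 module; in Python 3 it is html.parser

class parser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, not a dictionary
        self.attrs = dict(attrs)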
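
For completeness, here is a sketch of how the HTMLParser suggestion could be turned into a full find_attribute_value. It targets Python 3, where the module is html.parser. The class name AttributeFinder and the choice to look only at the first start tag are assumptions made for this example.

```python
from html.parser import HTMLParser

class AttributeFinder(HTMLParser):
    """Collect the attributes of the first start tag seen."""

    def __init__(self):
        super().__init__()
        self.attributes = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased
        if self.attributes is None:
            self.attributes = dict(attrs)


def find_attribute_value(html_tag, att):
    """Return the value of att in html_tag, or None if att is not present."""
    finder = AttributeFinder()
    finder.feed(html_tag)
    finder.close()
    if finder.attributes is None:
        return None
    return finder.attributes.get(att.lower())

tag = '<img align=top src="photos/horton.JPG" alt="Image of StG instructor (Diane Horton)">'
print(find_attribute_value(tag, "src"))    # photos/horton.JPG
print(find_attribute_value(tag, "align"))  # top
```

The parser handles quoted and unquoted values alike, so none of the splitting logic is needed.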
