4 Contributors · 11 Replies · 12 Views · 7 Years Discussion Span · Last Post by vegaseat

Here is a way to get the addresses of the images:

# extract the addresses of the images included in a web page
# you need the modules lxml and beautifulsoup
# (for linux, packages python-lxml and python-beautifulsoup)
# tested with python 2.6 
from lxml.html import soupparser
from urllib2 import urlopen

def gen_elements(tag, root):
    if root.tag == tag:
        yield root
    for child in root:
        for elt in gen_elements(tag, child):
            yield elt

def gen_img_src(url):
    content = urlopen(url).read()
    content = soupparser.fromstring(content)
    for elt in gen_elements("img", content):
        yield elt.attrib.get("src", None)

def main():
    url = "http://www.it.usyd.edu.au/about/people/staff/tomc.shtml"
    for src in gen_img_src(url):
        print(src)

if __name__ == "__main__":
    main()
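For readers without lxml, the same img-src scan can be done with nothing but the standard library's HTML parser. This is a rough sketch, not the poster's method: it uses Python 3's `html.parser` module (the module is named `HTMLParser` on Python 2) and feeds it a small made-up HTML string instead of a live URL:

```python
# sketch: collect img src attributes using only the standard library
# (Python 3's html.parser; on Python 2 the module is called HTMLParser)
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))

# hypothetical sample page, stands in for urlopen(url).read()
sample = '<html><body><img src="a.png"><p>text</p><img src="b.jpg"/></body></html>'
parser = ImgSrcCollector()
parser.feed(sample)
print(parser.sources)  # ['a.png', 'b.jpg']
```

To run it against a real page, you would read the page content with urllib and feed that string to the collector instead of `sample`.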
0

OK, that's an overly complex answer to a question which requires a simple one.

Right-click on the image and save... nothing more complex is required.

@Gribouillis, bear in mind that if the OP is asking how to source the original image, it makes me wonder whether he is even the owner of said image!


It doesn't work. When I run the code, it gave:
Traceback (most recent call last):
File "C:/Users/ALEXIS/Desktop/extactphoto.py", line 5, in <module>
from lxml.html import soupparser
ImportError: No module named lxml.html


> It doesn't work. When I run the code, it gave:
> Traceback (most recent call last):
> File "C:/Users/ALEXIS/Desktop/extactphoto.py", line 5, in <module>
> from lxml.html import soupparser
> ImportError: No module named lxml.html

As I said, you need the lxml module.
@kaninelupus: there are different ways to understand the question. I don't think my solution is complex.

0

Go here and download "Beautiful Soup version 3.1.0.1". This is a compressed .tar.gz file; to uncompress it on Windows you'll need 7zip from here. Right-click on BeautifulSoup.tar.gz and uncompress once, which should give you a file with the suffix .tar; uncompress a second time and you should get a folder BeautifulSoup. Then copy BeautifulSoup.py to the site-packages directory in your Python lib. It should work then.
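As a side note, assuming pip is available on your system, it can usually replace the manual unpacking; the package name here is the Python 2 era BeautifulSoup 3:

```shell
# install lxml and the old BeautifulSoup 3 package with pip
# (assumes pip is on PATH; on modern Python you would install beautifulsoup4 instead)
pip install lxml BeautifulSoup
```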


thx... how about BeautifulSoup, how do I install it?

http://www.crummy.com/software/BeautifulSoup/

Somewhat simple-minded, but you could try this and go from there ...

# retrieve the html code of a given website
# and check for potential image sources
# tested with Python 2.5.4

import urllib2

def extract(text, sub1, sub2):
    """
    extract a substring from text between the first
    occurrences of substrings sub1 and sub2
    """
    return text.split(sub1, 1)[-1].split(sub2, 1)[0]


url_str = 'http://www.it.usyd.edu.au/about/people/staff/tomc.shtml'
fin = urllib2.urlopen(url_str)
html = fin.read()
fin.close()

#print(html)  # test
html = html.lower()

# keep slicing the html forward past each image tag found
while '<img src=' in html:
    s = extract(html, '<img src=', '/>')
    if not s:
        break
    print s
    # slice to potential next image
    pos = html.find(s) + len(s)
    html = html[pos:]
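The extract() helper above is just two splits; a quick standalone check of how it behaves, using Python 3 print syntax and a made-up sample string (note that it keeps the quotes and any trailing space around the src value):

```python
# quick check of the two-split extract() idea on a made-up string
def extract(text, sub1, sub2):
    """extract the substring of text between the first
    occurrences of substrings sub1 and sub2"""
    return text.split(sub1, 1)[-1].split(sub2, 1)[0]

html = '<p>hi</p><img src="pic.png" /><p>bye</p>'
# everything between the first '<img src=' and the first '/>' after it
print(extract(html, '<img src=', '/>'))  # prints "pic.png" (quotes and trailing space included)
```

Stripping the result with s.strip(' "\'') would give the bare address.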

