How can I extract the photo from this website: http://www.it.usyd.edu.au/about/people/staff/tomc.shtml ?


Here is a way to get the addresses of the images:

# extract the addresses of the images included in a web page
# you need the modules lxml and beautifulsoup
# (for linux, packages python-lxml and python-beautifulsoup)
# tested with python 2.6 
from lxml.html import soupparser
from urllib2 import urlopen

def gen_elements(tag, root):
    if root.tag == tag:
        yield root
    for child in root:
        for elt in gen_elements(tag, child):
            yield elt

def gen_img_src(url):
    content = urlopen(url).read()
    content = soupparser.fromstring(content)
    for elt in gen_elements("img", content):
        yield elt.attrib.get("src", None)

def main():
    url = "http://www.it.usyd.edu.au/about/people/staff/tomc.shtml"
    for src in gen_img_src(url):
        print(src)

if __name__ == "__main__":
    main()
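The generator above only prints the `src` attributes, which on pages like this one are usually relative paths; to actually save the photo you would still need to resolve them against the page URL and download the result. A minimal Python 3 sketch of that resolution step (`urllib2` was split into `urllib.request` and `urllib.parse` in Python 3; the `tomc.jpg` path below is made up for illustration):

```python
# resolve a (possibly relative) img src against the page URL
from urllib.parse import urljoin

page_url = "http://www.it.usyd.edu.au/about/people/staff/tomc.shtml"

def absolute_src(page_url, src):
    """Return the full URL for an img src found on page_url."""
    return urljoin(page_url, src)

# a relative path is resolved against the page's directory
print(absolute_src(page_url, "../../images/tomc.jpg"))
# an already absolute src passes through unchanged
print(absolute_src(page_url, "http://example.com/a.png"))
```

Once you have the absolute URL, `urllib.request.urlretrieve(full_url, "photo.jpg")` will save it to disk.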

OK - that's an overly complex answer for a question which requires a simple answer.

Right-click on the image and save it... nothing more complex required.

@Gribouillis - bear in mind that if the OP is asking how to source the original image, it makes me wonder whether he is even the owner of said image!

It doesn't work. When I run the code, it gives:
Traceback (most recent call last):
File "C:/Users/ALEXIS/Desktop/extactphoto.py", line 5, in <module>
from lxml.html import soupparser
ImportError: No module named lxml.html


As I said, you need the lxml module.
@kaninelupus: there are different ways to understand the question. I don't think my solution is complex.
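For what it's worth, if installing lxml turns out to be a hurdle, the standard library can pull out the img src attributes on its own. A Python 3 sketch using `html.parser` (in Python 2 the module was called `HTMLParser`; the sample HTML here is made up for illustration):

```python
# collect img src attributes from HTML using only the standard library
from html.parser import HTMLParser

class ImgSrcParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.srcs.append(value)

parser = ImgSrcParser()
parser.feed('<p>x</p><img src="a.jpg"><div><img src="b.png" alt="y"></div>')
print(parser.srcs)  # -> ['a.jpg', 'b.png']
```

In a real run you would `feed()` it the page content read from `urlopen(url)` instead of the literal string.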

How do I get the lxml module?

HERE

It seems complex and hard to install...

What's your OS? For Windows, all you have to do is download the MS Windows installer for your version of Python and your processor (32- or 64-bit) from http://pypi.python.org/pypi/lxml/2.2.2 and run the installer.

Thanks... how about BeautifulSoup, how do I install it?

Go here and download "Beautiful Soup version 3.1.0.1". This is a compressed .tar.gz file. To uncompress it on Windows, you'll need 7zip from here. Right-click on BeautifulSoup.tar.gz and uncompress it once, which should give you a file with the suffix .tar; uncompress a second time and you should get a folder BeautifulSoup. Then copy BeautifulSoup.py to the site-packages directory in your Python lib. It should work then.

http://www.crummy.com/software/BeautifulSoup/

Somewhat simple minded, but you could try this and go from there ...

# retrieve the html code of a given website
# and check for potential image sources
# tested with Python 2.5.4

import urllib2

def extract(text, sub1, sub2):
    """
    extract a substring from text between the first
    occurrences of substrings sub1 and sub2
    """
    return text.split(sub1, 1)[-1].split(sub2, 1)[0]


url_str = 'http://www.it.usyd.edu.au/about/people/staff/tomc.shtml'
fin = urllib2.urlopen(url_str)
html = fin.read()
fin.close()
  
#print(html)  # test
# lowercase so the '<img src=' search is case-insensitive
# (note: this also lowercases the URLs themselves)
html = html.lower()

while True:
    # stop once there are no more image tags left
    if '<img src=' not in html:
        break
    s = extract(html, '<img src=', '/>')
    if not s:
        break
    print s
    # slice past this match to look for the next image
    pos = html.find(s) + len(s)
    html = html[pos:]
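To see exactly what that split-based `extract` returns, here's a self-contained run on a small made-up HTML string (written for Python 3, so `print` is a function):

```python
# demonstrate extract() on a sample string: it returns whatever sits
# between the first '<img src=' and the first following '/>'
def extract(text, sub1, sub2):
    """extract a substring from text between the first
    occurrences of substrings sub1 and sub2"""
    return text.split(sub1, 1)[-1].split(sub2, 1)[0]

sample = '<p>hi</p><img src="/pics/a.jpg" /><img src="/pics/b.jpg" />'
print(extract(sample, '<img src=', '/>'))  # quotes and trailing space included
```

Note that the result still carries the surrounding quotes and whitespace, so you'd strip those off before using the path.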