Hey everyone! I've been teaching myself Python, and as an exercise I wrote an image grabber for OneManga.com. You give it the path to the comic page you want to start from, and it grabs every page from there to the end of the comic.

The code for it is below:

import urllib
from xml.dom import minidom
import os

#Get directory to save comics in
savePath = ''
while not os.path.exists(savePath):
    savePath = raw_input('Save in:')
stillGrabbing = True

#Get initial comic path
comicPath = raw_input('Path to first page:')
print '\nBeginning grab...'
while stillGrabbing:
    #Create page URL, get HTML data and create XML object
    nextURL = 'http://www.onemanga.com%s' % comicPath
    pageHTML = urllib.urlopen(nextURL)
    pageDoc = minidom.parse(pageHTML)

    #Search div elements for the comic
    divElements = pageDoc.getElementsByTagName('div')
    foundImage = 0
    for divTag in divElements:
        try:
            if divTag.attributes['class'].value == 'one-page':
                print '\nGrabbing comic from %s'% nextURL

                #Get image URL, split current comic path into name, chapter and page
                imageURL = divTag.getElementsByTagName('img')[0].attributes['src'].value
                foundImage = 1
                [a, comicNameJoined, comicChapter, comicPage, b] = comicPath.split('/')
                comicName = ' '.join(comicNameJoined.split('_'))

                #Create directory if needed, and download image
                chapterDir = '%s/%s/Chapter %s' % (savePath, comicName, comicChapter)
                if not os.path.exists(chapterDir):
                    #makedirs also creates the comic-name directory if it is missing
                    os.makedirs(chapterDir)
                urllib.urlretrieve(imageURL, '%s/%s.jpg' % (chapterDir, comicPage))
                print
                #Get new comic path
                comicPath = divTag.getElementsByTagName('a')[0].attributes['href'].value
                break
        except KeyError:
            #Ignore div tags with no class attribute
            pass
    if not foundImage:
        print '\nFinished grab...'
        stillGrabbing = False

I have trialled this on my localhost web server and it works fine. The problem is, whenever I run it on pages from OneManga.com, I get the following error:

Traceback (most recent call last):
  File "<string>", line 244, in run_nodebug
  File "G:\Comic Webpages\comicRipper.py", line 22, in <module>
    pageDoc = minidom.parse(pageHTML)
  File "G:\Program Files\PortablePython_1.1_py2.5.4\App\lib\xml\dom\minidom.py", line 1915, in parse
    return expatbuilder.parse(file)
  File "G:\Program Files\PortablePython_1.1_py2.5.4\App\lib\xml\dom\expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "G:\Program Files\PortablePython_1.1_py2.5.4\App\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: mismatched tag: line 32, column 2

I *think* this means that either the XML parser is misinterpreting the tags, or the webpage has mismatched tags that web browsers like Firefox tolerate, since Firefox displays the page correctly.
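
For what it's worth, the second explanation is easy to reproduce: minidom hands the document to expat, a strict XML parser, and a lot of real-world HTML is not well-formed XML. Here is a minimal sketch in Python 3 syntax. The `SAMPLE_HTML` fragment and the `try_xml_parse` helper are made up for illustration; the unclosed `<br>` is only an assumption about the kind of markup involved, not the actual line 32 of the OneManga page.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical fragment: fine for a browser, but not well-formed XML,
# because <br> is never closed before </p>.
SAMPLE_HTML = '<html><body><p>hello<br></p></body></html>'


def try_xml_parse(markup):
    """Return None if markup parses as XML, else the parser's error message."""
    try:
        minidom.parseString(markup)
    except ExpatError as err:
        return str(err)
    return None


if __name__ == '__main__':
    # Prints a "mismatched tag" message, the same kind of error as in the traceback.
    print(try_xml_parse(SAMPLE_HTML))
```

A browser silently repairs markup like this, which would explain why Firefox shows the page without complaint.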

My question is: is there a way of getting round this? Or is there another way of grabbing elements from HTML? All I need is the ability to get the following element from the page (actual element shown):

<div class="one-page">
    <a href="/Fairy_Tail/135/19/">
        <img class="manga-page" src="http://image.onemanga.com/010/mangas/00000022/000180942/18.jpg" alt="Loading... image010" />
    </a>
</div>
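
One possible way to pull out just that element without a strict XML parser is an event-based HTML parser that does not require matching tags. The sketch below uses `html.parser.HTMLParser` from the Python 3 standard library. The class name `OnePageParser`, its attribute names, and the sloppy markup wrapped around the element in `SAMPLE` are assumptions for illustration, not part of the original script.

```python
from html.parser import HTMLParser


class OnePageParser(HTMLParser):
    """Collect the link and image found inside <div class="one-page">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0      # > 0 while inside the one-page div
        self.next_path = None   # href of the first <a> in that div
        self.image_url = None   # src of the first <img> in that div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            if self.div_depth:
                # Nested div inside the one-page div: track its depth.
                self.div_depth += 1
            elif attrs.get('class') == 'one-page':
                self.div_depth = 1
        elif self.div_depth:
            if tag == 'a' and self.next_path is None:
                self.next_path = attrs.get('href')
            elif tag == 'img' and self.image_url is None:
                self.image_url = attrs.get('src')

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_depth:
            self.div_depth -= 1


# Hypothetical page: the element from the post, surrounded by markup
# that a strict XML parser would reject (unclosed <br>).
SAMPLE = '''<html><body><p>sloppy markup<br></p>
<div class="one-page">
    <a href="/Fairy_Tail/135/19/">
        <img class="manga-page" src="http://image.onemanga.com/010/mangas/00000022/000180942/18.jpg" alt="Loading... image010" />
    </a>
</div>
</body></html>'''

if __name__ == '__main__':
    parser = OnePageParser()
    parser.feed(SAMPLE)
    print(parser.next_path)
    print(parser.image_url)
```

The parser never checks that tags are balanced, so the unclosed `<br>` is simply ignored rather than raising an error.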

Thanks!
(Hopefully this post isn't too long!)

EDIT: Also, comicPath is set to something like /Fairy_Tail/135/19/.
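
To make the shape of that path concrete, here is a small sketch (Python 3 syntax) of how it breaks down into name, chapter and page. The helper name `split_comic_path` is made up for illustration; the script itself does the same job with `split('/')` and two throwaway variables.

```python
def split_comic_path(comic_path):
    """Split '/Name_With_Underscores/chapter/page/' into its three parts."""
    # strip('/') drops the leading and trailing slashes, so split('/')
    # yields exactly three fields for a path shaped like the example.
    name_joined, chapter, page = comic_path.strip('/').split('/')
    return name_joined.replace('_', ' '), chapter, page


if __name__ == '__main__':
    print(split_comic_path('/Fairy_Tail/135/19/'))  # ('Fairy Tail', '135', '19')
```

A path with a different number of segments would raise a ValueError at the unpacking step, just as the five-name unpacking in the script would.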

Thank you so much! BeautifulSoup was basically a direct substitution, and it simplified the code by removing the for loop. The program works great now!
