XML parsing

Question

1337455 10534 0 Light Poster

16 Years Ago

I need to convert .docx files to .doc (on Linux and Windows).
I'm planning to use the zipfile module to access all of the internal XML documents.
Then I'll open /word/document.xml and parse it so that it reads all of the text in the tags, places the text strings in a list, and then prints the list.
Very simple stuff, except how do you actually parse an XML file?
Using:

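from os import name, getcwd
cwd = getcwd()
# pick the path separator for the current platform
if name != 'nt': dirType = '/'
else: dirType = '\\'
xml = open('%s%sword%sdocument.xml' % (cwd, dirType, dirType))
text = xml.read()
xml.close()
line = 0
size = len(text)  # length of the text that was read, not of the file object
while line != size:
    char = text[line]  # walk the text one character at a time
    line = (line+1)
    print repr(char)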

is a pain. Does it even work??
So how do you parse an XML file?

python

bgeddy 0 Newbie Poster

16 Years Ago

Have you seen "Dive Into Python" by Mark Pilgrim? It's a fine resource and has an excellent section on XML processing. It's available online too.
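As a rough idea of what DOM-style parsing looks like, here is a minimal sketch using the standard library's xml.dom.minidom. The sample string and the text_runs helper are made up for illustration, and the sample is far simpler than a real word/document.xml; only the w:t element name and the WordprocessingML namespace URI come from the .docx format.

```python
from xml.dom import minidom

# Hypothetical, heavily simplified stand-in for word/document.xml.
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body>'
    '<w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
    '<w:p><w:r><w:t xml:space="preserve"> world </w:t></w:r></w:p>'
    '</w:body>'
    '</w:document>'
)

def text_runs(xml_string):
    """Return the text of every <w:t> element, in document order."""
    dom = minidom.parseString(xml_string)
    runs = []
    for node in dom.getElementsByTagName('w:t'):
        # Join the text-node children; an empty <w:t/> yields ''.
        runs.append(''.join(child.data for child in node.childNodes
                            if child.nodeType == child.TEXT_NODE))
    return runs

print(text_runs(SAMPLE))
```

The list that comes back is the "all of the text strings in a list" the question asks for, with no manual scanning for angle brackets.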

jrcagle 77 Practically a Master Poster

16 Years Ago

Yeah, if you think XML is a pain, wait until you try to create the OLE file required for the .doc format. :lol: 300+ pages of documentation, and I gave up in disgust.

Jeff

abhi166 0 Newbie Poster · Answer 1 · 2008-03-17T22:37:54+00:00

I found xml2obj easy.
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/149368

1337455 10534 0 Light Poster · Answer 2 · 2008-03-19T05:18:51+00:00

Thank you all for great replies!
Dive Into Python looks really good, I'll read it. Thanks, again.

1337455 10534 0 Light Poster · Answer 3 · 2008-03-20T07:39:42+00:00

Using Dive Into Python, I came up with this.
I decided to parse the file myself; it's easier this way (to me).

import os, zipfile
print """DiXTiXa.py; GPLv3 Python DOCX to TXT Assembler
Converts MS Office Word 2007 files into text files quickly.
"""
fp = raw_input('Enter in the file location (including C:\ or /home): ')
cwd = os.getcwd()
if os.name != 'nt': dir = '/'
else: dir = '\\'
print "\nDirectory type: %s or %s\n" % (os.name, dir)
try:
	xml = zipfile.ZipFile("%s" % fp,'r')
	docx = xml.read('word%sdocument.xml' % dir)
	xml.close()
except IOError:
	print """Bad file or file location.
	Please do not type any ' or " marks in the filepath!
	You entered: %s
	Please make sure this file exists.
Will now exit.""" % fp
	raise SystemExit(1)
docx = str(repr(docx))
end = len(docx)
line = 0
out = open('%s%sConverted.txt' % (cwd,dir),'a')
print 'The converted file is %s%sConverted.txt' % (cwd,dir)
def stripCrap(newline):
	new = newline.lstrip()
	return new.rstrip()
def openTag(beg, type):
	loc = docx.find('</w:t>',beg,end)
	if type == 1:
		newline = str(docx[beg+5:loc])
		out.write('%s\n' % stripCrap(newline))
		print newline
	elif type == 2:
		newline = str(docx[beg+26:loc])
		out.write('%s\n' % stripCrap(newline))
		print newline
	print "   Starting location: ",beg, "   Ending location: ", loc,"   Type: ", type, "  "	
while line < end:
	if docx[line:line+5] == '<w:t>': openTag(line,1)
	else: pass
	if docx[line:line+26] == '<w:t xml:space="preserve">': openTag(line,2)
	else: pass
	line = (line+1)
out.close()
print "\nAll Done."

It works great on a file I tried.
Windows and Posix support!
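For comparison, a minimal sketch of the same extraction done with the standard library's xml.etree.ElementTree instead of string scanning. The extract_lines name and the sample input are hypothetical; the namespace URI is the one the w: prefix is bound to in .docx files. A parser matches w:t whether or not it carries the xml:space attribute, so the two tag "types" collapse into one case, and entities such as &amp; are decoded along the way.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace that the w: prefix is bound to in .docx files.
W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

def extract_lines(document_xml):
    """Return one stripped string per <w:t> element in document_xml."""
    root = ET.fromstring(document_xml)
    lines = []
    # iter() walks the whole tree; the {namespace}tag form matches w:t
    # regardless of any attributes on the element.
    for elem in root.iter('{%s}t' % W_NS):
        lines.append((elem.text or '').strip())
    return lines

# Hypothetical sample input, much smaller than a real document.xml.
sample = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>First line</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> second &amp; third </w:t></w:r></w:p>'
    '</w:body></w:document>' % W_NS
)
print(extract_lines(sample))
```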

1337455 10534 0 Light Poster · Answer 4 · 2008-03-23T03:38:25+00:00

Hmm...
Not quite Windows yet.

I found the bug but can't kill it.
Does anybody know how to use zipfile.ZipFile to access a file within a directory in a ZIP?

Edit: this works great on Linux. I'd post the converted text file, but that would be a waste of space.
The problem is probably related to 'directory types' (what I call them).
Linux uses '/', e.g. '/home/user/Desktop' or '/usr/bin/'.
Windows uses '\', e.g. 'C:\Program Files' or 'C:\Documents and Settings\user\Desktop',
but Python treats '\' as an escape character in string literals.
I know you're supposed to use '\\' for Windows, but I've tried that and it's not working.

Any method to access the given .docx's word/document.xml file is welcome, even a full rewrite (well, it's less than 50 lines...).
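One likely culprit: names inside a ZIP archive are written with '/' on every platform, so the member is 'word/document.xml' on Windows as well, and the OS separator only matters for the path to the .docx file itself. A minimal sketch, using a made-up in-memory archive in place of a real .docx:

```python
import io
import zipfile

# Build a tiny in-memory archive standing in for a .docx file
# (hypothetical content; a real .docx holds many more members).
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('word/document.xml', '<w:t>Hello</w:t>')

with zipfile.ZipFile(buf, 'r') as zf:
    names = zf.namelist()
    # Member names use '/' regardless of the OS, so no os.name check is needed.
    data = zf.read('word/document.xml')
    # Looking the member up with a backslash fails with KeyError.
    try:
        zf.read('word\\document.xml')
        backslash_ok = True
    except KeyError:
        backslash_ok = False

print(names)
print(data)
print(backslash_ok)
```

Note that the failed lookup raises KeyError rather than IOError, so an `except IOError` around the read would not catch it.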

1337455 10534 0 Light Poster · Answer 5 · 2008-03-25T02:36:10+00:00

Ok, I fixed it :)!
It works great!
I decided to give up on converting it to DOC ;) but it converts to text files excellently.
I'll update the version I posted earlier after I get back on Linux to test the new one there.
