I need to convert .docx files to .doc (on Linux and Windows).
I'm planning to use the zipfile module to access all of the internal XML documents.
Then I'll open /word/document.xml and parse it so that it reads all of the text in the tags, places the text strings in a list, and then prints the list.
Very simple stuff, except how do you actually parse an XML file?

from os import name, getcwd
cwd = getcwd()
if name != 'nt': dirType = '/'
else: dirType = '\\'
xml = open('%s%sword%sdocument.xml' % (cwd, dirType, dirType))
# readlines() gives a list of lines; the file object itself has no len()
lines = xml.readlines()
xml.close()
line = 0
size = len(lines)
while line != size:
    text = lines[line]
    line = (line+1)
    print(repr(text))

Doing it this way is a pain. Does it even work?
So how do you parse an XML file?

Have you seen "Dive Into Python" by Mark Pilgrim? It's a fine resource and has an excellent section on XML processing. It's available online too.
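
For a taste of what that section covers, here is a minimal sketch using xml.dom.minidom from the standard library. The sample XML string and the helper name text_runs are made up for illustration. A real document.xml is far larger, but its <w:t> elements should be found the same way.

```python
from xml.dom import minidom

# Made-up sample shaped like a tiny word/document.xml
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p>'
    '<w:r><w:t>Hello</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> world</w:t></w:r>'
    '</w:p></w:body></w:document>'
)

def text_runs(xml_string):
    """Return the text inside every <w:t> element, in document order."""
    dom = minidom.parseString(xml_string)
    runs = []
    for node in dom.getElementsByTagName('w:t'):
        # Join the text-node children of this element
        runs.append(''.join(child.data for child in node.childNodes
                            if child.nodeType == child.TEXT_NODE))
    return runs

print(text_runs(SAMPLE))
```

The parser takes care of the attributes, so <w:t> and <w:t xml:space="preserve"> do not need separate handling.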

Yeah, if you think XML is a pain, wait until you try to create the OLE file required for the .doc format. :lol: 300+ pages of documentation, and I gave up in disgust.


Thank you all for great replies!
Dive Into Python looks really good, I'll read it. Thanks, again.

Using Dive Into Python, I came up with this.
I decided to parse the file myself; it's easier this way (to me).

import os, zipfile
print """DiXTiXa.py; GPLv3 Python DOCX to TXT Assembler
Converts MS Office Word 2007 files into text files quickly."""
fp = raw_input('Enter in the file location (including C:\\ or /home): ')
cwd = os.getcwd()
if os.name != 'nt': dir = '/'
else: dir = '\\'
print "\nDirectory type: %s or %s\n" % (os.name, dir)
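try:
	xml = zipfile.ZipFile("%s" % fp,'r')
	# Names inside a ZIP archive always use '/', even on Windows
	docx = xml.read('word/document.xml')
except (IOError, zipfile.BadZipfile, KeyError):
	print """Bad file or file location.
	Please do not type any ' or " marks in the filepath!
	You entered: %s
	Please make sure this file exists.
Will now exit.""" % fp
	# Stop here; docx was never assigned, so the rest cannot run
	raise SystemExit(1)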
docx = str(repr(docx))
end = len(docx)
line = 0
out = open('%s%sConverted.txt' % (cwd,dir),'a')
print 'The converted file is %s%sConverted.txt' % (cwd,dir)
def stripCrap(newline):
	# strip() removes both leading and trailing whitespace
	return newline.strip()
def openTag(beg, type):
	loc = docx.find('</w:t>',beg,end)
	if type == 1:
		newline = str(docx[beg+5:loc])
		out.write('%s\n' % stripCrap(newline))
		print newline
	elif type == 2:
		newline = str(docx[beg+26:loc])
		out.write('%s\n' % stripCrap(newline))
		print newline
	print "   Starting location: ",beg, "   Ending location: ", loc,"   Type: ", type, "  "	
while line < end:
	if docx[line:line+5] == '<w:t>': openTag(line,1)
	elif docx[line:line+26] == '<w:t xml:space="preserve">': openTag(line,2)
	line = (line+1)
# Close the output file so everything is flushed to disk
out.close()
print "\nAll Done."

It works great on a file I tried.
Windows and Posix support!

Not quite working on Windows yet...

I found the bug but can't kill it.
Does anybody know how to use zipfile.ZipFile to access a file within a directory in a ZIP?

Edit: this works great on Linux. I'd post the converted text file, but that would be a waste of space.
The problem is probably related to 'directory types' (my name for the path separator).
Linux uses '/', e.g. '/home/user/Desktop' or '/usr/bin/'.
Windows uses '\', e.g. 'C:\Program Files' or 'C:\Documents and Settings\user\Desktop'.
But Python treats '\' as the escape character in string literals.
I know you're supposed to use '\\' for Windows, but I've tried that and it's not working.
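
Writing '\\' (or a raw string such as r'C:\Program Files') is the right way to spell a backslash for a path on disk. Names inside a ZIP archive are a separate case: the format stores them with '/' on every operating system. A small sketch of the two cases, where the names out_path and member are only examples:

```python
import os
import posixpath

# Filesystem paths: let os.path.join pick the separator for this platform
out_path = os.path.join(os.getcwd(), 'Converted.txt')

# ZIP member names: always '/', so build them with posixpath
# (or simply write the string 'word/document.xml')
member = posixpath.join('word', 'document.xml')

print(out_path)
print(member)
```

With os.path.join there is no need to test os.name and carry a separator variable around.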

Any method to access the given .docx's word/document.xml file is welcome, even a full rewrite (well, it's less than 50 lines...).
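
One possible approach, as a rough sketch rather than a drop-in replacement: read the member straight out of the archive with zipfile and hand the bytes to xml.etree.ElementTree. The function name docx_text is made up here, and the sketch assumes the file is a normal Word 2007 document with a word/document.xml member.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:t>
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def docx_text(path):
    """Return a list with the text of every <w:t> element in a .docx file."""
    archive = zipfile.ZipFile(path, 'r')
    try:
        # The member name is written with '/', on Windows too
        data = archive.read('word/document.xml')
    finally:
        archive.close()
    root = ET.fromstring(data)
    # iter() walks the whole tree; <w:t> matches with or without attributes
    return [node.text or '' for node in root.iter(W_NS + 't')]
```

Calling docx_text with the path to a .docx file returns the strings in document order, ready to be printed or written to a text file one per line.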

Ok, I fixed it :)!
It works great!
I decided to give up on converting it to DOC ;) but it converts to text files excellently.
I'll update the version I posted earlier after I get back on Linux to test the new one there.