Views: 762 | Replies: 7
Join Date: Oct 2007 | Posts: 41 | Rep Power: 1 | Solved Threads: 5
I need to convert .docx files to .doc (on Linux and Windows).
I'm planning to use the zipfile module to access all of the internal XML documents.
Then I'll read /word/document.xml, and I need to parse it so that it reads all of the text in the tags, places the text strings in a list, and then prints the list.
Very simple stuff, except how do you actually parse an XML file?
Using:
    from os import name, getcwd
    cwd = getcwd()
    # Pick the path separator for the current OS.
    if name != 'nt': dirType = '/'
    else: dirType = '\\'
    xml = open('%s%sword%sdocument.xml' % (cwd, dirType, dirType))
    text = xml.read()      # the whole file as one string
    xml.close()
    line = 0
    size = len(text)       # a file object has no len(), so measure the string
    while line != size:
        char = text[line]  # index the string, not the file object
        line = (line+1)
        print(repr(char))
is a pain. Does it even work??
So how do you parse an XML file?
"And da wind cry moron." ~ Pearls Before Swine
Join Date: Mar 2008 | Posts: 1 | Rep Power: 0 | Solved Threads: 0
I found xml2obj easy.
http://aspn.activestate.com/ASPN/Coo.../Recipe/149368
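For comparison, here is a minimal sketch of the plan from the first post (open the archive, read word/document.xml, collect the text strings in a list, print the list) using only the standard library's `zipfile` and `xml.etree.ElementTree`. It is not the xml2obj recipe linked above, and it is written for Python 3. The names `docx_text_runs` and `make_sample_docx` are made up for illustration, and the sample document is a tiny in-memory stand-in, not a real Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace bound to the w: prefix in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text_runs(docx_file):
    """Return the text of every <w:t> element in a .docx, in document order."""
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    return [node.text or "" for node in root.iter("{%s}t" % W_NS)]


def make_sample_docx():
    """Build a tiny in-memory stand-in for a .docx (hypothetical test data)."""
    body = (
        '<w:document xmlns:w="%s"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
        '<w:p><w:r><w:t xml:space="preserve"> world </w:t></w:r></w:p>'
        "</w:body></w:document>" % W_NS
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(docx_text_runs(make_sample_docx()))
```

With a real file you would pass its path instead of the sample, e.g. `docx_text_runs("report.docx")` (file name assumed).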
Join Date: Oct 2007 | Posts: 41 | Rep Power: 1 | Solved Threads: 5
Using Dive Into Python, I came up with this.
I decided to parse the file myself; it's easier this way (to me).
    import os, zipfile

    print """DiXTiXa.py; GPLv3
    Python DOCX to TXT Assembler
    Converts MS Office Word 2007 files into text files quickly.
    """

    fp = raw_input('Enter in the file location (including C:\\ or /home): ')
    cwd = os.getcwd()
    # Pick the path separator for the current OS.
    if os.name != 'nt': dir = '/'
    else: dir = '\\'
    print "\nDirectory type: %s or %s\n" % (os.name, dir)

    try:
        # A .docx is a ZIP archive; the member name is built with the OS separator.
        xml = zipfile.ZipFile("%s" % fp,'r')
        docx = xml.read('word%sdocument.xml' % dir)
        xml.close()
    except IOError:
        print """Bad file or file location.
    Please do not type any ' or " marks in the filepath!
    You entered: %s
    Please make sure this file exists.
    Will now exit.""" % fp
        raise SystemExit  # a bare 'exit' is never called and does nothing

    docx = str(repr(docx))
    end = len(docx)
    line = 0
    out = open('%s%sConverted.txt' % (cwd,dir),'a')
    print 'The converted file is %s%sConverted.txt' % (cwd,dir)

    def stripCrap(newline):
        # Remove leading and trailing whitespace.
        new = newline.lstrip()
        return new.rstrip()

    def openTag(beg, type):
        # Write the text between the opening tag at beg and the next </w:t>.
        loc = docx.find('</w:t>',beg,end)
        if type == 1:    # plain <w:t>, 5 characters long
            newline = str(docx[beg+5:loc])
            out.write('%s\n' % stripCrap(newline))
            print newline
        elif type == 2:  # <w:t xml:space="preserve">, 26 characters long
            newline = str(docx[beg+26:loc])
            out.write('%s\n' % stripCrap(newline))
            print newline
        print " Starting location: ",beg, " Ending location: ", loc," Type: ", type, " "

    # Walk the document one character at a time, looking for opening tags.
    while line < end:
        if docx[line:line+5] == '<w:t>': openTag(line,1)
        else: pass
        if docx[line:line+26] == '<w:t xml:space="preserve">': openTag(line,2)
        else: pass
        line = (line+1)

    out.close()
    print "\nAll Done."
It works great on a file I tried.
Windows and Posix support!
Last edited by 1337455 10534 : Mar 19th, 2008 at 9:41 pm.
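The character-by-character scan in the script above can also be sketched with a regular expression, which is a different technique from the hand-written loop but looks for the same two opening tags. This is only an illustration for Python 3: `extract_text_runs` and `W_T_PATTERN` are made-up names, the pattern assumes the `w:` prefix is used literally as in the script, and XML entities such as `&amp;` are left undecoded, just as they are in the original.

```python
import re

# Matches <w:t> or <w:t ...attributes...> up to the next </w:t>.
# The lookbehind rejects self-closing forms such as <w:t xml:space="preserve"/>,
# and the pattern does not match other elements like <w:tab/> or <w:tbl>.
W_T_PATTERN = re.compile(r"<w:t(?:\s[^>]*)?(?<!/)>(.*?)</w:t>", re.DOTALL)


def extract_text_runs(document_xml):
    """Return the stripped contents of every <w:t> element in the XML string."""
    return [match.strip() for match in W_T_PATTERN.findall(document_xml)]


if __name__ == "__main__":
    sample = (
        "<w:p><w:r><w:tab/><w:t>Hello</w:t></w:r>"
        '<w:r><w:t xml:space="preserve"> world </w:t></w:r></w:p>'
    )
    for run in extract_text_runs(sample):
        print(run)
```

Stripping each run mirrors what `stripCrap` does before writing to the output file.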
Join Date: Oct 2007 | Posts: 41 | Rep Power: 1 | Solved Threads: 5
Hmmm...
Not quite Windows yet.
I found the bug but can't kill it.
Does anybody know how to use zipfile.ZipFile to access a file within a directory in a ZIP?
Edit: this works great on Linux; I'd post the converted text file, but that would be a waste of space.
The problem is probably related to 'directory types' (what I call them).
Linux uses '/', e.g. '/home/user/Desktop' or '/usr/bin/'.
Windows uses '\', e.g. 'C:\Program Files' or 'C:\Documents and Settings\user\Desktop',
but Python treats '\' as an escape character in string literals.
I know you're supposed to use '\\' for Windows, but I've tried that and it's not working.
Any method to access the given .docx's word/document.xml file is welcome, even a full rewrite (well, it's less than 50 lines...).
Last edited by 1337455 10534 : Mar 22nd, 2008 at 5:44 pm.
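On the zipfile question: names inside a ZIP archive use '/' as the separator regardless of the operating system, so the member name should not be built from the OS-specific separator. Only paths on the real filesystem need that, and `os.path.join` takes care of it. A minimal Python 3 sketch, where `read_zip_member` and `output_path` are hypothetical helper names and the archive is an in-memory stand-in for a real .docx:

```python
import io
import os
import zipfile


def read_zip_member(archive_file, member="word/document.xml"):
    """Read one member from a ZIP archive.

    Member names always use '/', so the same string works on
    Windows and Linux; os.sep is not involved here.
    """
    with zipfile.ZipFile(archive_file) as archive:
        return archive.read(member)


def output_path(directory, filename="Converted.txt"):
    """Build a filesystem path with the separator of the current OS."""
    return os.path.join(directory, filename)


if __name__ == "__main__":
    # Hypothetical in-memory archive standing in for a real .docx file.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<doc/>")
    buffer.seek(0)
    print(read_zip_member(buffer))
    print(output_path(os.getcwd()))
    # A raw string and a doubled backslash spell the same Windows path.
    print(r"C:\Program Files" == "C:\\Program Files")
```

In the script above, that would mean reading 'word/document.xml' with a fixed '/' on every platform and keeping the separator logic only for the Converted.txt path.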
but it converts to text files excellently.