DaniWeb IT Discussion Community
1337455 10534

XML parsing

  #1  
Mar 16th, 2008
I need to convert .docx files to .doc (on Linux and Windows).
I'm planning to use the zipfile module to access all of the internal XML documents.
Then I'll open /word/document.xml and parse it to read all of the text in the tags, place the text strings in a list, and then print the list.
Very simple stuff, except how do you actually parse an XML file?
Using:
from os import name, getcwd
cwd = getcwd()
# Pick the path separator for the current platform.
if name != 'nt': dirType = '/'
else: dirType = '\\'
xml = open('%s%sword%sdocument.xml' % (cwd, dirType, dirType))
# A file object has no len() or indexing, so read it into a list of lines first.
text = xml.read().splitlines()
xml.close()
line = 0
size = len(text)
while line != size:
    print repr(text[line])
    line = (line+1)
is a pain. Does it even work??
So how do you parse an XML file?
"And da wind cry moron." ~ Pearls Before Swine
bgeddy

Re: XML parsing

  #2  
Mar 16th, 2008
Have you seen "Dive Into Python" by Mark Pilgrim? It's a fine resource with an excellent section on XML processing, and it's available online too.
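For a quick taste of DOM-style parsing before opening the book, here is a minimal sketch using the standard library's xml.dom.minidom. It is written for Python 3, and the SAMPLE string and the text_runs helper are illustrative stand-ins (not from this thread) for a real word/document.xml:

```python
from xml.dom import minidom

# Hypothetical stand-in for word/document.xml; a real file has far more markup.
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body>'
    '<w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
    '<w:p><w:r><w:t xml:space="preserve">second line </w:t></w:r></w:p>'
    '</w:body></w:document>'
)

def text_runs(xml_source):
    """Return the text of every <w:t> element, in document order."""
    dom = minidom.parseString(xml_source)
    runs = []
    for node in dom.getElementsByTagName('w:t'):
        # Join the text-node children; an empty <w:t/> yields ''.
        runs.append(''.join(child.data for child in node.childNodes
                            if child.nodeType == child.TEXT_NODE))
    return runs

print(text_runs(SAMPLE))
```

The result is the plain list of strings the original question asks for, with no manual searching for tag positions.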
abhi166

Re: XML parsing

  #3  
Mar 17th, 2008
jrcagle

Re: XML parsing

  #4  
Mar 17th, 2008
Yeah, if you think XML is a pain, wait until you try to create the OLE file required for the .doc format. :lol: 300+ pages of documentation, and I gave up in disgust.

Jeff
1337455 10534

Re: XML parsing

  #5  
Mar 18th, 2008
Thank you all for great replies!
Dive Into Python looks really good, I'll read it. Thanks, again.
"And da wind cry moron." ~ Pearls Before Swine
1337455 10534

Re: XML parsing

  #6  
Mar 19th, 2008
Using Dive Into Python, I came up with this.
I decided to parse the file myself; it's easier this way (to me).
import os, sys, zipfile

print """DiXTiXa.py; GPLv3 Python DOCX to TXT Assembler
Converts MS Office Word 2007 files into text files quickly.
"""
fp = raw_input('Enter in the file location (including C:\\ or /home): ')
cwd = os.getcwd()
# Pick the path separator for the current platform.
if os.name != 'nt': dir = '/'
else: dir = '\\'
print "\nDirectory type: %s or %s\n" % (os.name, dir)
try:
    xml = zipfile.ZipFile(fp, 'r')
    docx = xml.read('word%sdocument.xml' % dir)
    xml.close()
except IOError:
    print """Bad file or file location.
Please do not type any ' or " marks in the filepath!
You entered: %s
Please make sure this file exists.
Will now exit.""" % fp
    sys.exit(1)  # a bare "exit" does nothing; actually stop here
# Work on the repr() so non-ASCII bytes show up as escape sequences.
docx = repr(docx)
end = len(docx)
line = 0
out = open('%s%sConverted.txt' % (cwd, dir), 'a')
print 'The converted file is %s%sConverted.txt' % (cwd, dir)

def stripCrap(newline):
    # Trim leading and trailing whitespace.
    return newline.strip()

def openTag(beg, type):
    # Find the closing tag that ends this run of text.
    loc = docx.find('</w:t>', beg, end)
    if type == 1:
        # Skip the 5 characters of '<w:t>'.
        newline = str(docx[beg+5:loc])
        out.write('%s\n' % stripCrap(newline))
        print newline
    elif type == 2:
        # Skip the 26 characters of '<w:t xml:space="preserve">'.
        newline = str(docx[beg+26:loc])
        out.write('%s\n' % stripCrap(newline))
        print newline
    print " Starting location: ", beg, " Ending location: ", loc, " Type: ", type, " "

# Scan every position for either form of the opening text tag.
while line < end:
    if docx[line:line+5] == '<w:t>': openTag(line, 1)
    if docx[line:line+26] == '<w:t xml:space="preserve">': openTag(line, 2)
    line = (line+1)
out.close()
print "\nAll Done."
It works great on a file I tried.
Windows and Posix support!
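As an alternative to scanning the string by hand, the same text could be pulled out with a real XML parser. The following is only a sketch, written for Python 3 with xml.etree.ElementTree rather than the approach in the script above; docx_paragraphs and make_sample_docx are hypothetical helper names, and the sample archive holds just enough markup to exercise the code:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace bound to the w: prefix in document.xml.
W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

def docx_paragraphs(docx_file):
    """Return one string per <w:p> paragraph in a .docx (path or file object)."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    paragraphs = []
    for para in root.iter('{%s}p' % W_NS):
        # A paragraph's text is split across one or more <w:t> runs.
        runs = [t.text or '' for t in para.iter('{%s}t' % W_NS)]
        paragraphs.append(''.join(runs))
    return paragraphs

def make_sample_docx():
    """Build a tiny in-memory archive holding only word/document.xml.

    Hypothetical test fixture: not a complete, Word-openable .docx.
    """
    body = ('<w:document xmlns:w="%s"><w:body>'
            '<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
            '<w:p><w:r><w:t xml:space="preserve">AT&amp;T </w:t></w:r></w:p>'
            '</w:body></w:document>' % W_NS)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as archive:
        archive.writestr('word/document.xml', body)
    buf.seek(0)
    return buf

print(docx_paragraphs(make_sample_docx()))
```

Unlike a manual search for '<w:t>', a parser also decodes entities such as &amp; and copes with <w:t> tags that carry other attributes.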
Last edited by 1337455 10534 : Mar 19th, 2008 at 9:41 pm.
"And da wind cry moron." ~ Pearls Before Swine
1337455 10534

Re: XML parsing

  #7  
Mar 22nd, 2008
Hmm, it's not quite working on Windows yet.

I found the bug but can't kill it.
Does anybody know how to use zipfile.ZipFile to access a file within a directory in a ZIP?

edit: this works great on Linux; I'd post the converted text file but that would be a waste of space.
The problem is probably related to 'directory types' (what I call them).
Linux uses '/', e.g. '/home/user/Desktop' or '/usr/bin/'.
Windows uses '\', e.g. 'C:\Program Files' or 'C:\Documents and Settings\user\Desktop',
but Python treats '\' as the escape character in string literals.
I know you're supposed to use '\\' for Windows, but I've tried that and it's not working.

Any method to access the given .docx's word/document.xml file is welcome, even a full rewrite (well, it's less than 50 lines...).
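One likely explanation, offered as a sketch rather than a confirmed diagnosis of the script above: names of members inside a ZIP archive are always written with forward slashes, whatever the operating system, so the separator chosen for the local filesystem should not be used when calling ZipFile.read. The snippet below (Python 3, with a made-up in-memory archive standing in for a .docx) illustrates the difference:

```python
import io
import zipfile

# Build a small in-memory ZIP with a file inside a "word" directory.
# (Hypothetical stand-in for a real .docx.)
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as archive:
    archive.writestr('word/document.xml', '<doc/>')

with zipfile.ZipFile(buf) as archive:
    names = archive.namelist()
    # The member name uses '/' on Windows and Linux alike.
    data = archive.read('word/document.xml')
    try:
        # An OS-style backslash separator is not a valid member name.
        archive.read('word\\document.xml')
        backslash_ok = True
    except KeyError:
        backslash_ok = False

print(names, data, backslash_ok)
```

Filesystem paths, by contrast, can be built with os.path.join, which avoids typing '\\' by hand for the output file.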
Last edited by 1337455 10534 : Mar 22nd, 2008 at 5:44 pm.
"And da wind cry moron." ~ Pearls Before Swine
1337455 10534

Re: XML parsing

  #8  
Mar 24th, 2008
OK, I fixed it!
It works great!
I decided to give up on converting it to DOC but it converts to text files excellently.
I'll update the version I posted earlier after I get back on Linux to test the new one there.
"And da wind cry moron." ~ Pearls Before Swine