Hi all,

First-timer and first post. I have been given the task of converting large data files to/from a self-defined format. In the "xml" form, the format for each record is:

	<name>  </name>
	<level>  </level>
	<kind>  </kind>
	<source>  </source>
	<flavor>  </flavor>
	<type>  </type>
	<keywords>  </keywords>
	<action>  </action>
	<attackTypeAndRange>   </attackTypeAndRange>
	<prerequisiteText>  </prerequisiteText>
	<target>   </target>
	<attack>    </attack>
	<attackModifier>  </attackModifier>
	<defense> </defense>
	<hit>  </hit>
	<hitDamageModifier>  </hitDamageModifier>
	<miss> </miss>

Each set of data contained within the <power></power> tags is considered one database "record". I've written code (see below) that tears through this type of file, strips the tags and the spaces around the text, then joins the values into one string (with </power> signifying end of line). The code works like a charm, but it's a bit brute-force and inelegant. This became obvious when I was asked to maintain the tag structure of the data (i.e., if there is no data in a given tag, or if a tag isn't found inside a record, I fill in the missing tag with "None" as the value while maintaining the tag order). This is to allow us to re-import the data into a database and maintain the link order of our data to the form inside.

What I would like to do is the following:

Read in the .xml file
Strip the tabs and spaces out, leaving a clean string version of a line of text.
Find some way of comparing the order of tags of the list to the order of tags in my structure, and filling in the missing tag name with the value set to "None".

Join the data, between <Name> and <Notes>, into a single string (as in my code) and write out the resulting data to a file.

Questions about files:

1) Is there any way to read blocks of text that the user defines (in this case, read from <power> to </power>), do my comparison, then load in the next block?

2) Is a list the best way to handle the incoming file data?

Sorry if I am asking a lot. I've looked through all the Python docs and have tried varied approaches, but nothing seems to provide what I need.

Hoping someone can help,


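import re

# Read in the xml file and sort out each Power database record
# Define the location of your xml filename below
xml_read = open("C:/infile.xml")
# Define the location of your csv save filename below
outFile = open('C:/outfile.csv', 'w')

tag_pattern = re.compile(r'<.*?>')

def remove_tags(data):
    # Drop anything that looks like <tag> or </tag>
    return tag_pattern.sub('', data)

xmlEntry = []
for line in xml_read:
    # strip() removes tabs, spaces and the trailing newline in one go
    stripText = remove_tags(line).strip()

    if stripText == "":
        # No text left on this line (<power>, </power> or an empty tag)
        xmlEntry.append('\n')
    else:
        # Assumed reconstruction: finalText was never defined in the posted code.
        # Quoting each value is what the '""' -> '","' replacement below expects.
        finalText = '"' + stripText + '"'
        xmlEntry.append(finalText)

csvEntry = ''.join(xmlEntry)
csvLine = csvEntry.replace('""', '","')
print(csvLine)
# The posted code opened outFile but never wrote to it
outFile.write(csvLine)
xml_read.close()
outFile.close()
# Done!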

Didn't you try parsing your xml file with a standard module like xml.dom.minidom? You just need to write parse("C:/infile.xml") and you get a Document object from which you can extract your data...
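
To make the suggestion concrete, here is a minimal sketch of what that could look like. It assumes the file is well-formed XML with a single root element around the <power> records (the <powers> root, the sample data and the shortened tag list below are made up for illustration; if the real file is not well-formed, minidom will raise a parse error instead). It uses parseString on an inline sample so it runs as-is; for a file on disk, xml.dom.minidom.parse("C:/infile.xml") returns the same kind of Document object.

```python
from xml.dom.minidom import parseString

# Hypothetical sample data; the <powers> root element is an assumption.
SAMPLE = """<powers>
  <power>
    <name> Cleave </name>
    <level> 1 </level>
    <kind></kind>
  </power>
  <power>
    <name> Magic Missile </name>
  </power>
</powers>"""

# Shortened tag list for illustration; the real one would hold every tag, in order.
TAGS = ["name", "level", "kind"]


def text_of(record, tag):
    """Return the stripped text of the first <tag> in record, or "None"."""
    nodes = record.getElementsByTagName(tag)
    if not nodes:
        return "None"  # tag missing from this record
    text = "".join(child.data for child in nodes[0].childNodes
                   if child.nodeType == child.TEXT_NODE).strip()
    return text or "None"  # tag present but empty


def records_to_rows(xml_text, tags=TAGS):
    """Build one list of values per <power> record, always in tag order."""
    doc = parseString(xml_text)
    return [[text_of(power, tag) for tag in tags]
            for power in doc.getElementsByTagName("power")]


rows = records_to_rows(SAMPLE)
for row in rows:
    print(",".join(row))
```

Because each row is built by walking the tag list rather than the file, missing and empty tags both come out as "None" in the right position.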

I looked into it in the documentation, but because I am not working in a "true" xml format, I didn't think it would help.

I should also state that I am a beginner and so am really cobbling together code from examples to get something going. Looking at xml.dom.minidom has left me confused and with more questions than answers.

Thanks for the tip, anyways.

Solved it.

I created a "dummy" list (X) of the tags.
Copied list (X) and added the tags with the entry "None" (<tag>None</tag>), making list (Y).
Copied list (Y) into list (Z).

Using the (X) index, I am able to match each line with "startswith" against an entry in (X) and insert the line at the matching index in (Z). If the "startswith" hits the "</power>" tag, it joins the strings using the len of (X) -1, adds a CRLF character, then writes the string out to the file. I then reset list (Z) using "list(Z) = list(Y)[:]" so that it holds the "<tag>None</tag>" items again for the next "block" of text.

A lot of redundancy, yes, but it works like a dream and is a lot easier to read than some of the XML parser documentation out there.
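
For anyone finding this later, here is a rough sketch of the approach described above. It is not the exact code, just the idea: the function names, the shortened tag list and the sample data are made up for illustration, it assumes each tag sits on its own line, and it yields the joined strings instead of adding CRLF and writing them to a file.

```python
# List (X): the tags in the required order (shortened here for illustration).
TAGS = ["name", "level", "kind"]


def blank_record(tags=TAGS):
    """List (Y): one "<tag>None</tag>" placeholder per tag."""
    return ["<%s>None</%s>" % (tag, tag) for tag in tags]


def fill_records(lines, tags=TAGS):
    """Yield one joined string per <power>...</power> block."""
    current = blank_record(tags)  # list (Z), the working copy
    for raw in lines:
        line = raw.strip()
        if line.startswith("</power>"):
            # End of block: emit the record and reset to the placeholders.
            yield "".join(current)
            current = blank_record(tags)
            continue
        for index, tag in enumerate(tags):
            open_tag = "<%s>" % tag
            close_tag = "</%s>" % tag
            if line.startswith(open_tag):
                value = line[len(open_tag):]
                if value.endswith(close_tag):
                    value = value[:-len(close_tag)]
                value = value.strip()
                if value:  # empty tags keep their "None" placeholder
                    current[index] = open_tag + value + close_tag
                break


# Hypothetical sample input; a real run would pass the open file object instead.
sample = """<power>
    <name> Cleave </name>
    <kind>  </kind>
</power>
<power>
    <level> 3 </level>
</power>""".splitlines()

result = list(fill_records(sample))
for record in result:
    print(record)
```

Since fill_records only needs something it can loop over line by line, passing it an open file handles one <power> block at a time without loading the whole file into a list first.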