Troubleshooting utility to parse XML objects

Question

Saran_1 0 Junior Poster in Training

8 Years Ago

I am currently troubleshooting a utility that I have been working on. The main file of the utility is below. Please note that `flatten_dict` lives in a separate file and `makerows` is a separate function.

My objectives are to:

* Recursively traverse a directory (done).
* Find XML and text files only. Any thoughts on how to build this into the function below would be very helpful.
* Print these files out to the console (done).
* For each text or XML file:
    1. apply `flatten_dict` to parse it into key-value pairs (done), and
    2. apply `makerows` to write the result out to CSV (done).

**MAIN CONUNDRUM**: How do I modify the code below, which writes out to CSV:

    writer = csv.writer(open("save.csv", 'wt'))
    writer.writerows(self.makerows(flatten_dict(root)))

such that I can write out an individual CSV for each XML/text file that is inputted for processing?


    # import the os.path library
    import os.path
    #import the sys library
    import sys
    from parsexml2 import flatten_dict, ElementTree
    import csv


    # The class name
    class IterateFiles(object):

    #helper function for generator object for writing out to CSV

        def makerows(self, pairs):
            #write out to CSV
            headers = []
            columns = {}
            for k, v in pairs:
                if k in columns:
                    columns[k].extend((v,))
                else:
                    headers.append(k)
                    columns[k] = [k, v]
            m = max(len(c) for c in columns.values())
            for c in columns.values():
                c.extend('' for i in range(len(c), m))
            L = [columns[k] for k in headers]
            rows = list(zip(*L))
            return rows

        def open_and_parse(self, filename):       

            try:
                with open(filename, 'r', encoding='utf-8') as f: 
                    xml_string = f.read() 
                    xml_string= xml_string.replace('&#x0;', '') #optional to remove ampersands. 
                    root = ElementTree.XML(xml_string)
                    for item in root:
                        print(root)
                        writer = csv.writer(open("save.csv", 'wt'))
                        writer.writerows(self.makerows(flatten_dict(root)))
            except:
                raise IOError("it's monday and the sun is shining")

        # A function which iterates through the directory
        def findFiles(self, directory):
            # check whether the current directory exists
            if os.path.exists(directory):
                # check whether the given directory is a directory
                if os.path.isdir(directory):
                    # list all the files within the directory
                    dirFileList = os.listdir(directory)
                    # Loop through the individual files within the directory
                    for filename in dirFileList:
                        # Check whether file is directory or file
                        if(os.path.isdir(os.path.join(directory,filename))):
                            print(os.path.join(directory,filename) + \
                            ' is a directory and therefore ignored!')
                        elif(os.path.isfile(os.path.join(directory,filename))):
                             # print(os.path.join(directory,filename))
                             print(os.path.basename(filename))
                             self.open_and_parse(filename)
                        else:
                            print(filename + ' is NOT a file or directory!')
                else:
                    print(directory + ' is not a directory!')
            else:
                print(directory + ' does not exist!')   


        def run(self):
            # Set the folder to search
            searchFolder = 'C:\\Users\\samples\\'
            self.findFiles(searchFolder)

    # Run the script from command line – note the two underscores
    if __name__ == '__main__':
        obj = IterateFiles()
        obj.run()


Thanks.

csv os python sys

Gribouillis 1,391 Programming Explorer

8 Years Ago

You can write

    dest = self.destination_csv(filename)
    with open(dest, 'wt') as fh:
        writer = csv.writer(fh)
        writer.writerows(self.makerows(flatten_dict(root)))

Then you need a method

    def destination_csv(self, filename):
        """Compute a destination filename from a source filename

        for example if filename is
            C:\foo\bar\baz\awesomedata.xml

        the result could be
            C:\foo\bar\baz\CSV\awesomedata.csv
        """

Use functions from the os.path module and string operations to compute the destination file.
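
The docstring above can be turned into a minimal sketch along the following lines. This is only one possible rule, not necessarily the one intended in the thread: the `subdir` parameter and its default value are assumptions made for illustration, and the function is written as a plain function rather than a method so that it runs on its own.

```python
import os.path


def destination_csv(filename, subdir="CSV"):
    """Return the path of the CSV file that corresponds to a source file.

    Only a string is computed here: nothing is opened, created or renamed.
    `subdir` is a hypothetical parameter naming the output subfolder.
    """
    folder, basename = os.path.split(filename)   # e.g. ("foo/bar/baz", "awesomedata.xml")
    stem, _ext = os.path.splitext(basename)      # e.g. ("awesomedata", ".xml")
    return os.path.join(folder, subdir, stem + ".csv")


source = os.path.join("foo", "bar", "baz", "awesomedata.xml")
print(destination_csv(source))
```

Because the function only manipulates strings, it can be tried in an interpreter without touching any real file.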

Edited 8 Years Ago by Gribouillis

Gribouillis 1,391 Programming Explorer

8 Years Ago

You did not understand my advice. The function destination_csv() is only supposed to take a filename argument (such as C:\foo\bar\baz\awesomedata.xml) and return another string, such as "C:\\Users\\Desktop\\Playground\\Samples\\CSV_Records\\awesomedata.csv". It is not at all supposed to open or parse the file.

Gribouillis 1,391 Programming Explorer

8 Years Ago

The directory needs to be made only once. destination_csv() does not rename the XML file. It only creates a new destination filename where the CSV data will be written, without modifying the source XML file. The name of the destination file is built from the name of the source file, which makes it possible to handle several files without their names colliding.

Edit: destination_csv() is not at all recursive. It handles only a single file name.
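
One way to make the directory only once is to create it before any file is processed, using `os.makedirs()` with `exist_ok=True`. The sketch below is an illustration under assumptions: the helper name `ensure_output_dir` and the `CSV_Reports` default are hypothetical, and a temporary directory stands in for the real Samples folder.

```python
import os
import tempfile


def ensure_output_dir(source_dir, subdir="CSV_Reports"):
    """Create source_dir/subdir if it is missing and return its path.

    exist_ok=True makes repeated calls harmless, so no try/except
    around an "already exists" error is needed.
    """
    out_dir = os.path.join(source_dir, subdir)
    os.makedirs(out_dir, exist_ok=True)
    return out_dir


# Demonstration in a throwaway directory rather than a real Samples folder.
with tempfile.TemporaryDirectory() as samples:
    first = ensure_output_dir(samples)
    second = ensure_output_dir(samples)   # the second call does not raise
    created = os.path.isdir(first)
    same_path = first == second
    print(created, same_path)
```

Calling such a helper once, for example at the start of `run()`, keeps directory creation out of destination_csv() entirely.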

Edited 8 Years Ago by Gribouillis

Gribouillis 1,391 Programming Explorer

8 Years Ago

You don't need to change the initial findFiles() function which calls open_and_parse(). The destination_csv() function must not rename any file nor iterate over a listdir etc.

In your original code, you only need to replace lines 60 and 61 with

    dest = self.destination_csv(filename)
    with open(dest, 'wt') as fh:
        writer = csv.writer(fh)
        writer.writerows(self.makerows(flatten_dict(root)))

I think you still don't understand what destination_csv() should do.
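
To make the replacement above concrete, here is a self-contained sketch of writing one CSV per source file. Several things in it are assumptions: `destination_csv` uses a deliberately simple rule (same folder, `.csv` extension), `write_csv` is a hypothetical helper, and `sample_rows` stands in for the output of `makerows(flatten_dict(root))`, which is not available here.

```python
import csv
import os
import tempfile


def destination_csv(filename):
    """Hypothetical rule: same folder and name, with a .csv extension."""
    return os.path.splitext(filename)[0] + ".csv"


def write_csv(filename, rows):
    """Write rows to the CSV file derived from filename and return its path."""
    dest = destination_csv(filename)
    # newline="" is what the csv module documentation recommends in Python 3;
    # without it, extra blank lines can appear between rows on Windows.
    with open(dest, "w", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerows(rows)
    return dest


# Stand-in rows; in the thread they come from makerows(flatten_dict(root)).
sample_rows = [("id", "name"), ("1", "foo"), ("2", "bar")]

with tempfile.TemporaryDirectory() as folder:
    written = []
    for name in ("a.xml", "b.txt"):
        dest = write_csv(os.path.join(folder, name), sample_rows)
        written.append(os.path.basename(dest))
    with open(os.path.join(folder, "a.csv"), newline="", encoding="utf-8") as fh:
        read_back = list(csv.reader(fh))
```

Each source file gets its own output file because the destination is derived from the source name rather than hard-coded as `save.csv`.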

Edited 8 Years Ago by Gribouillis

Gribouillis 1,391 Programming Explorer

8 Years Ago

Indentation of line 87 is incorrect. Now here is a hypothetical example which shows how destination_csv() should work:

    >>> obj.destination_csv('C:\\Playground\\Samples\\FOO.xml')
    C:\Playground\Samples\CSV_Reports\FOO.csv

Edit: I shortened the path for the example.

Edited 8 Years Ago by Gribouillis

Saran_1 0 Junior Poster in Training · Answer 1 · 2015-06-30T19:04:58+00:00

Here is my version (I know it is rather clunky and inefficient).

Two issues:

The first function, check_config, takes a configuration file (txt) listing the names of files I want to exclude from further processing. I compute the intersection of files_in_dir and lines (which contains those names) and then attempt to move the matching files. However, they do not seem to move: they are still present in the directory.

    def check_config(self, conf_file):
        path = '.'
        lines = [line.rstrip('\n') for line in open(conf_file)]
        files_in_dir = [f for f in os.listdir(path) if f.endswith('txt') or f.endswith('xml')]
        intersection = [item for item in lines if item[:].strip() in [item[:].strip() for item in files_in_dir]]
        print(intersection)
        dst_dir = 'C:\\Users\\temp'
        for file in intersection:
            try:
                os.mkdir(dst_dir)
            except OSError as e:
                if e.errno == errno.EEXIST:
                    break
            basename = os.path.basename(file)
            head, tail = os.path.splitext(basename)
            dst_file = os.path.join(dst_dir, basename)
        # rename if necessary
            count = 0
            while os.path.exists(dst_file):
                count += 1
                dst_file = os.path.join(dst_dir, '%s-%d%s' % (head, count, tail))
            print('Renaming %s to %s' % (file, dst_file))
            print(dst_file)
            os.rename(file, dst_file)
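
A likely cause of the first issue, reading the code above: the `break` runs as soon as `os.mkdir()` reports that the directory already exists, and `break` leaves the whole `for` loop before `os.rename()` is reached. If the directory is missing, the first file is moved and the loop then stops on the second one. The sketch below shows one possible fix under assumptions: the helper names `unique_destination` and `move_excluded` are hypothetical, the directory is created once before the loop, and temporary folders stand in for the real paths.

```python
import os
import shutil
import tempfile


def unique_destination(dst_dir, basename):
    """Return a path in dst_dir that does not exist yet, adding -1, -2, ... if needed."""
    head, tail = os.path.splitext(basename)
    candidate = os.path.join(dst_dir, basename)
    count = 0
    while os.path.exists(candidate):
        count += 1
        candidate = os.path.join(dst_dir, "%s-%d%s" % (head, count, tail))
    return candidate


def move_excluded(src_dir, names, dst_dir):
    """Move each listed file that exists in src_dir into dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)   # once, before the loop; no break needed
    moved = []
    for name in names:
        name = name.strip()
        src = os.path.join(src_dir, name)
        if name and os.path.isfile(src):
            moved.append(shutil.move(src, unique_destination(dst_dir, name)))
    return moved


# Demonstration with temporary folders (assumed names, not the poster's paths).
with tempfile.TemporaryDirectory() as src_dir, tempfile.TemporaryDirectory() as parent:
    for name in ("1-Response.txt", "2-Response.xml", "keep.txt"):
        open(os.path.join(src_dir, name), "w").close()
    config_lines = ["1-Response.txt\n", "2-Response.xml", "missing.txt"]
    moved = move_excluded(src_dir, config_lines, os.path.join(parent, "temp"))
    remaining = sorted(os.listdir(src_dir))
    moved_names = sorted(os.path.basename(p) for p in moved)
```

Checking `os.path.isfile()` on the joined path replaces the list intersection, so names in the config file that are absent from the folder are skipped.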

Second issue:

I took your advice and came up with this:

    def destination_csv(self, filename):
        """Compute a destination filename from a source filename"""

        """for example if filename is
            C:\foo\bar\baz\awesomedata.xml
        the result could be
            C:\foo\bar\baz\CSV\awesomedata.csv
        """
        files_in_dir = [f for f in os.listdir(path) if f.endswith('txt') or f.endswith('xml')]
        destination_dir = "C:\\Users\\Desktop\\Playground\\Samples\\CSV_Records"
        for filename in files_in_dir:
            try:
                os.mkdir(destination_dir)
                if filename in files_in_dir:
                    return open_and_parse(filename)
                else:
                    raise ValueError("Not a valid filetype")
            except:
                raise OSError("Directory may already exist")

    def open_and_parse(self, filename):       

        try:
            with open(filename, 'r', encoding='utf-8') as f: 
                xml_string = f.read() 
                xml_string= xml_string.replace('&#x0;', '') #optional to remove ampersands. 
                root = ElementTree.XML(xml_string)
                for item in root:
                    # print(item)
                    dest = self.destination_csv(filename)
                with open(dest, "wt") as fh:
                    writer = csv.writer(fh)
                    writer.writerows(self.makerows(flatten_dict(root)))
        except:
            raise IOError("it's monday and the sun is shining")

I am wondering if this is the best way to do this?
Thank you, in advance, for your time and for your feedback.

Saran_1 0 Junior Poster in Training · Answer 2 · 2015-06-30T19:46:34+00:00

So destination_csv() would recursively take each txt or XML file and rename it? I would only need to make the directory once. This should be outside the iterative loop, correct?

Saran_1 0 Junior Poster in Training · Answer 3 · 2015-06-30T20:09:52+00:00

I have tried this:

        folder = '"C:\\Users\\wynsa2\\Desktop\\Playground\\Samples\\'
        for filename in os.listdir(folder):
            infilename = os.path.join(folder,filename)
            if not os.path.isfile(infilename): continue
            oldbase = os.path.splitext(filename)
            newname = infilename.replace('.txt', '.csv')
            output = os.rename(infilename, newname)
            print(output)

I receive an OS Error: WinError 123: The filename, directory name or volume label syntax is incorrect "C:\\Users\\wynsa2\\Desktop\\Playground\\Samples\\
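
The error message points at the string literal itself: it opens with a single quote and is immediately followed by a stray double quote, so the folder name starts with a `"` character, which Windows does not allow in paths. The sketch below only compares the two strings and computes a new name; it does not rename anything. `ntpath` is used so that the Windows path handling can be tried on any platform, and `csv_name` is a hypothetical helper.

```python
import ntpath  # Windows flavour of os.path, importable on any platform

# As posted: the literal opens with ' and then contains a " character.
bad_folder = '"C:\\Users\\wynsa2\\Desktop\\Playground\\Samples\\'
# With the stray double quote removed.
good_folder = 'C:\\Users\\wynsa2\\Desktop\\Playground\\Samples\\'

print(repr(bad_folder[0]))
print(repr(good_folder[0]))


def csv_name(path):
    """Swap only the extension.

    str.replace('.txt', '.csv') would also change a '.txt' that happens
    to appear elsewhere in the path.
    """
    return ntpath.splitext(path)[0] + '.csv'


new_name = csv_name(ntpath.join(good_folder, '2-Response.txt'))
print(new_name)
```

Note also that `os.rename()` returns `None`, so `print(output)` in the snippet above would print `None` even when the rename succeeds.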

I am also wondering how I would call the open_and_parse() function?

Saran_1 0 Junior Poster in Training · Answer 4 · 2015-06-30T20:26:35+00:00

I have already done so (please see lines 87 - 90). I am unclear as to the purpose of destination_csv().

    # import the os.path library
    import os.path
    #import the sys library
    import sys
    # from parsexml2 import flatten_dict, ElementTree
    import csv
    import os
    import shutil
    import fnmatch
    import errno

    # The class name
    class IterateFiles(object):

    #helper function for generator object for writing out to CSV
        def makerows(self, pairs):
            #write out to CSV
            headers = []
            columns = {}
            for k, v in pairs:
                if k in columns:
                    columns[k].extend((v,))
                else:
                    headers.append(k)
                    columns[k] = [k, v]
            m = max(len(c) for c in columns.values())
            for c in columns.values():
                c.extend('' for i in range(len(c), m))
            L = [columns[k] for k in headers]
            rows = list(zip(*L))
            return rows


        def check_config(self, conf_file):
            path = '.'
            lines = [line.rstrip('\n') for line in open(conf_file)]
            files_in_dir = [f for f in os.listdir(path) if f.endswith('txt') or f.endswith('xml')]
            intersection = [item for item in lines if item[:].strip() in [item[:].strip() for item in files_in_dir]]
            print(intersection)
            dst_dir = 'C:\\Users\\wynsa2\\temp\\'
            for file in intersection:
                try:      
                    os.mkdir(dst_dir)
                except:
                    raise IOError("Directory already present")
            # rename if necessary
                basename = os.path.basename(file)
                head, tail = os.path.splitext(basename)
                dst_file = os.path.join(dst_dir, basename)
                count = 0
                while os.path.exists(dst_file):
                    count += 1
                    dst_file = os.path.join(dst_dir, '%s-%d%s' % (head, count, tail))
                    print('Renaming %s to %s' % (file, dst_file))
                    print(dst_file)
                    os.rename(file, dst_file)


        def destination_csv(self, filename):
            """Compute a destination filename from a source filename"""

            """for example if filename is
                C:\foo\bar\baz\awesomedata.xml
            the result could be
                C:\foo\bar\baz\CSV\awesomedata.csv
            """
            mkdir("C:\\Users\\wynsa2\\Desktop\\Playground\\Samples\\CSV_Reports\\")
            folder = '"C:\\Users\\wynsa2\\Desktop\\Playground\\Samples\\'
            for filename in os.listdir(folder):
                infilename = os.path.join(folder,filename)
                if not os.path.isfile(infilename): continue
                oldbase = os.path.splitext(filename)
                newname = infilename.replace('.txt', '.csv')
                output = os.rename(infilename, newname)
                print(output)


        def open_and_parse(self, filename):       

            try:
                with open(filename, 'r', encoding='utf-8') as f: 
                    xml_string = f.read() 
                    xml_string= xml_string.replace('&#x0;', '') #optional to remove ampersands. 
                    root = ElementTree.XML(xml_string)
                    for item in root:
                        print(item)
                        dest = self.destination_csv(filename)
                    with open(dest, "wt") as fh:
                        writer = csv.writer(fh)
                        writer.writerows(self.makerows(flatten_dict(root)))
            except:
                raise IOError("it's monday and the sun is shining")



        # A function which iterates through the directory
        def findFiles(self, directory):
            # check whether the current directory exists
            if os.path.exists(directory):
                # check whether the given directory is a directory
                if os.path.isdir(directory):
                    # list all the files within the directory
                    dirFileList = os.listdir(directory)
                    # Loop through the individual files within the directory
                    for filename in dirFileList:
                        # Check whether file is directory or file
                        if(os.path.isdir(os.path.join(directory,filename))):
                            print(os.path.join(directory,filename) + \
                            ' is a directory and therefore ignored!')
                        elif(os.path.isfile(os.path.join(directory,filename))):
                             # print(os.path.join(directory,filename))
                             print(os.path.basename(filename))
                             self.open_and_parse(filename)
                        else:
                            print(filename + ' is NOT a file or directory!')
                else:
                    print(directory + ' is not a directory!')
            else:
                print(directory + ' does not exist!')   



        def run(self):
            # Set the folder to search
            searchFolder = 'C:\\Users\\wynsa2\\Desktop\\Playground\\Samples\\'
            self.findFiles(searchFolder)

    # Run the script from command line – note the two underscores
    if __name__ == '__main__':
        obj = IterateFiles()
        obj.destination_csv("2-Response.txt")
        # obj.check_config("C:\\Users\\wynsa2\\Desktop\\Playground\\Samples\\config.txt")
        #obj.run()

Saran_1 0 Junior Poster in Training · Answer 5 · 2015-06-30T20:42:46+00:00

Thanks for catching that. I am trying to emulate what you have provided me. Am I on the right path? My apologies for the indentation inconsistencies.

    name, extension = os.path.splitext(filename)                                
        if extension == ".txt":                                                       
            dest_filename = os.path.join(dest, filename)                            
            if not os.path.isfile(dest_filename):                                      
                # We copy the file as is
                shutil.copy(os.path.join(source, filename) , dest)                  
            else:                       
                # We rename the file with a number in the name incrementing the number until we find one that is not used. 
                # This should be moved to a separate function to avoid code duplication when handling the different file extensions                                   
                i = 0                                                                  
                dest_filename = os.path.join(dest, "%s%d%s" % (name, i, extension)) 
                while os.path.isfile(dest_filename):                                   
                    i += 1                                                                                                                      
                    dest_filename = os.path.join(dest, "%s%d%s" % (name, i, extension))
                shutil.copy(os.path.join(source, filename), dest_filename)

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 6 · 2015-06-30T20:51:49+00:00

You don't need to copy, move or rename any file. You only need to create a string: the system path to the destination file. You don't need to create or write this destination file. It will be written later by the call to writerows().

Saran_1 0 Junior Poster in Training · Answer 7 · 2015-06-30T23:26:41+00:00

Ahhhhhhhh!!!!! I see it now - it's amazing what time away from the screen does!!! Thank you for your patience!

Saran_1 0 Junior Poster in Training · Answer 8 · 2015-07-01T00:19:09+00:00

Follow-up questions:

When you say "create a string" do you mean that I dictate the relative path?

If so, since I am a Windows user:

    import glob
    import os.path
    import os

    def destination_csv(self, filename):

        CSV_Dir = os.mkdir("C:\\Users\\Desktop\\Playground\\Samples\\CSV_Reports\\")
        XML_Folder = '"C:\\Users\\Desktop\\Playground\\Samples\\'

        for filename in XML_Folder:

            base_filename= glob.glob("C:\\Users\\Desktop\\Playground\\Samples/*.csv")
            dir_name = "C:\\Users\\Desktop\\Playground\\Samples\\CSV_Reports\\"
            filename_suffix = '.csv'
            os.path.join(dir_name, base_filename + filename_suffix)

The only issue is that when I run this I receive a TypeError: can only concatenate list (not "str") to list. This error occurs as part of the entire function suite:

    def open_and_parse(self, filename):

        try:
            with open(filename, 'r', encoding='utf-8') as f: 
                xml_string = f.read() 
                xml_string= xml_string.replace('&#x0;', '') #optional to remove ampersands. 
                root = ElementTree.XML(xml_string)
                for item in root:
                    print(item)
                dest = self.destination_csv(filename)
                with open(dest, "wt") as fh:
                    writer = csv.writer(fh)
                    writer.writerows(self.makerows(flatten_dict(root)))
        except:
            raise IOError("it's monday and the sun is shining")

and

    # A function which iterates through the directory
    def findFiles(self, directory):
        # check whether the current directory exists
        if os.path.exists(directory):
            # check whether the given directory is a directory
            if os.path.isdir(directory):
                # list all the files within the directory
                dirFileList = os.listdir(directory)
                # Loop through the individual files within the directory
                for filename in dirFileList:
                    # Check whether file is directory or file
                    if(os.path.isdir(os.path.join(directory,filename))):
                        print(os.path.join(directory,filename) + \
                        ' is a directory and therefore ignored!')
                    elif(os.path.isfile(os.path.join(directory,filename))):
                         # print(os.path.join(directory,filename))
                         print(os.path.basename(filename))
                         self.open_and_parse(filename)
                    else:
                        print(filename + ' is NOT a file or directory!')
            else:
                print(directory + ' is not a directory!')
        else:
            print(directory + ' does not exist!')
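
The TypeError quoted above comes from the expression `base_filename + filename_suffix`: `glob.glob()` returns a list of matching paths, and a list cannot be concatenated with a string. The sketch below reproduces the error in a temporary folder (an assumption made so that it runs anywhere) and shows that concatenation works once a single name is used.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, "10-Response.csv"), "w").close()

    matches = glob.glob(os.path.join(folder, "*.csv"))
    is_list = isinstance(matches, list)

    try:
        matches + ".csv"             # list + str, as in base_filename + filename_suffix
    except TypeError as exc:
        message = str(exc)
        print(message)

    # String concatenation works once a single name (a str) is used instead.
    stem = os.path.splitext(os.path.basename(matches[0]))[0]
    rebuilt = stem + ".csv"
```

Two further points about the posted destination_csv() are worth noting: `for filename in XML_Folder` iterates over the characters of a string, and the function has no `return` statement, so `dest` in open_and_parse() would be `None`.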

Follow-up question: is it possible to abstract this so I don't have to explicitly dictate a path? For example, instead of the current directory, perhaps use os.getcwd()?

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 9 · 2015-07-01T05:22:13+00:00

There is still a for loop in destination_csv(). You don't need it. This function handles only a single filename, which is already a parameter of the function.

"Do you mean that I dictate the relative path?"

Your program takes XML files as input and writes CSV files as output. You are the only one to know where you want to store the csv files on your file system and you must tell this to your program. This is what destination_csv() does.

The computer cannot guess where the output file must be written.

Write a path to one of your xml source files and the corresponding path to the output csv file that you want. destination_csv() must be able to transform the one into the other.

Saran_1 0 Junior Poster in Training · Answer 10 · 2015-07-01T10:58:46+00:00

I understand the purpose now (I was overthinking and overengineering this problem wasn't I?)

Is this all I have to do? I have to include the absolute path, correct? Moreover, I do need to join the source file path to the CSV file path. I am not quite sure which method in os.path I would use to achieve this. This is what I have so far:

    def destination_csv(self, filename):

        CSV_Dir = os.mkdir("C:\\Users\\Desktop\\Playground\\Samples\\CSV_Reports\\")
        CSV_Folder = os.join(CSV_Dir, "10-Response.csv")
        XMLSource_File_Folder = "C:\\Users\\Desktop\\Playground\\Samples\\10-Response.txt"


        """Compute a destination filename from a source filename"""
        """for example if filename is
            C:\foo\bar\baz\awesomedata.xml
        the result could be
            C:\foo\bar\baz\CSV\awesomedata.csv
        """

Saran_1 0 Junior Poster in Training · Answer 11 · 2015-07-01T13:07:43+00:00

Follow-up edit: My question about a method that would take all of the files in my directory through this procedure still stands. I have 64 files. Does this mean that I have to dictate each path individually? I thought that using os.path.basename() or the suffix (as I did above) would allow me to iterate through the source file directory and then connect each file to the CSV directory's path.

    def destination_csv(self, filename):
        CSV_Dir = os.mkdir("C:\\Users\\Desktop\\Playground\\Samples\\CSV_Reports\\")
        save_path = "C:\\Users\\Desktop\\Playground\\Samples\\CSV_Reports\\10-Response.csv"
        filename = "C:\\Users\\Desktop\\Playground\\Samples\\10-Response.txt"
        return save_path

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 12 · 2015-07-01T14:03:42+00:00

You need to compute the save_path from the filename argument (which is the path to the source file). The idea is that you don't dictate each path individually; you dictate a rule that creates the save_path.
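
A rule of this kind can be sketched with `pathlib.PureWindowsPath`, which manipulates Windows-style paths as pure strings on any platform. The `subdir` parameter, its `CSV_Reports` default and the sample file numbers are assumptions for illustration; the point is that one function covers every source file, so no path is written out by hand.

```python
from pathlib import PureWindowsPath


def destination_csv(filename, subdir="CSV_Reports"):
    """One rule for every file: <folder>\\<subdir>\\<name>.csv"""
    src = PureWindowsPath(filename)
    return str(src.parent / subdir / (src.stem + ".csv"))


samples = "C:\\Users\\Desktop\\Playground\\Samples"
sources = [samples + "\\%d-Response.txt" % n for n in (2, 10, 64)]
for path in sources:
    # The same rule is applied to each source path in turn.
    print(path, "->", destination_csv(path))
```

The loop over the files stays outside destination_csv(): the function handles one filename, and the caller applies it as many times as there are files.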

Saran_1 0 Junior Poster in Training · Answer 13 · 2015-07-01T15:27:40+00:00

I am stuck on this. Do I not need to iterate through the directory, such that:

        save_path = glob.glob("C:\\Users\\Desktop\\Playground\\Samples\\*.csv")
        dir_name = "C:\\Users\\Desktop\\Playground\\Samples\\CSV_Reports\\"
        filename = "C:\\Users\\Desktop\\Playground\\Samples\\*.txt"

And then return save_path with a for loop to go through the directory?

Saran_1 0 Junior Poster in Training · Answer 14 · 2015-07-02T18:27:09+00:00

Hi Gribouillis:

I slept on what my main goal is, and I think I now understand what you are trying to explain to me. So I went back to the destination_csv function to iterate through the cwd. Would you be so kind as to suggest a method to dictate the rule for saving the destination path for each file that is processed? Thanks again for your help.

    def destination_csv(self, path, filter):
        for root, dirs, files in os.walk(path):
            for file in fnmatch.filter(files, filter):
                yield os.path.join(root, file)


    for textFile in findFiles(r'C:\\Users\\Desktop\\Playground\\Samples_Copy', '*.txt'):
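
The generator above is really a file finder rather than a destination rule, and keeping the two roles apart may help. The sketch below is one way to find XML and text files recursively; `find_files` and its `patterns` parameter are hypothetical names, and a temporary tree stands in for the Samples folder. It yields full paths, so a later `open()` does not depend on the current working directory.

```python
import fnmatch
import os
import tempfile


def find_files(top, patterns=("*.xml", "*.txt")):
    """Yield the full path of every file under top that matches a pattern."""
    for root, _dirs, files in os.walk(top):
        for name in files:
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                yield os.path.join(root, name)


# Throwaway tree for the demonstration; a real call would pass the Samples folder.
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "sub"))
    for rel in ("a.xml", "b.txt", "c.csv", os.path.join("sub", "d.xml")):
        open(os.path.join(top, rel), "w").close()
    found = sorted(os.path.relpath(p, top) for p in find_files(top))
    print(found)
```

Each path yielded this way can then be handed to open_and_parse(), which in turn asks destination_csv() where the CSV for that one file should go.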

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 15 · 2015-07-02T19:45:26+00:00

Sorry I don't understand this code. I think I explained what I meant as clearly as I could, but if you really don't understand it, you can use your own different method. Perhaps someone else can help you with this.

Saran_1 0 Junior Poster in Training · Answer 16 · 2015-07-02T20:29:05+00:00

Solved! I finally understand what you were saying. Thanks again for the feedback!
