Having trouble parsing more than one HTML file into a CSV file
Join Date: Jan 2009
Posts: 5
Reputation:
Solved Threads: 0
Hi Everyone,
I have a program that takes an HTML file as an argument, parses it, and writes the data to a CSV file. This works without problems. However, I need it to take more than one HTML file, parse them all, and put all the collected data into a single CSV file.
I tried reusing the code I have for creating the CSV file, replacing .write with .append, but this raises an error.
Here is the code that reads the HTML file and writes the CSV file:
    if __name__ == "__main__":
        try:
            # Put getopt in place for future usage.
            opts, args = getopt.getopt(sys.argv[1:], None)
        except getopt.GetoptError:
            print usage(sys.argv[0])  # print help information and exit:
            sys.exit(2)
        if len(args) == 0:
            print usage(sys.argv[0])  # print help information and exit:
            sys.exit(2)
        html_files = glob.glob(args[0])
        for htmlfilename in html_files:
            outputfilename = os.path.splitext(htmlfilename)[0] + '.csv'
            parser = html2csv()
            print 'Reading %s, writing %s...' % (htmlfilename, outputfilename)
            try:
                htmlfile = open(htmlfilename, 'rb')
                csvfile = open(outputfilename, 'w+b')
                data = htmlfile.read(8192)
                while data:
                    parser.feed(data)
                    csvfile.write(parser.getCSV())
                    sys.stdout.write('%d CSV rows written.\r' % parser.rowCount)
                    data = htmlfile.read(8192)
                csvfile.write(parser.getCSV(True))
                csvfile.close()
                htmlfile.close()
            except:
                print 'Error converting %s ' % htmlfilename
                try: htmlfile.close()
                except: pass
                try: csvfile.close()
                except: pass
        print 'All done. '
Does anyone have any advice on how to get the program to take more arguments, process them in the same way as above, and append the data to the end of the CSV file?
Thanks in advance for any help. It is really appreciated!
Shaun
Join Date: Jan 2009
Posts: 5
Reputation:
Solved Threads: 0
Hi, thank you for the reply.
When the code runs, there is only one output file. The problem is that it gets overwritten by the next .html file that is processed.
When I run python filename.py *.html
it processes all the .html files in the folder, but the one CSV file that is written contains only the data read from the last .html file. When I try to use .append instead of .write, the program doesn't run and raises an AttributeError.
Does anyone have any ideas how I could do this?
Thanks
Shaun
Join Date: Jul 2007
Posts: 66
Reputation:
Solved Threads: 14
When you open your CSV file, you use the mode "w+b". Is there any reason you're opening the file as binary instead of as a regular text file? Anyway, if you want to append to a file you have to open it in an append mode such as "a" or "a+" (or "ab" / "a+b" for binary).
If you take a look at http://docs.python.org/library/functions.html#open it says:
"The most commonly-used values of mode are 'r' for reading, 'w' for writing (truncating the file if it already exists), and 'a' for appending (which on some Unix systems means that all writes append to the end of the file regardless of the current seek position)."
I.e. when you use "w+b" you truncate the file (delete its content) every time you open it.
Hope this helps
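To make the difference between the modes concrete, here is a small sketch in Python 3 (the code in the question is Python 2). The scratch file and the sample rows are made up for the demonstration. It also shows why swapping .write for .append fails: file objects have a write() method but no append() method, so appending is chosen through the mode passed to open().

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "out.csv")

# "w" truncates on every open, so only the last write survives.
for row in ("a,1\n", "b,2\n"):
    with open(path, "w") as f:
        f.write(row)
with open(path) as f:
    after_w = f.read()

os.remove(path)

# "a" keeps what is already in the file and adds to the end.
for row in ("a,1\n", "b,2\n"):
    with open(path, "a") as f:
        f.write(row)
with open(path) as f:
    after_a = f.read()

# File objects have write() but no append(); calling f.append(...)
# is what raises the AttributeError.
with open(path, "a") as f:
    has_append = hasattr(f, "append")

print(repr(after_w))  # 'b,2\n'
print(repr(after_a))  # 'a,1\nb,2\n'
print(has_append)     # False
```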
Join Date: Mar 2009
Posts: 181
Reputation:
Solved Threads: 28
You can still take vidaj's advice if you use the same output file and want to maintain the data over multiple runs of your program. But you should also take the advice above. There's no need to open the file more than once.
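As a rough sketch of opening the output only once, the following Python 3 example (the thread's code is Python 2) opens a single CSV file before the loop and writes the rows from every HTML file into it. The html2csv class from the question is not shown in the thread, so TableRows below is a hypothetical stand-in that just collects table cell text; combine() and the file names are likewise made up for illustration.

```python
import csv
import glob
import os
import tempfile
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Hypothetical stand-in for the thread's html2csv parser:
    collects the text of each <td>/<th> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def combine(html_paths, csv_path):
    """Parse every HTML file and write all rows into one CSV file."""
    total = 0
    # Open the output once, before the loop, so nothing gets truncated.
    with open(csv_path, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        for html_path in html_paths:
            parser = TableRows()  # fresh parser for each input file
            with open(html_path, encoding="utf-8") as htmlfile:
                parser.feed(htmlfile.read())
            parser.close()
            writer.writerows(parser.rows)
            total += len(parser.rows)
    return total


# Demonstration with two throwaway HTML files.
workdir = tempfile.mkdtemp()
for name, cell in (("one.html", "alpha"), ("two.html", "beta")):
    with open(os.path.join(workdir, name), "w", encoding="utf-8") as f:
        f.write("<table><tr><td>%s</td><td>1</td></tr></table>" % cell)

inputs = sorted(glob.glob(os.path.join(workdir, "*.html")))
output = os.path.join(workdir, "combined.csv")
count = combine(inputs, output)
with open(output, newline="") as f:
    combined = list(csv.reader(f))
print(count, combined)  # 2 [['alpha', '1'], ['beta', '1']]
```

Because the file is opened once in "w" mode, each run of the program starts with a fresh CSV file; switching that one open() call to "a" would instead keep the rows from earlier runs.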





