having trouble parsing more than one html file into a csv file

#1 shaun.b, Apr 10th, 2009

Hi Everyone,

I have a program that takes an HTML file as an argument, parses it, and writes the data to a CSV file. That works without any problem. But I need it to take more than one HTML file, parse them all, and put all the collected data into one CSV file.

I tried reusing the code that creates the CSV file, replacing .write with .append, but that throws an error.

The following is the code for reading the html file and writing the CSV file:

    # Excerpt in Python 2 syntax. The html2csv class and usage() are
    # defined elsewhere in the script and are not shown here.
    import getopt, glob, os, sys

    if __name__ == "__main__":
        try:  # Put getopt in place for future usage.
            opts, args = getopt.getopt(sys.argv[1:], None)
        except getopt.GetoptError:
            print usage(sys.argv[0])  # print help information and exit
            sys.exit(2)
        if len(args) == 0:
            print usage(sys.argv[0])  # print help information and exit
            sys.exit(2)
        html_files = glob.glob(args[0])
        for htmlfilename in html_files:
            outputfilename = os.path.splitext(htmlfilename)[0] + '.csv'
            parser = html2csv()
            print 'Reading %s, writing %s...' % (htmlfilename, outputfilename)
            try:
                htmlfile = open(htmlfilename, 'rb')
                csvfile = open(outputfilename, 'w+b')
                # Read the HTML in 8 KB chunks and write CSV as it is parsed.
                data = htmlfile.read(8192)
                while data:
                    parser.feed(data)
                    csvfile.write(parser.getCSV())
                    sys.stdout.write('%d CSV rows written.\r' % parser.rowCount)
                    data = htmlfile.read(8192)
                csvfile.write(parser.getCSV(True))
                csvfile.close()
                htmlfile.close()
            except:
                print 'Error converting %s ' % htmlfilename
                try: htmlfile.close()
                except: pass
                try: csvfile.close()
                except: pass
        print 'All done. '

Does anyone have any advice on how to get the program to take more arguments, process them in the same way as above, and then append the data to the end of the CSV file?

Thanks in advance for any help. It is really appreciated!

Shaun

#2 adam1122, Apr 10th, 2009

Why not open the csv file outside of the loop? That would result in having only one csv output.
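
A minimal sketch of this suggestion, assuming Python 3 rather than the Python 2 of the original script. The html2csv class is not shown in the thread, so TableRowCollector below is a hypothetical stand-in that only gathers the text of table cells and does not handle nested tables; html_files_to_one_csv is likewise an illustrative name. The point is the placement of open(): the output file is opened once, before the loop, so no input file truncates what an earlier one wrote.

```python
import csv
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Hypothetical stand-in for html2csv: collects the text of table cells."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row being built, or None outside <tr>
        self._cell = None   # text chunks of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            if self._row is not None:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_files_to_one_csv(html_paths, csv_path):
    """Write the table rows of every HTML file into a single CSV file."""
    total = 0
    # Open the output once, before the loop, so later files do not truncate it.
    with open(csv_path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        for path in html_paths:
            parser = TableRowCollector()  # fresh parser for each input file
            with open(path, encoding="utf-8") as htmlfile:
                parser.feed(htmlfile.read())
            parser.close()
            writer.writerows(parser.rows)
            total += len(parser.rows)
    return total
```

It could be called as html_files_to_one_csv(sorted(glob.glob("*.html")), "all.csv") after importing glob; the output name all.csv is only an example.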

#3 shaun.b, Apr 15th, 2009

Hi, thank you for the reply.

When the code runs, there is only one output file. The problem is that it gets overwritten by the next .html file that is processed.

When I run python filename.py *.html

it processes all the .html files in the folder, but the one CSV file that is written only contains the data read from the last .html file. When I try to use .append instead of .write, the program doesn't run and throws an attribute error.

Does anyone have any ideas how I could do this?

Thanks

Shaun

#4 vidaj, Apr 15th, 2009

When you open your CSV file, you use the mode "w+b". Any reason you're opening the file in binary mode instead of as a regular text file? Anyway, if you want to append text to a file you have to open it in an append mode such as "a" or "a+" (or "a+b" for binary).

If you take a look at http://docs.python.org/library/functions.html#open it says:
The most commonly-used values of mode are 'r' for reading, 'w' for writing (truncating the file if it already exists), and 'a' for appending (which on some Unix systems means that all writes append to the end of the file regardless of the current seek position)
I.e. when you use "w+b", the file is truncated (its existing content is deleted) every time you open it.
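
A small sketch of the difference, assuming Python 3; write_twice is a hypothetical helper that works on a throwaway temporary file. It opens the same file twice with the given mode and returns what ends up in it. Note also that a file object has a write() method but no append() method, which is probably where the attribute error came from: appending is selected through the mode passed to open(), not through a different method.

```python
import os
import tempfile


def write_twice(mode):
    """Open the same file twice with `mode`, writing one line each time."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        for line in ("first\n", "second\n"):
            # Each open() with "w" empties the file; "a" keeps what is there.
            with open(path, mode) as f:
                f.write(line)
        with open(path) as f:
            return f.read()
    finally:
        os.remove(path)


print(repr(write_twice("w")))  # only the second line survives
print(repr(write_twice("a")))  # both lines are kept
```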

Hope this can help

#5 adam1122, Apr 15th, 2009

Originally posted by shaun.b:
Hi, thank you for the reply.

When the code runs, there is only one output file. The problem is that it gets overwritten by the next .html file that is processed.

When I run python filename.py *.html

it processes all the .html files in the folder, but the one CSV file that is written only contains the data read from the last .html file. When I try to use .append instead of .write, the program doesn't run and throws an attribute error.

Does anyone have any ideas how I could do this?

Thanks

Shaun

That's why I suggested you open the CSV file outside of the loop. Why open the CSV every time you open an HTML file? You only need to open it once (and close it once).

You can still take vidaj's advice if you use the same output file and want to maintain the data over multiple runs of your program. But you should also take the advice above. There's no need to open the file more than once.
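
A rough sketch of how the two pieces of advice fit together, assuming Python 3. The names run_once, demo and keep_previous_runs are hypothetical, and the "batches" stand in for the rows produced from each HTML file. Within one run the file is opened and closed exactly once; the mode only decides whether data from earlier runs is kept.

```python
import os
import tempfile


def run_once(csv_path, batches, keep_previous_runs):
    """Simulate one run of the program: one open, many writes, one close."""
    mode = "a" if keep_previous_runs else "w"
    with open(csv_path, mode) as csvfile:
        for batch in batches:       # one batch per input file
            for line in batch:
                csvfile.write(line + "\n")


def demo(keep_previous_runs):
    """Run the program twice against the same output file."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        run_once(path, [["a,1"], ["b,2"]], keep_previous_runs)  # first run
        run_once(path, [["c,3"]], keep_previous_runs)           # second run
        with open(path) as f:
            return f.read().splitlines()
    finally:
        os.remove(path)
```

With keep_previous_runs=False the second run replaces the output of the first; with True the rows from both runs accumulate in the file.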

©2003 - 2009 DaniWeb® LLC