I'm manipulating large quantities of HTML files (about 2000). I'm new with Python, and would appreciate any help.

For example, I want to delete the lines that contain "embed_music" in all the files, or change all instances of the word "Paragraph" to "Absatz".

This is my pseudo-code:

open target folder of html files (/project/html/)
read in all html files
*do the stuff here:
check for lines containing "embed_music", if yes delete
string replace for words with "Paragraph" to "Absatz"
close folder

Is my logic correct? Would appreciate any help or feedback!

to get you started...


import os

for name in os.listdir(aDirectory):
   if not name.lower().endswith('.html'):
   with open(os.path.join(aDirectory, name), 'rb') as fileobj:
      output = [line.replace("Paragraph", "Absatz") for line in fileobj.readlines() if not "embed_music" in line]
   with open(os.path.join(aDirectory, name), 'wb') as fileobj:

You may want to create a backup of the folder before you run the code....

Nice ihatehippies, I only unified the with to produce lines to output immediately to avoid temporary list (if you do lot of processing it might still pay off to use one and output from that), changed the name aDirectory to follow PEP8 convetion of words joined by underscore and removed unnecessary readlines as we can use fileobj directly as generator of lines.

import os
html_directory = r'I:\python27\Lib\site-packages\pygame\docs\ref'
pre = 'absatz'
for name in os.listdir(html_directory):
    if name.lower().endswith('.html') and not name.startswith(pre):
        with open(os.path.join(html_directory, name), 'rb') as fileobj, open(os.path.join(html_directory, '%s_%s' % (pre, name)), 'wb') as output:
          output.write(''.join(line.replace("Paragraph", "Absatz")
                               for line in fileobj if not "embed_music" in line))

Of course you never should use this kind of code for html file but use Beautifulsoup3/Beautifulsoup4 or similar to preserve the structure of files.

yeah I wasn't sure if you could open a file for writing if it was currently being read