
Hello,

I'm manipulating a large number of HTML files (about 2000). I'm new to Python and would appreciate any help.

For example, I want to delete the lines that contain "embed_music" in all the files, or change all instances of the word "Paragraph" to "Absatz".

This is my pseudo-code:

open target folder of html files (/project/html/)
read in all html files
*do the stuff here:
check for lines containing "embed_music", if yes delete
string replace for words with "Paragraph" to "Absatz"
*
close folder

Is my logic correct? Would appreciate any help or feedback!


To get you started (untested):

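import os

aDirectory = '/project/html/'  # folder from the question; adjust as needed

for name in os.listdir(aDirectory):
    # skip anything that is not an HTML file
    if not name.lower().endswith('.html'):
        continue
    path = os.path.join(aDirectory, name)
    # binary mode keeps the bytes and line endings intact, so match against bytes literals
    with open(path, 'rb') as fileobj:
        output = [line.replace(b"Paragraph", b"Absatz") for line in fileobj.readlines() if b"embed_music" not in line]
    # overwrite the original file with the filtered lines
    with open(path, 'wb') as fileobj:
        fileobj.write(b''.join(output))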

You may want to back up the folder before you run the code, since it overwrites every file in place.
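For the backup itself, something along these lines should do. It is only a minimal sketch: the function name backup_folder and the "_backup_" suffix are my own choices for illustration, and it assumes there is enough disk space for a full copy next to the original folder.

```python
import os
import shutil
import time


def backup_folder(folder):
    """Copy folder to a timestamped sibling directory and return the new path."""
    folder = os.path.abspath(folder)
    stamp = time.strftime('%Y%m%d-%H%M%S')
    backup = '%s_backup_%s' % (folder, stamp)
    # copytree raises an error if the target already exists,
    # so an earlier backup is never silently overwritten
    shutil.copytree(folder, backup)
    return backup
```

Calling backup_folder('/project/html/') before the loop would leave an untouched copy to fall back on if the replacement goes wrong.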

Edited by ihatehippies: n/a


Nice, ihatehippies. I only merged the two with statements so that the lines go straight to the output, which avoids the temporary list (if you do a lot of processing, it might still pay off to build a list and write from that). I also renamed aDirectory to follow the PEP 8 convention of words joined by underscores, and removed the unnecessary readlines() call, since fileobj can be iterated directly to yield lines. Note that this version writes each result to a new file with a prefix instead of overwriting the original.

import os

html_directory = r'I:\python27\Lib\site-packages\pygame\docs\ref'
pre = 'absatz'
for name in os.listdir(html_directory):
    # skip non-HTML files and files already produced by an earlier run
    if name.lower().endswith('.html') and not name.startswith(pre):
        source = os.path.join(html_directory, name)
        target = os.path.join(html_directory, '%s_%s' % (pre, name))
        # binary mode, so match against bytes literals
        with open(source, 'rb') as fileobj, open(target, 'wb') as output:
            output.write(b''.join(line.replace(b"Paragraph", b"Absatz")
                                  for line in fileobj if b"embed_music" not in line))

Of course, you should never use this kind of line-based code on HTML files. Use BeautifulSoup 3, BeautifulSoup 4, or a similar parser to preserve the structure of the files.
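To illustrate the parser idea without installing anything, here is a rough sketch that uses the standard library's html.parser instead of BeautifulSoup (Python 3). It re-emits the document as it goes and replaces the word only inside text nodes, so attribute values, tag names, and script/style content are left alone. The names TextReplacer and replace_in_text are made up for this example, and it only covers the "Paragraph" replacement, not the removal of the "embed_music" lines.

```python
from html.parser import HTMLParser

RAW_TEXT_TAGS = ('script', 'style')


class TextReplacer(HTMLParser):
    """Re-emit the markup, replacing a word only inside text nodes."""

    def __init__(self, old, new):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.old, self.new = old, new
        self.parts = []
        self.in_raw = False

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as it appeared
        self.parts.append(self.get_starttag_text())
        if tag in RAW_TEXT_TAGS:
            self.in_raw = True

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append('</%s>' % tag)
        if tag in RAW_TEXT_TAGS:
            self.in_raw = False

    def handle_data(self, data):
        # leave script and style content untouched
        if not self.in_raw:
            data = data.replace(self.old, self.new)
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        self.parts.append('&#%s;' % name)

    def handle_comment(self, data):
        self.parts.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.parts.append('<!%s>' % decl)

    def handle_pi(self, data):
        self.parts.append('<?%s>' % data)


def replace_in_text(html, old, new):
    parser = TextReplacer(old, new)
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)
```

A plain str.replace would also rewrite class="Paragraph" or a JavaScript variable named Paragraph; this version only touches the visible text. End tags come back lowercased, so the output is not guaranteed to be byte-identical to the input.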

Edited by pyTony: n/a
