filter out specific words in the text file

Question

mysticstylez 0 Newbie Poster

14 Years Ago

Hi,

So, I have a list that contains some words. I need to read a text file and cross-reference it with each word in the list. If a word occurs in the text file, I need to filter it out with "*". How would I go about doing that? And would it be easier to read the text file into a list and then cross-reference it with the filter list?

python


griswolf 304 Veteran Poster

14 Years Ago

If you need to handle arbitrarily large input files, then you need to read one line and write one line in a loop (you could, of course, read some lines and write some lines, but there is no easy API for that).

If you know that the input is short enough to hold in memory, then readlines() works fine.

The heart of the code would be something like:

for line in fileList:
  saver = []
  words = line.split()
  for word in words:
    if word in proscribedList:
      saver.append("")  # drops the word; append "*" here to mask it instead
    else:
      saver.append(word)
  outputLine = ' '.join(saver)
  print >> output, outputLine  # print already appends the newline
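The read-one-line, write-one-line loop described above can be sketched as follows. This is a minimal illustration written for Python 3 (the thread's own snippets are Python 2), and the names `filter_stream`, `banned_words`, `source` and `target` are made up for the example. It masks each banned word with asterisks of the same length, which is one reading of what the question asks for.

```python
import io

def filter_stream(infile, outfile, banned):
    """Copy infile to outfile line by line, masking banned words with asterisks."""
    for line in infile:  # iterating a file object yields one line at a time
        masked = ["*" * len(w) if w in banned else w for w in line.split()]
        # As in the post above, runs of whitespace collapse to single spaces.
        outfile.write(" ".join(masked) + "\n")

# In-memory files stand in for real ones so the sketch runs anywhere.
banned_words = {"lol", "emo"}
source = io.StringIO("that was lol funny\nno emo here\n")
target = io.StringIO()
filter_stream(source, target, banned_words)
print(target.getvalue(), end="")
```

Because only one line is held at a time, the same function should work on a real file opened with `open(...)` regardless of its size.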

TrustyTony 888 pyMod

14 Years Ago

Consider which of the following words must be replaced if 'of' is in the word list:

'''off shore
switch off
often
"of course"
of.'''
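One possible answer to that puzzle is to match whole words only, using a regular expression with `\b` word boundaries. The sketch below is Python 3 and the helper name `mask_words` is hypothetical. Whether the 'of' inside "of course" should be masked is a policy choice the post leaves open; this version masks it.

```python
import re

def mask_words(text, banned):
    """Replace each banned word, matched as a whole word, with asterisks."""
    # \b matches at word boundaries, so "of" is not found inside "off" or "often".
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, banned)) + r")\b")
    return pattern.sub(lambda m: "*" * len(m.group(0)), text)

sample = 'off shore\nswitch off\noften\n"of course"\nof.'
masked = mask_words(sample, ["of"])
print(masked)
```

With this pattern only the last two lines change: the quoted phrase becomes `"** course"` and the final line becomes `**.`, while the punctuation is left in place. Matching is case-sensitive here; passing `re.IGNORECASE` to `re.compile` would relax that.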


kur3k -3 Light Poster · Answer 1 · 2010-05-18T05:12:38+00:00

Hi

My English is not good and I don't understand the whole question, but I wrote some code:

# -*- coding: utf-8 -*-
import sys

try:
    _file = open("file.txt", "r")
except IOError:
    print "> I don't see file.txt!"
    sys.exit(1)  # nothing to filter without the input file

# map each banned word to a run of asterisks of the same length
replaces = { "lol": len("lol") * "*",
             "emo": len("emo") * "*" }

read = _file.read()
_file.close()

for i in read.split():
    if i in replaces:
        print replaces[i],
    else:
        print i,

woooee 814 Nearly a Posting Maven · Answer 2 · 2010-05-18T21:42:09+00:00

Consider which of the following words must be replaced if 'of' is in the word list:
'''off shore
switch off
often
"of course"
of.'''

The simplest way to handle that is to use a dictionary of lists with the key being the length of the word and the list containing all words of that length, so the modified code would be

if word in words_dict.get(len(word), []):

Good luck on the rest of the code.
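A minimal Python 3 sketch of that dictionary of lists might look like the following. The names `build_words_dict` and `is_banned` are invented for the illustration; only `words_dict` comes from the post.

```python
from collections import defaultdict

def build_words_dict(banned):
    """Group banned words by length, e.g. {2: ["of"], 3: ["lol", "emo"]}."""
    words_dict = defaultdict(list)
    for w in banned:
        words_dict[len(w)].append(w)
    return words_dict

words_dict = build_words_dict(["of", "lol", "emo"])

def is_banned(word):
    # .get avoids adding empty entries for lengths that were never seen.
    return word in words_dict.get(len(word), [])

print(is_banned("of"), is_banned("off"), is_banned("of."))
```

This prints `True False False`: "off" is rejected, but so is "of." because the full stop stays attached to the word after `split()`. Exact membership in a plain set gives the same three answers, so the grouping by length mainly shortens the list that has to be searched.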

mysticstylez 0 Newbie Poster · Answer 3 · 2010-05-20T01:54:07+00:00

Thanks, guys, for your help. I like your approach, griswolf. I was able to filter out the words in the list, but now when I try to write the result to a text file, it only writes the last line. Here's the code:

import os

os.system('CLS')


filterFile = open("c:\python26\project3\Filter.txt", 'r')
filter = []
for currentline in filterFile:
    line = currentline.strip()
    filter.append(line)

print filter

scriptFile = open("c:\python26\project3\script.txt", 'r')
script = []
for currentline in scriptFile:
    line = currentline.strip()
    script.append(line)
    

for lines in script:
    results = []
    words = lines.split()
    for word in words:
        if word in filter:
            results.append("*******")
        else:
            results.append(word)
            
    print results
    
resultFile = open(r"c:\python26\project3\result.txt", 'w')  # raw string: "\r" would otherwise be read as a carriage return
for items in results:
    resultFile.write(items)

The script file has about 44 lines of text.

griswolf 304 Veteran Poster · Answer 4 · 2010-05-20T06:08:39+00:00

Well, some comments first

* You should use a set to hold the filter words, not a list.

* You are writing all the output on a single line in the result file (because you have strip()ed off the newlines at line 17).

* Your original statement of the problem wanted to preserve input lines with some words filtered out. You aren't doing that here. (My original code only approximated that by replacing any white space in the original with a single space in the output)

* You are badly using singular and plural nouns to name your variables: Array/list/set should have a plural name, but each item from that aggregate is singular. Naming things as you do is not syntactically wrong, but it is confusing to those of us who are native English speakers.

* I don't see what you are doing wrong to not get all the output. In production, I would have used something more like this:

import sys

resultFile = None
try:
  resultFile = open(theOutputFilePath, 'w')
  for item in results:
    print >> resultFile, item  # this appends a newline
except Exception, x:
  print >> sys.stderr, "Drat. Error '%s' of type %s" % (x, type(x))
finally:
  if resultFile:
    resultFile.close()

but even without the close(), when Python exits, it should close (and therefore flush) the output file.
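On the "only writes the last line" symptom: in the script posted in Answer 3, `results = []` sits inside the `for lines in script:` loop, so the list is emptied for every line, and the write loop only runs after that loop has finished. By then `results` holds just the words of the final line. A minimal Python 3 sketch of one way around this, keeping one filtered string per input line, is shown below. The function name `filter_lines` and the sample data are made up for the example.

```python
def filter_lines(lines, banned, mask="*******"):
    """Return one filtered output line per input line."""
    output_lines = []  # created once, outside the loop, so nothing is lost
    for line in lines:
        words = [mask if w in banned else w for w in line.split()]
        output_lines.append(" ".join(words))
    return output_lines

banned = {"darn", "heck"}  # a set, as suggested in the comments above
script = ["what the heck is this", "darn it", "all clean here"]
result = filter_lines(script, banned)
print(result)

# Writing the result is then one line per entry (the path is a placeholder):
# with open("result.txt", "w") as out:
#     out.write("\n".join(result) + "\n")
```

Joining the words with spaces and the lines with newlines also addresses the single-line output noted in the comments, since `write()` adds no separators of its own.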