Remove non-stop words from a file

Question

kshw 3 Newbie Poster

13 Years Ago

Hi,

I'm trying to remove non-stop words from a text file using regular expressions, but it is not working. I used a pattern like ('^[a-z]?or') to avoid removing (or) from the middle of words, e.g. morning.

import re

Temp = []
Original_File = open('out.txt', 'r')
Original_File_Content = Original_File.read()
Original_File.close()

Temp.append("".join(Original_File_Content))

FileString = "".join(Temp)

p = re.compile( "^[a-z]?is|^[a-z]?or|^[a-z]?in")
RemoveWords = p.sub( '', FileString)

Thanks

python


Agni commented: excellent first post !! welcome to daniweb +3

All 3 Replies

woooee 814 Nearly a Posting Maven

13 Years Ago

This is trivial without regular expressions. Read one record, split it into words, and test each word; then read the next record, and so on. Note that the following two lines probably don't do anything, as Original_File_Content is already one long string. See Section 7.2.1 for clarification: http://docs.python.org/release/2.5.2/tut/node9.html#SECTION009200000000000000000. If it is a very large file, then converting the words to a set and comparing against a set of stop words using set.difference() would be faster.

Temp.append("".join(Original_File_Content))
 
FileString = "".join(Temp)
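
A minimal sketch of the record-by-record approach described above, assuming whitespace-separated words and a tiny illustrative stop-word set. The names STOP_WORDS and remove_stop_words are made up for this example and are not from the original post:

```python
# Illustrative stop-word set (an assumption); the real list would be longer.
STOP_WORDS = {"is", "or", "in"}


def remove_stop_words(lines, stop_words=STOP_WORDS):
    """Return each line with whole-word stop words dropped."""
    cleaned = []
    for line in lines:
        # split() breaks on whitespace, so "morning" stays one word and
        # is never confused with the stop word "or".
        kept = [word for word in line.split() if word.lower() not in stop_words]
        cleaned.append(" ".join(kept))
    return cleaned


sample = ["The morning is cold or warm", "Nothing in particular"]
print(remove_stop_words(sample))

# For a very large file where only the distinct words matter,
# set.difference() removes all stop words in one step.
unique_words = set(" ".join(sample).lower().split())
print(sorted(unique_words.difference(STOP_WORDS)))
```

Any iterable of strings works as the first argument, so an open file object could probably be passed in directly instead of reading the whole file into one string first.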



kshw 3 Newbie Poster · Answer 1 · 2010-06-29T22:39:28+00:00

Thank you..

Well, I was testing my work on text files, but the bigger picture is that I'm working with BeautifulSoup objects. So they have to be converted to strings before I can manipulate the text. It has been successful so far except for the non-stop words.

html = urllib2.urlopen(someurl.html).read()
soup = BeautifulSoup(html)

# Remove tags and non-stop words
p = re.compile( "<.*?>|^[a-z]?is|^[a-z]?or|^[a-z]?in")
RemoveWords = p.sub( '' , str(soup))

p = re.compile( r'\W+' )
WordList = p.split(RemoveWords)

I provided a minimal number of non-stop words for simplicity. There are more than 200 non-stop words, and it will be difficult (and I assume inefficient) to test every word in my list against the non-stop words. That's why I considered re.

woooee 814 Nearly a Posting Maven · Answer 2 · 2010-06-30T21:45:32+00:00

It will be difficult (and I assume inefficient) to test every word in my list to non-stop words

There is no other way to do it no matter what method is used. Every word has to be checked somehow. 200 words is not worth worrying over. 200,000 words would require some tweaking. If you are concerned about the amount of time it will take, then consider using a set or dictionary as they are indexed via a hash.
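
To make the set suggestion concrete, here is a hedged sketch that applies it to a word list produced by a \W+ split like the one in the post above. The shortened stop-word list and the names filter_words and stop_pattern are hypothetical. For comparison, it also builds a regular expression with \b word boundaries, since a pattern such as ^[a-z]?or can only match at the very start of the string unless re.MULTILINE is set:

```python
import re

# Hypothetical, shortened stop-word list (an assumption for illustration).
STOP_WORDS = frozenset(["is", "or", "in"])


def filter_words(text, stop_words=STOP_WORDS):
    """Split text on non-word characters and drop stop words."""
    words = re.split(r"\W+", text)
    # Membership in a set is a hash lookup, so the cost per word does
    # not grow with the number of stop words.
    return [w for w in words if w and w.lower() not in stop_words]


# Regex alternative: \b matches only at word boundaries, so the "or"
# inside "morning" is left alone. re.escape guards against any special
# characters in the stop-word list.
stop_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in sorted(STOP_WORDS)) + r")\b",
    re.IGNORECASE,
)

text = "Morning is here, or so it seems in print."
print(filter_words(text))
print(stop_pattern.sub("", text))
```

The regex version leaves doubled spaces where words were removed, which is one reason splitting first and filtering the list may be the simpler route when a word list is the end goal anyway.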
