Stemming words in Python

Question

andrewktmeikle 0 Newbie Poster

15 Years Ago

Hello, I'm having a slight problem with my code. The task is to create an indexing program, similar to the ones that Google uses.
The problem I'm having is that we have to remove the common endings from the words left after the removal of stop_words (which is a list variable, not a string variable). I proceed to convert every item in the list, one at a time, to a string with the following code:

leaf_words = "s","es","ed","er","ly","ing"
    
for words in line_stop_words:
  #line_stop_words is the list of words without any "stop words" present, eg only the essential info  
    stemming_word = ""
    
    for chars in words:
    
        print chars
        stemming_word = chars
        
        if stemming_word[-1] == leaf_words:
            
            stemming_word[-1] = ""
            #to remove that letter from the string
        print stemming_word

Two issues I have: first, each time it has finished with the first line of text, it throws an error saying the index is out of bounds. I believe the problem lies in the if statement, because I don't think the for loop is moving to the next item in the line_stop_words list.

Second, it doesn't actually remove the leaf word from the main string (say you have "blows": it doesn't remove the "s").

Any help or advice you can give would be much appreciated.

The rest of the code, so you know what I'm talking about, is:

import string

i = 0
text_input = ""
total_text_input = ""
line = []
n = 0
char = ""

while i != 1:
    text_input = raw_input ("")
    if text_input == ".":
        i = 1
    else:
        new_char_string = "" 
        for char in text_input:
            if char in string.punctuation:
                char = " "
                
            new_char_string = new_char_string + char
            
        line = line + [new_char_string.lower()]
        total_text_input = (total_text_input + new_char_string).lower()
stop_words = "a","i","it","am","at","on","in","of","to","is","so","too","my","the","and","but","are","very","here","even","from","them","then","than","this","that","though"

line_stop_words = []
word_list = ""
sent = ""
word = ""

for sent in line:
    word_list = string.split(sent)
    new_string = ""
    for word in word_list:
        if word  not in stop_words:
            new_string = new_string + word + ";"
    new_string = string.split(new_string,";")
    line_stop_words = line_stop_words +[new_string]

Namibnat 10 Junior Poster in Training

15 Years Ago

Could you please say clearly what you are trying to do? I don't really understand.

Your code looks really messy. I am going to try to clearly ask you a few questions and then you could perhaps rephrase your question.

I proceed to convert every item in the list, one at a time, to a string

You could just create a list of strings right away?

leaf_words = "s","es","ed","er","ly","ing"

Would you like to remove those from the endings of all words? That would cut words like 'as' down to 'a' and 'cover' to 'cov'. Is that what you intend? Otherwise you would need to work with a very large database of English words and some horrid regular expressions.

Two issues I have: first, each time it has finished with the first line of text, it throws an error saying the index is out of bounds. I believe the problem lies in the if statement, because I don't think the for loop is moving to the next item in the line_stop_words list.

Python is so fantastic because you can avoid this mess of if statements and for loops. The more loops, the messier a program gets.

while i != 1:

You have made the code much more complex here than it needs to be. It also seems like something that the user would have to know beforehand. Why not just let them enter a string, split it up, and work from that list of strings?

Something like this:
words = raw_input('Enter your string\n: ')
words_list = words.split()

If you want to remove all punctuation from the list and any 'leaf_words' or whatever, just make a list of all of those, iterate through it, and remove any matches from 'words_list':

check = ['!', '@', '#', '$', 'as', 'is', 'was', 'or whatever']
input = ['one', 'two', 'as', 'zed']
for z in check:
    if z in input:
        input.remove(z)

sent = ""

You don't need to initialize variables in Python. There are times when it can be useful, especially when you want to append to a list or something like that. But for a for loop there is no reason.

And so on. Perhaps make it clearer what you want the program to do. Something like:

1. Get input from a user, split it out as a list of strings.
2. Search for and remove punctuation
3. Search for and remove certain words
4. Search each string for certain endings and remove them
5. Create a new list with the remaining values
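
The five steps above can be sketched roughly as follows. This is only an illustration, assuming Python 3 (the code in this thread is Python 2); the name `clean_words` and the short sample lists are made up here and are not part of the original program.

```python
import string

# Hypothetical sample data; the real lists come from the assignment.
STOP_WORDS = ("a", "is", "it", "that", "the")
ENDINGS = ("ing", "es", "ed", "er", "ly", "s")  # longer endings first


def clean_words(text):
    """Return the words of `text` after steps 1-5 of the outline above."""
    # Step 2: replace punctuation with spaces, then lower-case everything.
    for mark in string.punctuation:
        text = text.replace(mark, " ")
    # Step 1: split the text into a list of strings.
    words = text.lower().split()
    # Step 3: drop the unwanted words.
    words = [w for w in words if w not in STOP_WORDS]
    # Steps 4 and 5: strip at most one ending per word into a new list.
    result = []
    for word in words:
        for ending in ENDINGS:
            if word.endswith(ending):
                word = word[:-len(ending)]
                break
        result.append(word)
    return result


print(clean_words("It is a briskly blowing wind that blows"))
```

Putting the longer endings first is a design choice of this sketch: it lets "ing" or "es" match before the bare "s" does.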

vegaseat 1,735 DaniWeb's Hypocrite

15 Years Ago

In the code line
if stemming_word[-1] == leaf_words:
you are trying to compare a character with a tuple.

# this will form a tuple
leaf_words = "s","es","ed","er","ly","ing"

print( leaf_words )  # ('s', 'es', 'ed', 'er', 'ly', 'ing')
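
One possible way to make that comparison work is to test membership in the tuple, or to hand the whole tuple to `str.endswith`, which accepts a tuple of suffixes. A minimal sketch, assuming Python 3; `has_leaf_ending` is a hypothetical helper name, not something from the original code.

```python
leaf_words = "s", "es", "ed", "er", "ly", "ing"  # a tuple of endings


def has_leaf_ending(word):
    """Return True if `word` ends with any of the endings in leaf_words."""
    # str.endswith accepts a tuple and checks every suffix in it.
    return word.endswith(leaf_words)


# Comparing one character with the whole tuple is never true ...
print("blows"[-1] == leaf_words)
# ... while a membership test or endswith gives the intended answer.
print("blows"[-1] in leaf_words)
print(has_leaf_ending("blowing"), has_leaf_ending("wind"))
```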

Mathhax0r 2 Junior Poster in Training · Answer 1 · 2009-12-08T00:29:22+00:00

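def removePostfix(argWord):
    # Strip the first matching ending from argWord. "s" is tested first,
    # so "es" only gets a chance if the tuple is reordered.
    leaves = "s", "es", "ed", "er", "ly", "ing"
    for leaf in leaves:
        if argWord[-len(leaf):] == leaf:
            return argWord[:-len(leaf)]
    # No ending matched: return the word unchanged instead of None.
    return argWord

Here's a fix for your first problem.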

Mathhax0r 2 Junior Poster in Training · Answer 2 · 2009-12-08T00:41:59+00:00

Ah, I didn't even take into consideration what Namibnat mentioned. Though it's true you'll have to use some sort of dictionary for what you intend, I don't think it'll be that hard. You can go to nltk.org for a natural language processing module and go from there.

andrewktmeikle 0 Newbie Poster · Answer 3 · 2009-12-08T04:10:43+00:00

Hey, wow, thanks for such a rapid response. Okay, for clarity, the task is as follows:
i) Read in lines of text, one at a time, keeping track of the line numbers, stopping when a line is read that contains only a "."
ii) Remove punctuation, and change all characters to lower case
iii) Remove stop words (also included in the text I gave)
iv) Stem the words (the common endings are the leaf words)
v) Add the remaining words to an index (stating how many times the word appears and in what line)
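
Step v could be sketched with a dictionary that maps each word to its count and the lines it appears on. This is only an illustration, assuming Python 3 and using a dictionary, which goes beyond the restricted set of features allowed in the assignment; `build_index` and the sample data are hypothetical.

```python
def build_index(lines_of_words):
    """Map each word to [occurrence count, list of 1-based line numbers]."""
    index = {}
    line_number = 0
    for words in lines_of_words:
        line_number += 1
        for word in words:
            if word not in index:
                index[word] = [0, []]
            entry = index[word]
            entry[0] += 1  # total occurrences of the word
            if line_number not in entry[1]:
                entry[1].append(line_number)  # record each line only once
    return index


# Hypothetical input: the words left on each line after steps i to iv.
sample = [
    ["brisk", "blow", "wind", "blow"],
    ["north", "north", "youth"],
    ["wind", "cold", "cold"],
]
for word, (count, line_numbers) in sorted(build_index(sample).items()):
    print(word, count, line_numbers)
```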

The test case I've been using is:

It is a briskly blowing wind that blows
from the north, the North of my youth.
The wind is cold too, colder than the
winds of yesteryear.

I have written the program the way the problem was written, hence the rather odd structure of my code, which I apologise for. I'm still in the learning phase of Python.

I am limiting myself to using only the lower and split functions of the string module, and only if, for and while constructs.

In response to Namibnat, I was referring to the process of taking every line of text that is input and storing it in a list, in order to preserve the line structure, e.g. ([It is a briskly blowing wind that blows], [from the north, the North of my youth.], etc.)

No, I'm not trying to write a program so complex that it includes the whole English language. It is acceptable to stem each word only once, because if it were done more than once, "dresses", for example, would become "dr".
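
The difference between stemming once and stemming repeatedly can be seen in a small sketch. It assumes Python 3 and checks the longer endings first; `stem_once` and `stem_repeatedly` are hypothetical names, not part of the program in this thread.

```python
LEAF_WORDS = ("ing", "es", "ed", "er", "ly", "s")  # longer endings first


def stem_once(word):
    """Strip at most one ending from `word`."""
    for leaf in LEAF_WORDS:
        # Keep at least one character so a word is never reduced to "".
        if len(word) > len(leaf) and word.endswith(leaf):
            return word[:-len(leaf)]
    return word


def stem_repeatedly(word):
    """Keep stripping endings until none matches (usually too aggressive)."""
    stemmed = stem_once(word)
    while stemmed != word:
        word = stemmed
        stemmed = stem_once(word)
    return word


print(stem_once("dresses"), stem_repeatedly("dresses"))
```

With this ordering a single pass turns "dresses" into "dress", while repeated passes go on to "dres" and then "dr".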

The bank of word endings I want to remove are the "leaf words". If I were to make the program completely functional, I would have some kind of algorithm that compared each stemmed word against a dictionary library, but at the moment I'm just trying to prove that the stemming process works.

Mathhax0r, you make a good point with the code you posted for me. I hadn't thought of using the for loop like that.

Vegaseat, should the process then be:

while i < len(leaf_words):
      if stemming_word[-1] == leaf_words[i]:

then continue from there?

Thanks again, very insightful comments, much appreciated.

andrewktmeikle 0 Newbie Poster · Answer 4 · 2009-12-09T01:36:22+00:00

Thanks, guys, for your help. I've spent most of today working on it and all the functionality is working correctly.

Thanks again for all your comments.

andrewktmeikle 0 Newbie Poster · Answer 5 · 2009-12-09T01:39:24+00:00

I'll post my working code, fully cleaned up and functional:

import string

def RemovePunc():
    line = []
    i = 0
    text_input = ""
    total_text_input = ""
    #This part removes the punctuation and converts input text to lowercase
    while i != 1:
        text_input = raw_input ("")
        if text_input == ".":
            i = 1
        else:
            new_char_string = "" 
            for char in text_input:
                if char in string.punctuation:
                    char = " "
                    
                new_char_string = new_char_string + char
                
            line = line + [new_char_string.lower()]
            #This is a list with all of the text that was entered in
            total_text_input = (total_text_input + new_char_string).lower()
    return line

def RemoveStopWords(line):
    stop_words = "a","i","it","am","at","on","in","of","to","is","so","too","my","the","and","but","are","very","here","even","from","them","then","than","this","that","though"
    # This part removes the stop words from the list of input lines
    line_stop_words = []
    for sent in line:
        word_list = string.split(sent)
        new_string = ""
        for word in word_list:
            if word  not in stop_words:
                new_string = new_string + word + " "
        new_string = string.split(new_string)
        line_stop_words = line_stop_words + [new_string]
    return(line_stop_words)


def StemWords(line_stop_words):
    leaf_words = "s","es","ed","er","ly","ing"
    i=0
    while i < len(leaf_words):
        count = 0
        length = len(leaf_words[i])
        while count < len(line_stop_words):
            line = line_stop_words[count]
            count2 = 0
            while count2 < len(line):
                # line is the particular list (or line) that we are dealing with; count2 is the index of the specific word
                if leaf_words[i] == line[count2][-length:]:
                    line[count2] = line[count2][:-length]
                count2 = count2 + 1
            line_stop_words[count] = line
            count2 = 0
            count = count + 1
        count = 0
        i = i + 1
    return(line_stop_words)

def indexDupe(lineCount,occur):
    if str(lineCount) in occur:
        return True
    else:
        return False

def Indexing(line_stop_words):
    line_limit = len(line_stop_words)
    index = []
    line_count = 0

    while line_count < line_limit:
        for x in line_stop_words[line_count]:
            count = 0
            while count <= len(index):
                if count == len(index):
                    index = index + [[x,[str(line_count+1)]]]
                    break
                else:
                    if x == index[count][0]:
                        if indexDupe(line_count+1,index[count][1]) == False:
                            # Append the number as one item; += with a bare string would add it digit by digit
                            index[count][1] += [str(line_count+1)]
                        break
                    
                        
                count = count + 1

        line_count = line_count + 1
    return(index)


def OutputIndex(index):
    
    print "Index:"
    count = 0
    indexLength = len(index)
    while count < indexLength:
        print index[count][0],
        count2 = 0
        lineOccur = len(index[count][1])
        while count2 < lineOccur:
            print index[count][1][count2],
            if count2 == lineOccur -1:
                print ""
                break
            else:
                print ",",
            count2 += 1
            
        count += 1

line = RemovePunc()   
line_stop_words = RemoveStopWords(line)
line_stop_words = StemWords(line_stop_words)    
index = Indexing(line_stop_words)
OutputIndex(index)

Just for anyone reading that is interested.
