Text Manipulation Help

Question

Garee 0 Newbie Poster

15 Years Ago

For one of my university projects we have been assigned a problem to complete. My code works fine for the example output provided; however, I need some help with a few errors that show up with different inputs.

I am not asking you to do this for me, as I have done most if not all of the program. I just need some help with the errors and with general ways to make the code more presentable.

Here is the problem and my code below:

You are to write an indexing program that will record and print out on which lines particular
words appear in a piece of text supplied as input by the user. Hence, the index you generate
will look like a book index, but each index entry will have a word followed by the line
numbers on which the word appears, rather than the page numbers.
Specifically, your program should:
a) read in lines of text one at a time, keeping track of the line numbers, stopping when a
line is read that contains only a single full-stop;
b) remove punctuation (as specified below) and change all text to lowercase;
c) remove stop words (the stop word list is specified below);
d) stem the words (the common endings to look out for are specified below);
e) add the remaining words to the index – a word should appear only once in the index
even though it may appear many times in the text, and the line numbers on which it
appears (removing duplicates) should be recorded with the word;
f) print the index, using exactly the format below, once all lines have been entered.

import string

pMarks = ".,:;!?&'"

sWords = ['a','i','it','am','on','in','of','to','is','so', \
          'too','my','the','and','but','are','very','here','even','from', \
          'them','then','than','this','that','though']

endings = ['s','es','ed','er','ly','ing']

def removePunc(text):
    nopunc = ""
    for char in text:
        if char not in pMarks:
            nopunc = nopunc + char
    return nopunc.lower().split()

def removeStop(text):
    nostop = []
    for word in text:
        if word not in sWords:
            nostop.append(word)
    return nostop

def stemWords(words):
    for wrd in words:
        for n in range(1,4):
            if wrd[-n:] in endings:
                index = words.index(wrd)
                words.remove(wrd)
                words.insert(index,wrd[:-n])
    return words

def removeDuplicates(words):
    nodupe = []
    for wrd in words:
        if wrd not in nodupe:
            nodupe.append(wrd)
    return nodupe

def main():
    lines = []
    textTwo = ""

    text = raw_input("Indexer: type in lines, finish with a . at start of line only \n")
    if text == ".":
        exit()
    lines.append(text)

    while textTwo != ".":
        textTwo = raw_input()
        lines.append(textTwo)
        text = text + " " + textTwo
        if textTwo == ".":
            lines = lines[:len(lines)-1]

    text = removePunc(text)
    text = removeStop(text)
    text = stemWords(text)
    text = removeDuplicates(text)

    print "The Index is:"
    for word in text:
        lineNumbers = []
        for l in lines:
            if word in l:
                lineNumbers.append(lines.index(l)+1)
        print word, lineNumbers

main()

What could be done to ensure that the words are stemmed fully and correctly? For example, if I had "annoyingly" or "sings", they contain more than one ending.

Also for the output, my code prints out "wind [1,3,4]" instead of "wind 1, 3, 4".

Also, we are not allowed to use any code that we haven't covered in the course so far, so only the basic operations can be used.

Any help would be great thanks.


slate 241 Posting Whiz in Training

15 Years Ago

What is the expected output for sings?
The empty string?
If you remove s, ing, and s from the word in that order, the empty string remains...

If that is so,
I would make a stem_word function that would look like:

def stem_word(word):
    word_new = None
    for e in endings:
        if word.endswith(e):
            word_new = word[:len(word) - len(e)]  # strip the matched ending
            break
    else:
        return word  # no stemming was made
    return stem_word(word_new)  # keep stemming what is left

If sing is the expected output, then what is the expected output for comings?

I would say this is not a trivial algorithm, and that has nothing to do with Python per se.
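If "sing" is the wanted result, one possible approach is to refuse any strip that would leave too short a stem. The sketch below is only an illustration under that assumption: the MIN_STEM threshold of 3 and the "longest ending first" rule are my own choices, not part of the assignment, and the function name is reused from above purely for convenience. It sticks to loops and string methods, and print is written with parentheses so it runs on Python 2 and 3 alike.

```python
ENDINGS = ['s', 'es', 'ed', 'er', 'ly', 'ing']
MIN_STEM = 3  # assumption: never cut a word down to fewer than 3 letters

def stem_word(word):
    """Repeatedly strip the longest ending that still leaves a long enough stem."""
    changed = True
    while changed:
        changed = False
        best = ""
        for e in ENDINGS:
            long_enough = len(word) - len(e) >= MIN_STEM
            if word.endswith(e) and len(e) > len(best) and long_enough:
                best = e
        if best != "":
            word = word[:len(word) - len(best)]
            changed = True  # something was stripped, so look for another ending
    return word

print(stem_word("sings"))       # sing
print(stem_word("annoyingly"))  # annoy
```

It is still crude: "commings" comes out as "comm", so the expected results really need to be pinned down by the assignment first.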

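The desired "wind 1, 3, 4" output can be achieved if the line:
print word, lineNumbers
is changed to:
print word, ", ".join(str(n) for n in lineNumbers)
(join only accepts strings, so each line number has to be converted with str first.)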

BTW for the line:

if wrd not in nodupe:

I would use a set (or dict) for this. The in operator is far more efficient on a hash table. The same goes for sWords and endings.
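A minimal sketch of what that could look like, assuming sets are allowed in the course. The stop word list is shortened for illustration and the function names are hypothetical stand-ins for removeStop and removeDuplicates:

```python
# Stop words as a set: "in" does a hash lookup instead of scanning a list.
# (Shortened list, for illustration only.)
stop_words = set(['a', 'i', 'it', 'the', 'and'])

def remove_stop(words):
    kept = []
    for word in words:
        if word not in stop_words:
            kept.append(word)
    return kept

def remove_duplicates(words):
    seen = set()   # fast "have I seen this word?" checks
    unique = []    # keeps first-seen order, which a set alone would not
    for word in words:
        if word not in seen:
            seen.add(word)
            unique.append(word)
    return unique

print(remove_duplicates(remove_stop(['the', 'wind', 'and', 'wind', 'rain'])))
```

The list is kept alongside the set because a set does not remember the order in which words were added.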


masterofpuppets · Answer 1 · 2009-12-05T06:53:32+00:00

hi,
well, here's my suggestion for the stem function. I hope it can solve your problem

endings = ['s','es','ed','er','ly','ing']

def stemWords( words ):
    for wrd in words:
        index = words.index( wrd )
        for end in endings:
            # caution: rstrip removes any trailing characters found in end,
            # not the exact suffix
            wrd = wrd.rstrip( end )
        del words[ index ]
        words.insert( index, wrd )

    return words

print stemWords( [ "annoyingly", "sings" ] )

>>> 
['annoy', 's']
>>>

I think I have not tested this enough so there may be problems with it :)
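One likely source of such problems is rstrip itself: it treats its argument as a set of characters and keeps removing any of them from the right, so it can eat more than the ending. A small sketch of the difference, where strip_suffix is a hypothetical helper and not something from the code above:

```python
def strip_suffix(word, ending):
    # Hypothetical helper: remove `ending` only if the word really ends with it.
    if ending != "" and word.endswith(ending):
        return word[:len(word) - len(ending)]
    return word

print("seed".rstrip("ed"))         # s   (every trailing 'e' or 'd' is removed)
print(strip_suffix("seed", "ed"))  # se  (only the exact suffix is removed)
```

Swapping a helper like this in for rstrip keeps the same loop but strips whole endings only.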

As for the printing: it prints as a list because it is a list, so you need a small loop to print the elements, like this maybe:

print "The Index is:"
for word in text:
    lineNumbers = []
    for l in lines:
        if word in l:
            lineNumbers.append(lines.index(l)+1)
    print word,    # trailing comma keeps the cursor on the same line
    for l in lineNumbers:
        print l,   # trailing comma again, so the numbers stay on one line
    print   # moves the cursor a line down

hope this is helpful :)
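Two things in the printing loop can still give wrong line numbers: "word in l" is a substring test against the raw line, and lines.index(l) always returns the first of two identical lines. Below is a rough sketch of one way around both, recording the line number while each line is processed. build_index and format_entry are made-up names, the stop word list is shortened, stemming is left out, and it assumes a dict is allowed in the course. It is written so it runs on Python 2 and 3.

```python
PUNCTUATION = ".,:;!?&'"  # same marks as pMarks in the question
STOP_WORDS = ['a', 'i', 'it', 'the', 'and', 'is', 'in']  # shortened for illustration

def build_index(lines):
    index = {}   # word -> list of line numbers, without duplicates
    order = []   # words in the order they were first seen
    line_no = 0
    for line in lines:
        line_no = line_no + 1
        cleaned = ""
        for ch in line:
            if ch not in PUNCTUATION:
                cleaned = cleaned + ch
        for word in cleaned.lower().split():
            if word in STOP_WORDS:
                continue
            # a stemming step would go here
            if word not in index:
                index[word] = []
                order.append(word)
            if line_no not in index[word]:
                index[word].append(line_no)
    return order, index

def format_entry(word, line_numbers):
    parts = []
    for n in line_numbers:
        parts.append(str(n))  # join needs strings, not ints
    return word + " " + ", ".join(parts)

lines = ["The wind is cold.", "Rain and snow", "Wind, wind!", "Cold wind"]
order, index = build_index(lines)
for word in order:
    print(format_entry(word, index[word]))  # first line printed: wind 1, 3, 4
```

Because each word is looked up as a whole key, "win" would no longer match a line that only contains "wind".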
