I have a strange problem: I wrote a for loop to check a list of words for matches, but it is not checking all of the words.

I am using two files to check for matching words (Enwords.txt, a list of English words, and Encontract.txt, a list of English contractions). I am able to build a list of words that match the first file and a corresponding list of words that don't. But when I run the non-matched list against the Encontract.txt file, it does not check all the words. Here is my code:

fullwords = open("Enwords.txt").read()
contrwords = open("Encontract.txt").read()
wordlist = []
nonwordlist = []
failedwords = []
string4 = "I have already explained this I thought. Okay here it is again. I beleive that the laws of physics apply to the universe and always have. Those laws do not allow the universe to be only 6000 years old. So the only way in which God could be responsible for the creation of the universe is if the bible is incorrect in its description of creation. That doesn't mean there is no God, but only that if he does exist, he is not the author of the bible if he meant to have Genesis taken literally. This is what I believe. There is no way around it. If what you say is actually true, then all the laws of physics are malarky. I cannot accept that. The big bang happend 13.7 billion years ago. Period, So if God caused it, Genesis is simply wrong. If he didn't cause it, then there most likely is no God. I can't think of another way to put it."
testdoc = string4

stripped_text = ""
for c in testdoc:
    if c in "!@#$%^&*().[],{}<>?":   # need to remove punctuation from list of words
        c = ""
    stripped_text += c
testdoc = stripped_text

words = testdoc.split(' ')

for word in words:
    word = word.lower()
    if word in fullwords:           # test to see if word in full word list
        if word not in wordlist:
            wordlist.append(word)
    else:
        if word not in nonwordlist:
            nonwordlist.append(word)

print nonwordlist
word = ""
print "Checking Alternate Word List . . ."  # check dictionary for alternate words and update
for word in nonwordlist:
    if word in contrwords:
        wordlist.append(word)
        nonwordlist.remove(word)
        print "Added ", word, " to the word list."
    else:
        print "Failed to find ", word
        nonwordlist.remove(word)
        failedwords.append(word)

print ""
wordlist.sort()
print "Word list: ", wordlist
print ""
nonwordlist.sort()
print "Non word list: ", nonwordlist
print ""
failedwords.sort()
print "Failed Words: ",  failedwords
print ""

And here are the results:

python indextext.py
['beleive', '6000', "doesn't", 'happend', '137', "didn't", "can't"]
Checking Alternate Word List . . .
Failed to find beleive
Added doesn't to the word list.
Failed to find 137
Added can't to the word list.

Word list: ['accept', 'actually', 'again', 'ago', 'all', 'allow', 'already', 'always', 'and', 'another', 'apply', 'are', 'around', 'author', 'bang', 'be', 'believe', 'bible', 'big', 'billion', 'but', "can't", 'cannot', 'cause', 'caused', 'could', 'creation', 'description', 'do', 'does', "doesn't", 'exist', 'explained', 'for', 'genesis', 'god', 'have', 'he', 'here', 'i', 'if', 'in', 'incorrect', 'is', 'it', 'its', 'laws', 'likely', 'literally', 'malarky', 'mean', 'meant', 'most', 'no', 'not', 'of', 'okay', 'old', 'only', 'period', 'physics', 'put', 'responsible', 'say', 'simply', 'so', 'taken', 'that', 'the', 'then', 'there', 'think', 'this', 'those', 'thought', 'to', 'true', 'universe', 'way', 'what', 'which', 'wrong', 'years', 'you']

Non word list: ['6000', "didn't", 'happend']

Failed Words: ['137', 'beleive']

The problem is that it is not finding "didn't" even though it is on the list, and it doesn't appear to be checking "didn't", '6000' or 'happend' at all when it checks the non-word list against the Encontract.txt file.

The Enwords.txt and Encontract.txt files are plain text files with one word per line, used for checking for valid words. I have verified that the expected contractions (can't, didn't, doesn't) are present, but I can't tell what is happening.

Any help would be appreciated...

It finds "didn't" in the following test. Also, you do not strip the newline character(s) after reading the files, and you use "in" on the raw file contents instead of creating a dictionary or set of individual words and comparing word with word. As a result, the word "and" will be found (i.e. is "in") when the contents include the word "sandy".

#fullwords = open("Enwords.txt").read()
#contrwords = open("Encontract.txt").read()
fullwords=['accept', 'actually', 'again', 'ago', 'all', 'allow', 'already', 'always', 'and', 'another', 'apply', 'are', 'around', 'author', 'bang', 'be', 'believe', 'bible', 'big', 'billion', 'but', "can't", 'cannot', 'cause', 'caused', 'could', 'creation', 'description', 'do', 'does', "doesn't", 'exist', 'explained', 'for', 'genesis', 'god', 'have', 'he', 'here', 'i', 'if', 'in', 'incorrect', 'is', 'it', 'its', 'laws', 'likely', 'literally', 'malarky', 'mean', 'meant', 'most', 'no', 'not', 'of', 'okay', 'old', 'only', 'period', 'physics', 'put', 'responsible', 'say', 'simply', 'so', 'taken', 'that', 'the', 'then', 'there', 'think', 'this', 'those', 'thought', 'to', 'true', 'universe', 'way', 'what', 'which', 'wrong', 'years', 'you']
contrwords=["can't", "didn't", "doesn't"]
wordlist = []
nonwordlist = []
failedwords = []
string4 = "I have already explained this I thought. Okay here it is again. I beleive that the laws of physics apply to the universe and always have. Those laws do not allow the universe to be only 6000 years old. So the only way in which God could be responsible for the creation of the universe is if the bible is incorrect in its description of creation. That doesn't mean there is no God, but only that if he does exist, he is not the author of the bible if he meant to have Genesis taken literally. This is what I believe. There is no way around it. If what you say is actually true, then all the laws of physics are malarky. I cannot accept that. The big bang happend 13.7 billion years ago. Period, So if God caused it, Genesis is simply wrong. If he didn't cause it, then there most likely is no God. I can't think of another way to put it."
testdoc = string4

stripped_text = ""
for c in testdoc:
    if c in "!@#$%^&*().[],{}<>?":   # need to remove punctuation from list of words
        c = ""
    stripped_text += c
testdoc = stripped_text

words = testdoc.split(' ')

for word in words:
    word = word.lower()
    if word in fullwords:           # test to see if word in full word list
        if word not in wordlist:
            wordlist.append(word)
    else:
        if word not in nonwordlist:
            nonwordlist.append(word)

print nonwordlist
word = ""
print "Checking Alternate Word List . . ."  # check dictionary for alternate words and update
for word in nonwordlist:
    if word in contrwords:
        wordlist.append(word)
        nonwordlist.remove(word)
        print "Added ", word, " to the word list."
    else:
        print "Failed to find ", word
        nonwordlist.remove(word)
        failedwords.append(word)

print ""
wordlist.sort()
print "Word list: ", wordlist
print ""
nonwordlist.sort()
print "Non word list: ", nonwordlist
print ""
failedwords.sort()
print "Failed Words: ",  failedwords
print

""" results --------------------
['beleive', '6000', 'happend', '137', "didn't"]
Checking Alternate Word List . . .
Failed to find  beleive
Failed to find  happend
Added  didn't  to the word list.

Word list:  ['accept', 'actually', 'again', 'ago', 'all', 'allow', 'already', 'always', 'and', 'another', 'apply', 'are', 'around', 'author', 'bang', 'be', 'believe', 'bible', 'big', 'billion', 'but', "can't", 'cannot', 'cause', 'caused', 'could', 'creation', 'description', "didn't", 'do', 'does', "doesn't", 'exist', 'explained', 'for', 'genesis', 'god', 'have', 'he', 'here', 'i', 'if', 'in', 'incorrect', 'is', 'it', 'its', 'laws', 'likely', 'literally', 'malarky', 'mean', 'meant', 'most', 'no', 'not', 'of', 'okay', 'old', 'only', 'period', 'physics', 'put', 'responsible', 'say', 'simply', 'so', 'taken', 'that', 'the', 'then', 'there', 'think', 'this', 'those', 'thought', 'to', 'true', 'universe', 'way', 'what', 'which', 'wrong', 'years', 'you']

Non word list:  ['137', '6000']

Failed Words:  ['beleive', 'happend']
"""

Edited 3 Years Ago by woooee

woooee, thank you for your response. Yes, when I test it using your code it works. However, when I switch back to using the opened files (Enwords.txt and Encontract.txt) it misses "didn't" again.

I have attached a .zip file with the two files I use, both text files so that you can test the same way that I do.

Results:

python indextext_test.py
['beleive', '6000', "doesn't", 'happend', '137', "didn't", "can't"]
Checking Alternate Word List . . .
Failed to find beleive
Added doesn't to the word list.
Failed to find 137
Added can't to the word list.

Word list: ['accept', 'actually', 'again', 'ago', 'all', 'allow', 'already', 'always', 'and', 'another', 'apply', 'are', 'around', 'author', 'bang', 'be', 'believe', 'bible', 'big', 'billion', 'but', "can't", 'cannot', 'cause', 'caused', 'could', 'creation', 'description', 'do', 'does', "doesn't", 'exist', 'explained', 'for', 'genesis', 'god', 'have', 'he', 'here', 'i', 'if', 'in', 'incorrect', 'is', 'it', 'its', 'laws', 'likely', 'literally', 'malarky', 'mean', 'meant', 'most', 'no', 'not', 'of', 'okay', 'old', 'only', 'period', 'physics', 'put', 'responsible', 'say', 'simply', 'so', 'taken', 'that', 'the', 'then', 'there', 'think', 'this', 'those', 'thought', 'to', 'true', 'universe', 'way', 'what', 'which', 'wrong', 'years', 'you']

Non word list: ['6000', "didn't", 'happend']

Failed Words: ['137', 'beleive']

Also note that, in theory, there should be no entries left in the Non word list; every word should end up either in the Word list or in the Failed Words list. That is my struggle: depending on the string that I am searching, different 'words' fail to be processed at all.

Reading the file "Encontract.txt" with read() gives you one long string, not a list.
Use something like this ...

with open("Encontract.txt") as fin:
    contrwords_raw = fin.read()
    # make a list, also removes new line char
    contrwords = [word for word in contrwords_raw.split()]

Edited 3 Years Ago by vegaseat

As stated above, I would compare word to word instead of how you are doing it. Since the problem is obviously in the file, add "didn't" to the file or to the result of the read, and if it finds it this time then you know for sure that the file is the problem.
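To make the word-to-word point concrete, here is a minimal sketch. The string file_text is a made-up stand-in for what open("Enwords.txt").read() returns, not the real file contents, and the code is written so it runs on Python 3 as well.

```python
# Hypothetical stand-in for the text returned by open("Enwords.txt").read()
file_text = "sandy\nbelieve\ncannot\n"

# On a string, "in" is a substring test, so fragments of words match.
print("and" in file_text)      # True: "and" occurs inside "sandy"

# Splitting on whitespace gives whole words and drops the newlines;
# a set then tests membership word by word.
word_set = set(file_text.split())
print("and" in word_set)       # False: "and" is not a word in the data
print("sandy" in word_set)     # True
```

With a set, a word only matches when the whole word is present, and lookups stay fast even for a large dictionary file.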

Thank you vegaseat and woooee. I have updated the code to use vegaseat's snippet for contrwords, and when I print contrwords it displays a list that includes "didn't". I inserted vegaseat's code before line 11 above. However, the loop is still not picking up all of the non-words ['beleive', '6000', "doesn't", 'happend', '137', "didn't", "can't"] when checking them against contrwords.

Checking Alternate Word List . . .
Failed to find beleive
Added doesn't to the word list.
Failed to find 137
Added can't to the word list.

It never shows a check for '6000', 'happend', or "didn't". This is supported by the Non word list after the run, ['6000', "didn't", 'happend'], and the Failed Words list after the run, ['137', 'beleive'].

You are removing elements from nonwordlist inside the loop that iterates over it; that's why it blows up.
A good rule is: never delete something from, or add something to, the list you are iterating over.

You can solve this in different ways; the easiest for you now is probably to loop over a copy of the list.
for word in nonwordlist[:]:
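
To see why words get skipped, here is a small sketch that reproduces the pattern with the non-word list from the question. The two function names are invented for this illustration; the code also runs on Python 3.

```python
def drain_while_iterating(items):
    """Remove each item inside the loop over the same list (the buggy pattern)."""
    visited = []
    for item in items:
        visited.append(item)
        items.remove(item)   # later items shift left, so the next one is skipped
    return visited


def drain_over_copy(items):
    """Loop over a slice copy, so removals do not disturb the iteration."""
    visited = []
    for item in items[:]:
        visited.append(item)
        items.remove(item)
    return visited


nonwords = ['beleive', '6000', "doesn't", 'happend', '137', "didn't", "can't"]

leftover = list(nonwords)
print(drain_while_iterating(leftover))  # ['beleive', "doesn't", '137', "can't"]
print(leftover)                         # ['6000', 'happend', "didn't"]

leftover = list(nonwords)
print(drain_over_copy(leftover))        # all seven words are visited
print(leftover)                         # []
```

The first pair of results matches the question's output: only every other word is checked, and the skipped ones stay behind in the list.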

A couple more tips.
Remove punctuation:

from string import punctuation

text = 'My!.= @car%&'
print ''.join(c for c in text if c not in punctuation) #My car

Read from the file into a list, a little shorter than vegaseat's code:

with open("Encontract.txt") as fin:
    contrwords = [i.strip() for i in fin]

Edited 3 Years Ago by snippsat

Thank you snippsat for your code. However, I don't use punctuation because it breaks any contractions in the text/word list. I want to recognize any known contractions in the original text instead of losing them entirely. Using the code I found elsewhere allows me to complete this phase of the project.

However, I don't use punctuation because it breaks any contractions in the text/word list

No problem, just list the characters to remove explicitly in your code; then it will keep contractions.

text = "My!. @car%& can't be good"
print ''.join(c for c in text if c not in "!@#$%^&*().[],{}<>?") #My car can't be good
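
The difference between the two approaches comes down to the apostrophe, which is part of string.punctuation but not of the explicit character list. A short sketch (the variable names are only for illustration, and the code also runs on Python 3):

```python
from string import punctuation

text = "My!. @car%& can't be good"
custom = "!@#$%^&*().[],{}<>?"   # explicit list, no apostrophe in it

print("'" in punctuation)        # True: the apostrophe counts as punctuation
print("'" in custom)             # False

strip_all = ''.join(c for c in text if c not in punctuation)
strip_custom = ''.join(c for c in text if c not in custom)
print(strip_all)                 # My car cant be good
print(strip_custom)              # My car can't be good
```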

Edited 3 Years Ago by snippsat

snippsat, the contractions are in text files that will be pulled in and scanned, with proper and improper words selected and output. I was able to get your first snippet to work, but all punctuation was still being stripped. How do you want me to use that snippet? My current code is written as:

stripped_text = ""              # 2. remove punctuation from list, make list of words
for c in testdoc:
    if c in "!@#$%^&*().[],{}<>?":
        c = ""
    stripped_text += c

Where testdoc is the text file I am searching. I am not importing string or punctuation. You show

from string import punctuation

Then your code:

print ''.join(c for c in text if c not in "!@#$%^&*().[],{}<>?") #My car can't be good

That should print the punctuation-free version of text, but I am at a loss for how to do that in my code :)
It appears to try to run, but then it prints the text file followed by this:

Checking Alternate Word List . . .
Failed to find

Word list: []

Non word list: ['']

I still believe this is solved since you helped me, but now I'm curious how your solution would work.

Edited 3 Years Ago by Greyhelm
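
One way the join snippet could be wired into the rest of the code is sketched below. This is only an illustration under some assumptions: the names STRIP_CHARS, load_words and classify are invented here, the two small sets stand in for the real Enwords.txt and Encontract.txt contents, and the code is written to run on Python 3 as well. The key step is assigning the joined string to a variable instead of printing it. As a side note, "".split(' ') returns [''], so an empty text would produce a Non word list of [''] like the one shown above.

```python
STRIP_CHARS = "!@#$%^&*().[],{}<>?"   # same characters as the original loop


def load_words(path):
    """Read a one-word-per-line file into a set of lowercase words."""
    with open(path) as fin:
        return set(line.strip().lower() for line in fin if line.strip())


def classify(text, fullwords, contrwords):
    """Return (wordlist, failedwords) as sorted lists of unique words."""
    # Keep the result of the join; printing it would not change the text.
    cleaned = ''.join(c for c in text if c not in STRIP_CHARS)
    found, failed = set(), set()
    # split() with no argument never yields empty strings
    for word in cleaned.lower().split():
        if word in fullwords or word in contrwords:
            found.add(word)
        else:
            failed.add(word)
    return sorted(found), sorted(failed)


# Made-up stand-ins for load_words("Enwords.txt") and load_words("Encontract.txt")
fullwords = set(["i", "think", "of", "it"])
contrwords = set(["can't", "didn't", "doesn't"])

wordlist, failedwords = classify("I can't think of it. He didn't beleive.",
                                 fullwords, contrwords)
print(wordlist)      # ["can't", "didn't", 'i', 'it', 'of', 'think']
print(failedwords)   # ['beleive', 'he']
```

Because every word goes straight into one of two sets, nothing is removed from a list while it is being looped over, and no word can be left unprocessed.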
