python - Match Sentences with Missing Words [SOLVED] | DaniWeb

DevourOfDarknes 0 Newbie Poster

13 Years Ago

Hi,
This is just part of a bigger program I'm writing, but at the moment I don't know how to do what I'm about to say.
So first, the user enters a sentence. Now, I have a text file with sentences to match it to. So let's say the first line in the text file is:
i am <state> | Do you often feel <state>?

So what I need to do is check whether the input matches the part before the | and if so, print the part after the |. The <state> is represented by any word. So if the user inputs 'i am hungry', then <state> would be hungry, and the output would be 'Do you often feel hungry'. The problem is I don't know how to check the input like that when I don't know one of the words. I can't remove the <state> and use indexing to get the next word because:
Another rule may not have the missing word at the start or end eg.
can you <w1> <w2> <w3> something | Why should I <w1> <w2> <w3> something?
The input might be 'can you help me with something' and the output would then be 'Why should I help you with something?'

And one requirement of the program is that I need to store the word(s) in question for later use, I'm assuming as a dictionary. So taking the above into account, the dictionary would be {'<state>': 'hungry', '<w1>': 'help', '<w2>': 'me', '<w3>': 'with'}. This would assist in printing the output, as I could just look up any word in <> and, once the user has said it, the definition would be the word I need.

So how could I match sentences when one or more of the words are not known?


All 35 Replies

Gribouillis 1,391 Programming Explorer

13 Years Ago

The simplest tool is regular expressions (the re module). For example

>>> import re
>>> state_re = re.compile(r"i am (?P<state>\w+)")
>>> match = state_re.match("i am hungry")
>>> print match.group("state")
hungry

This has serious limitations and you may need a real parser for your grammar rules. With some learning effort, you could use the "natural language toolkit" (the module nltk), which contains tools to parse English sentences. See http://www.nltk.org/


DevourOfDarknes 0 Newbie Poster

13 Years Ago

Thanks that works. So if the first line was: i am <something> How can I change the <something> to (?P<something>\w+)? In other words, changing it so I can use the re.compile function?

And, how can I only make it so it performs the print when the match is true? I tried

if state_re.match(input_words) ==True:
    print match.group("state")

but it doesn't print anything regardless...


Gribouillis 1,391 Programming Explorer

13 Years Ago

Thanks that works. So if the first line was: i am <something> How can I change the <something> to (?P<something>\w+)? In other words, changing it so I can use the re.compile function?
And, how can I only make it so it performs the print when the match is true? I tried
if state_re.match(input_words) ==True:
    print match.group("state")
but it doesn't print anything regardless...

I think you will have to write the regular expressions by hand: parsing templates to produce valid regular expressions is another problem. The first thing to do is to read the re module documentation (also have a look in python tutorials).
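The match() method returns a "match object" when the sentence matches and None when it doesn't, so the correct test is: if match is not None: ... . Comparing the result with == True never succeeds, because a match object is not equal to True.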
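
To make the difference concrete, here is a minimal sketch of the test. It is written for Python 3 (the thread's own code is Python 2), and input_words is simply the variable name used in the post above.

```python
import re

state_re = re.compile(r"i am (?P<state>\w+)")
input_words = "i am hungry"

match = state_re.match(input_words)

# A match object is truthy, but it is never equal to True,
# which is why the "== True" version prints nothing.
compared_to_true = (match == True)

# Test against None (or simply "if match:") instead.
if match is not None:
    state = match.group("state")
    print(state)
```

Note that the group is read from the stored match variable, so the result of match() has to be assigned before the test.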

DevourOfDarknes 0 Newbie Poster

13 Years Ago

Well that's just it. I have to use a template because I can't predict the sentence I am going to match. And so my code has to be generic; to work for anything. The user is supposed to be able to add their own rules to the text file for me to match with the input.

Gribouillis 1,391 Programming Explorer

13 Years Ago

Well that's just it. I have to use a template because I can't predict the sentence I am going to match. And so my code has to be generic; to work for anything. The user is supposed to be able to add their own rules to the text file for me to match with the input.

Then you can try simple regex substitution like this

>>> import re
>>> template_re = re.compile(r"<\w+>")
>>> def template_helper(match):
...  return "(?P{0}\w+)".format(match.group(0))
... 
>>> template = "i am <state>"
>>> regex = template_re.sub(template_helper, template)
>>> print repr(regex)
'i am (?P<state>\\w+)'
>>> regex = re.compile(regex)
>>> match = regex.match("i am hungry")
>>> match.group("state")
'hungry'

This could be a starting point.

DevourOfDarknes 0 Newbie Poster

13 Years Ago

Thanks :) I THINK that's done what I am trying to do, but I need to continue writing the program to make sure.

EDIT: What about when there are multiple <> placeholder things? Like: you <w1> <w2> me


DevourOfDarknes 0 Newbie Poster

13 Years Ago

And, it doesn't work when say, the input is 'i suppose i am hungry' It only matches the exact case of 'i am hungry'. So how could I, in clearer terms, make it change from: input = rule, to: input in rule

Gribouillis 1,391 Programming Explorer

13 Years Ago

And, it doesn't work when say, the input is 'i suppose i am hungry' It only matches the exact case of 'i am hungry'. So how could I, in clearer terms, make it change from: input = rule, to: input in rule

Use match = myregex.search(sentence) instead of match = myregex.match(sentence). Regular expressions take some time to get used to: don't hesitate to explore the re module and the examples in tutorials. What about 'i am Nostradamus'?
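
A short sketch of the difference, in Python 3: match() only tries the pattern at the start of the string, while search() scans the whole string for the first position where it fits.

```python
import re

regex = re.compile(r"i am (?P<state>\w+)")
sentence = "i suppose i am hungry"

anchored = regex.match(sentence)   # only tries at position 0
anywhere = regex.search(sentence)  # scans the whole sentence

print(anchored)
print(anywhere.group("state"))
```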


DevourOfDarknes 0 Newbie Poster

13 Years Ago

Thanks that's done it. And what about with multiple missing words?


Gribouillis 1,391 Programming Explorer

13 Years Ago

Thanks that's done it. And what about with multiple missing words?

well

>>> template = "can you <w1> <w2> <w3> something"
>>> regex = template_re.sub(template_helper, template)
>>> print regex
can you (?P<w1>\w+) (?P<w2>\w+) (?P<w3>\w+) something
>>> regex = re.compile(regex)
>>> match = regex.match("can you help me with something")
>>> match.group("w2")
'me'
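
Since the original question also asks for the captured words in a dictionary, here is one possible sketch in Python 3. It relies on match.groupdict(), which returns all named groups at once. The helper names template_to_regex and fill_response are made up for this illustration and are not part of the code above.

```python
import re

template_re = re.compile(r"<\w+>")

def template_to_regex(template):
    # Turn each <name> placeholder into a named group (?P<name>\w+).
    return re.compile(
        template_re.sub(lambda m: r"(?P%s\w+)" % m.group(0), template)
    )

def fill_response(response, words):
    # Replace each <name> in the response with the captured word;
    # placeholders with no captured word are left untouched.
    return template_re.sub(
        lambda m: words.get(m.group(0)[1:-1], m.group(0)), response
    )

rule = "can you <w1> <w2> <w3> something | Why should I <w1> <w2> <w3> something?"
pattern, response = [part.strip() for part in rule.split("|", 1)]

match = template_to_regex(pattern).search("can you help me with something")
words = match.groupdict()
reply = fill_response(response, words)
print(words)
print(reply)
```

The dictionary keys here are the bare names (w1, w2, w3) rather than the bracketed form, because that is what groupdict() returns.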

DevourOfDarknes 0 Newbie Poster

13 Years Ago

Actually that might work because I only have to take into account up to 4 missing words, not an infinite amount like I was thinking.

Gribouillis 1,391 Programming Explorer

13 Years Ago

You can add some features, for example allow arbitrary white space

spaces = re.compile(r"\s+")
template = "can you <w1> <w2> <w3> something"
template = spaces.sub(r"\s+", template)
# the rest as before

this allows the user to enter

can   you help         me with     something
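
A note for anyone trying this on a recent Python 3: since Python 3.7, re.sub() rejects an unknown escape such as \s in the replacement string with a "bad escape" error, so the substitution above needs either a doubled backslash or a function as the replacement. A small sketch under that assumption:

```python
import re

spaces = re.compile(r"\s+")
template = "can you <w1> <w2> <w3> something"

# The return value of a replacement function is inserted literally,
# so no escape processing is applied to the backslash.
flexible = spaces.sub(lambda m: r"\s+", template)
print(flexible)

# Same placeholder conversion as before, applied to the flexible template.
regex = re.compile(
    re.sub(r"<(\w+)>", lambda m: r"(?P<%s>\w+)" % m.group(1), flexible)
)
match = regex.match("can   you help         me with     something")
print(match.group("w1"), match.group("w2"), match.group("w3"))
```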


DevourOfDarknes 0 Newbie Poster

13 Years Ago

I had a problem but I fixed it so dw about this post ;)


Gribouillis 1,391 Programming Explorer

13 Years Ago

I had a problem but I fixed it so dw about this post ;)

Notice that you can extract the words while substituting in the template

from functools import partial
from pprint import pprint
import re
spaces_re = re.compile(r"\s+")
template_re = re.compile(r"<\w+>")

def template_helper(words, match):
    word = match.group(0)
    words.append(word[1:-1])
    return "(?P{0}\w+)".format(match.group(0))

def compile_template(template):
    template = spaces_re.sub(r"\s+", template)
    words = list()
    regex = template_re.sub(partial(template_helper, words), template)
    regex = re.compile(regex)
    return regex, words

everything = list()

with open('rules.txt', 'rU') as rules:
    for line in rules:
        cutstart, cutend = line.split("|", 1)
        cutstart, cutend = cutstart.rstrip(), cutend.lstrip()
        regex, words = compile_template(cutstart)
        everything.append((cutstart, cutend, words, regex))

pprint(everything)

""" my output -->
[('i am <state>',
  'How long have you been <state>?\n',
  ['state'],
  <_sre.SRE_Pattern object at 0x7fb012eadcf0>),
 ('you <w1> me',
  'What makes you think I <w1> you?\n',
  ['w1'],
  <_sre.SRE_Pattern object at 0x7fb012eb55e0>),
 ('you <w1> <w2> me',
  'What makes you think I <w1> <w2> you?\n',
  ['w1', 'w2'],
  <_sre.SRE_Pattern object at 0x7fb012eb7730>)]
"""

rules.txt (0.15 KB)

i am <state> | How long have you been <state>?
you <w1> me | What makes you think I <w1> you?
you <w1> <w2> me | What makes you think I <w1> <w2> you?


DevourOfDarknes 0 Newbie Poster

13 Years Ago

Ok I've found a problem. When matching the input to the rule, the input needs to be in lower case, so I would just use .lower(). However, when the variable is used in the response, the letter case of the variable must be kept intact. How can I do this?? :/


Gribouillis 1,391 Programming Explorer

13 Years Ago

Ok I've found a problem. When matching the input to the rule, the input needs to be in lower case, so I would just use .lower(). However, when the variable is used in the response, the letter case of the variable must be kept intact. How can I do this?? :/

You could use x, y = match.span("w2") to get the position of the word w2 in the matched string, then retrieve the initial word with word = initial_input[x:y]


DevourOfDarknes 0 Newbie Poster

13 Years Ago

Hmmm, that works but when the input is eg. 'my name is James.' the output (for a custom rule) should be 'Hi there James!'. Instead, I get 'Hi there James.!' I'm guessing I should strip the ends of punctuation... doing now...

EDIT: Yes that seems to work...
EDIT2: Nope it's stuffed it up :(


Gribouillis 1,391 Programming Explorer

13 Years Ago

Wait, so here's the part of my code where I add the variable to my dictionary.
if match is not None:
                for item in cutstart.split():
                    if item.startswith("<") and item.endswith(">"):
                        worda = item.strip("<>")
                        word1 = match.group(worda)
                        dict[item] = word1
Whereabouts would I put the match.span code?

Well, I understood that you convert input to lowercase, so your code should look like

lower_input = initial_input.lower()
match = regex.search(lower_input)
if match is not None:
    x, y = match.span(worda)
    word1 = initial_input[x:y]

Otherwise it would be better to extract the group names with regex while compiling the template as I wrote above (instead of splitting cutstart, ...)
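
A different technique from the span() approach, offered only as an alternative sketch: if the rule templates are written in lower case, compiling them with the re.IGNORECASE flag lets the original input be matched directly, so the captured group already has its capitalisation intact. The "my name is" rule below is just an example rule. Python 3:

```python
import re

regex = re.compile(r"my name is (?P<name>\w+)", re.IGNORECASE)
initial_input = "My name is James"

match = regex.search(initial_input)
if match is not None:
    # The group comes straight from the original input, case preserved.
    name = match.group("name")
    print("Hi there %s!" % name)
```

The trade-off is that nothing is lowercased at all, so any later code that expects lowercase words has to call .lower() itself.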


Gribouillis 1,391 Programming Explorer

13 Years Ago

Hmmm, that works but when the input is eg. 'my name is James.' the output (for a custom rule) should be 'Hi there James!'. Instead, I get 'Hi there James.!' I'm guessing I should strip the ends of punctuation... doing now...
EDIT: Yes that seems to work...

To include punctuation in the cutstart part, use cutstart = cutstart.replace(".",r"\.")
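
A more general variant of the same idea, as a sketch: split the template on its placeholders and pass every literal piece through re.escape(), so that any punctuation in a rule is matched literally. The function name compile_template is reused here for illustration only and is not the version posted earlier in the thread. Python 3:

```python
import re

def compile_template(template):
    # re.split with a capturing group keeps the placeholders in the result.
    parts = re.split(r"(<\w+>)", template)
    pieces = []
    for part in parts:
        if re.fullmatch(r"<\w+>", part):
            pieces.append(r"(?P%s\w+)" % part)   # placeholder -> named group
        else:
            pieces.append(re.escape(part))       # literal text, escaped
    return re.compile("".join(pieces))

regex = compile_template("my name is <name>.")
match = regex.search("my name is james.")
print(match.group("name"))

# Without the final dot the rule no longer matches.
no_dot = regex.search("my name is james")
print(no_dot)
```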

DevourOfDarknes 0 Newbie Poster

13 Years Ago

Well, I understood that you convert input to lowercase, so your code should look like

Thanks that solved the problem I was getting.

DevourOfDarknes 0 Newbie Poster

13 Years Ago

Ok, I have 2 problems now. The first one is when I input,
'Do you like Python, like me?'
The output is 'What makes you think I like Python lik you?'
Note that though you can't see it, there are 2 spaces after 'Python'
When it should be 'What makes you think I like Python like you?'

In fact, when the input is
'Do you like Python like me?'
The output is 'What makes you think I like Python like you?'

So it's obviously a problem with the comma.

I'll talk about the second when this one gets solved.

Part of my code is like this:

lower_input = input_words.lower()
    for letter in string.punctuation:
        lower_input = lower_input.replace(letter, "")
template = cutstart
        if "<" in cutstart and ">" in cutstart:
            regex = template_re.sub(template_helper, template)
            regex = re.compile(regex)
            match = regex.search(lower_input)
            
            
            if match is not None:
                for item in cutstart.split():
                    if item.startswith("<") and item.endswith(">"):
                        worda = item.strip("<>")
                        x, y = match.span(worda)
                        word1 = (input_words[x:y]).strip(string.punctuation)
                        dict[item] = word1
                cutend = multiwordReplace(cutend, dict)
                if "<" in cutend:
                    continue
                else:    
                    input_words = raw_input(cutend + "> ")


TrustyTony 888 ex-Moderator

13 Years Ago

First of all, I would not use dict as a variable name, as it hides the built-in type dict.
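
A tiny sketch of what goes wrong once the name is reused (Python 3; shadow_demo is a throwaway function for the demonstration):

```python
def shadow_demo():
    dict = {}            # the local name now hides the built-in type
    try:
        dict(a=1)        # this calls the empty dictionary, not the type
    except TypeError as exc:
        return str(exc)

message = shadow_demo()
print(message)

words = {}               # a different name leaves dict() usable
merged = dict(words, state="hungry")
print(merged)
```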

Gribouillis 1,391 Programming Explorer

13 Years Ago

Also why don't you post the whole code ? It would make things easier.

DevourOfDarknes 0 Newbie Poster

13 Years Ago

import re
import string
template_re = re.compile(r"<\w+>")
def template_helper(match):
    return "(?P{0}\w+)".format(match.group(0))

def multiwordReplace(cutend, dict):
    for key in dict:
        cutend = cutend.replace(key, dict[key])
    return cutend

dict = {}
input_words = raw_input("What would you like to talk about?\n" + "> ")
while input_words !="":
    lower_input = input_words.lower()
    for letter in string.punctuation:
        lower_input = lower_input.replace(letter, "")
    rules = open("rules.txt", "rU")
    go_on = True
    for line in rules:
        cutstart = line[:(line.index("|"))].rstrip()
        cutend = line[((line.index("|"))+1):].lstrip()
        
        template = cutstart
        if "<" in cutstart and ">" in cutstart:
            regex = template_re.sub(template_helper, template)
            regex = re.compile(regex)
            match = regex.search(lower_input)
            
            
            if match is not None:
                for item in cutstart.split():
                    if item.startswith("<") and item.endswith(">"):
                        worda = item.strip("<>")
                        x, y = match.span(worda)
                        word1 = (input_words[x:y]).strip(string.punctuation)
                        dict[item] = word1
                cutend = multiwordReplace(cutend, dict)
                if "<" in cutend:
                    continue
                else:    
                    input_words = raw_input(cutend + "> ")

                
                go_on = False
                
            if match is None:
                continue
        else:

            if cutstart in lower_input:
                go_on = False
                if "<" in cutend:
                    continue
                else:    
                    input_words = raw_input(cutend + "\n> ")
            else:
                continue
    if go_on:
        input_words = raw_input("Please go on.\n> ")

Rules:

i feel | How often do you feel that way?
i am <state> | How long have you been <state>?
you <w1> me | What makes you think I <w1> you?
you <w1> <w2> me | What makes you think I <w1> <w2> you?
you <w1> <w2> <w3> me | What makes you think I <w1> <w2> <w3> you?
you <w1> <w2> <w3> <w4> me | What makes you think I <w1> <w2> <w3> <w4> you?
go away | I hope I have helped you!

DevourOfDarknes 0 Newbie Poster

13 Years Ago

Ok, so I changed the code around a bit and now this happens:
When I attempt to match the input to a rule, if some form of punctuation is part of the variable eg. 'don't' then it will refuse to match it. If I try:
Do you? like Dont, like me?
It works fine, returning:
What makes you think I like Dont like you?
However if I try:
Do you? like Don't, like me?
It fails to match the sentence, instead printing me the output I have set for when the sentence matches no rule.

It is as if the <state> variables do not allow for inside punctuation or something...

So, how can I allow it to still match the sentence to a rule successfully, when there is some form of punctuation in the word?

Note that I cannot just remove the punctuation entirely from the input because then when I specify the x, y variables, one sentence will have a different number of characters to the other, and so indexing like so will not work properly (it will produce the output I mentioned 2 posts ago)

TrustyTony 888 ex-Moderator

13 Years Ago

You might want to look at the therapist code at http://www.daniweb.com/software-development/python/threads/296088/1274431#post1274431, even though it is written in a somewhat old programming style.


Gribouillis 1,391 Programming Explorer

13 Years Ago

If you want to allow "don't", you can replace \w+ with (?:\w|')+ in template_helper(). Experiment with regular expressions; you could install kodos for this, see http://kodos.sourceforge.net/. I think you're asking too much too fast from your program. Think about your program structure and algorithm. Think about a robust way to handle punctuation.
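
As a sketch of a pattern that accepts an apostrophe inside a word, a character class such as [\w']+ does the job: it accepts letters, digits, underscores and apostrophes. Python 3, reusing the names template_re and template_helper from earlier in the thread:

```python
import re

template_re = re.compile(r"<\w+>")

def template_helper(match):
    # [\w']+ accepts word characters and apostrophes, e.g. "don't".
    return r"(?P%s[\w']+)" % match.group(0)

regex = re.compile(template_re.sub(template_helper, "you <w1> <w2> me"))
match = regex.search("you don't like me")
print(match.group("w1"), match.group("w2"))
```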

DevourOfDarknes 0 Newbie Poster

13 Years Ago

Thanks! That worked. Now, is there a way I could allow for ALL middle punctuation eg. commas?

EDIT: I did it manually

return "(?P{0}(?:\w|[!#$%&'()*+,-./:;<=>?@[\]^_`|~])+)".format(match.group(0))

But for some reason when I put {} in it always gives me an error. Well, hopefully those won't be needed.
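
The error with {} most likely comes from str.format(), which treats braces as replacement fields; literal braces have to be doubled as {{ and }}. One way to sidestep both the hand-typed list and the brace problem, offered as a sketch rather than a definitive fix, is to build the character class from string.punctuation with re.escape() and use plain concatenation. PUNCT_CLASS is a name invented for this example. Python 3:

```python
import re
import string

template_re = re.compile(r"<\w+>")

# Every punctuation character, escaped so it is safe inside [...].
PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"

def template_helper(match):
    # Concatenation instead of format(), so { and } need no special care.
    return "(?P" + match.group(0) + r"(?:\w|" + PUNCT_CLASS + ")+)"

regex = re.compile(template_re.sub(template_helper, "i am <state>"))
match = regex.search("i am so-{tired}")
state = match.group("state")
print(state)

# If format() is kept, literal braces must be doubled:
doubled = "(?P{0}[\\w{{}}]+)".format("<state>")
print(doubled)
```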


DevourOfDarknes 0 Newbie Poster

13 Years Ago

Ok I am so close to finishing! But I am suddenly getting an error I wasn't getting before. I figured I did something to the code that stopped it working, so I used ctrl+z to undo everything I did today (because yesterday it was working). Strangely enough, I still get the error.

What happens is my program is looped to accept input right? Well, the first input the user does works perfectly. However, the next time the user inputs something, I always get this error:

Traceback (most recent call last):
  File "C:\Python26\Projects\Week 5\Question 5", line 28, in <module>
    cutstart = line[:(line.index("|"))].rstrip()
ValueError: substring not found

I don't really get how it can't find it the second time if it can indeed find it the first time. And also why I'm getting this error with the code from yesterday (I believe).

Anyway, I am pretty desperate atm and would absolutely LOVE help in getting rid of this error.

Here is my code: (Lots of comments soz)

import re
import string
template_re = re.compile(r"<\w+>")
def template_helper(match):
    return "(?P{0}(?:\w|[!#$%&'()*+,-./:;<=>?@[\]^_`|~])+)".format(match.group(0))

def multiwordReplace(cutend, dicto):
    for key in dicto:
        cutend = cutend.replace(key, dicto[key])
    return cutend

dicto = {}
input_words = raw_input("Hello, my name is Eliza. What would you like to talk about?\n> ")
while input_words !="":
    input_a = []
    dicti = {}
    for item in input_words.split():
        item = item.strip(string.punctuation)
        input_a.append(item)
    input_b = ' '.join(input_a)
    lower_input = input_b.lower()
#Input_b is the input with punctuation removed from sides of words.
#Lower_input  is input_b with all letters lowercase    
    rules = open("rules.txt", "rU")
    go_on = True    
#When go_on is True, unless it is changed to False by a match, it will print Please Go On    
    for line in rules:
        cutstart = line[:(line.index("|"))].rstrip()
#The line which will be used to match the input to the rule
        cutend = line[((line.index("|"))+1):].lstrip()
#The output line associated with that rule        
        template = cutstart
#The rule line used for matching  
        regex = template_re.sub(template_helper, template)
        regex = re.compile(regex)
        match = regex.search(lower_input)        
#Match with input            
#If the match is successful        
        if match is not None:            
#Make backup copies of the dictionary and rule            
            cutend1 = cutend
            dicti = dicto.copy()
            for item in cutstart.split():                
#Scan for the words that are variables in the rule                
                if item.startswith("<") and item.endswith(">"):
                        worda = item.strip("<>")                        
#Substitute that word for its match in the input                        
                        x, y = match.span(worda)
                        word1 = (input_b[x:y]).strip(string.punctuation)                        
#Add the item to the dictionary. If there turns out to be no match, the backup can still be used.                        
                        dicto[item] = word1                        
#Replace the words in the backup of the output with the ones in the dictionary.                        
            cutend1 = multiwordReplace(cutend1, dicto)                
#If <> is still present, this means a variable has not been defined and so this rule cannot be used            
            if "<" in cutend1:                
#Restore to backup copy of dictionary                
                dicto = dicti                
#Continue trying lines                
                continue
            else:                
#If no <> are present, this means all variables have been defined and so the replacing can safely be used on the actual output                
                cutend = multiwordReplace(cutend, dicto)
                input_words = raw_input(cutend + "> ")
                go_on = False
#If the input does not match the particular line being tested, move to the next one                
        if match is None:
            continue
            
    if go_on:
        input_words = raw_input("Please go on.\n> ")

And here is my rules.txt:

i feel | How often do you feel that way?
i am <state> | How long have you been <state>?
you <w1> me | What makes you think I <w1> you?
you <w1> <w2> me | What makes you think I <w1> <w2> you?
you <w1> <w2> <w3> me | What makes you think I <w1> <w2> <w3> you?
you <w1> <w2> <w3> <w4> me | What makes you think I <w1> <w2> <w3> <w4> you?
go away | I hope I have helped you!

I'm thinking maybe it's an indentation problem but I don't know :(
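
The cause is not confirmed in the thread. One possibility, purely an assumption, is that rules.txt contains a line without a "|" (for example a blank line at the end) and that the for loop keeps reading further rule lines after a reply has been printed, because nothing breaks out of it. A sketch that parses the rules defensively with str.partition(), which never raises when the separator is missing (load_rules is a hypothetical helper, Python 3):

```python
def load_rules(lines):
    """Return (pattern, response) pairs, skipping lines without a '|'."""
    rules = []
    for line in lines:
        pattern, sep, response = line.partition("|")
        if not sep:
            # Blank or malformed line: line.index("|") would raise
            # ValueError here, so skip it instead.
            continue
        rules.append((pattern.strip(), response.strip()))
    return rules

sample = [
    "i feel | How often do you feel that way?\n",
    "i am <state> | How long have you been <state>?\n",
    "\n",
]
rules = load_rules(sample)
print(rules)
```

With the rules loaded once into a list before the input loop, the matching loop can also stop with break as soon as one rule has produced a reply.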

