Hi,
This is just part of a bigger program I'm writing, but at the moment I don't know how to do what I'm about to say.
So first, the user enters a sentence. Now, I have a text file with sentences to match it to. So let's say the first line in the text file is:
i am <state> | Do you often feel <state>?

So what I need to do is check whether the input matches the part before the | and if so, print the part after the |. The <state> is represented by any word. So if the user inputs 'i am hungry', then <state> would be hungry, and the output would be 'Do you often feel hungry'. The problem is I don't know how to check the input like that when I don't know one of the words. I can't remove the <state> and use indexing to get the next word because:
Another rule may not have the missing word at the start or end eg.
can you <w1> <w2> <w3> something | Why should I <w1> <w2> <w3> something?
The input might be 'can you help me with something' and the output would then be 'Why should I help you with something?'

One requirement of the program is that I need to store the word(s) in question for later use, I'm assuming as a dictionary. So taking the above into account, the dictionary would be {'<state>': 'hungry', '<w1>': 'help', '<w2>': 'me', '<w3>': 'with'}. This would assist in printing the output, as I could look up any word written with <> and, once the user has said it, the value would be the word I need.

So how can I match sentences when one or more of the words are not known in advance?

The simplest tool is regular expressions (the re module). For example

>>> import re
>>> state_re = re.compile(r"i am (?P<state>\w+)")
>>> match = state_re.match("i am hungry")
>>> print match.group("state")
hungry

This has serious limitations and you may need a real parser for your grammar rules. With some learning effort, you could use the "natural language toolkit" (the module nltk), which contains tools to parse English sentences. See http://www.nltk.org/

Thanks that works. So if the first line was: i am <something> How can I change the <something> to (?P<something>\w+)? In other words, changing it so I can use the re.compile function?

And, how can I only make it so it performs the print when the match is true? I tried

if state_re.match(input_words) ==True:
    print match.group("state")

but it doesn't print anything regardless...

I think you will have to write the regular expressions by hand: parsing templates to produce valid regular expressions is another problem. The first thing to do is to read the re module documentation (also have a look at the Python tutorials).
The match() method returns a "match object" when the sentence matches and None if it doesn't, so the right test is: if match is not None: ...
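A minimal sketch of that test, written so that it also runs on Python 3 (hence print() as a function). The helper name state_of is made up for illustration and is not part of the thread's code:

```python
import re

state_re = re.compile(r"i am (?P<state>\w+)")

def state_of(sentence):
    """Return the captured <state> word, or None when the rule does not match."""
    match = state_re.match(sentence)
    # match() returns a match object or None, never True or False,
    # so compare against None instead of True.
    if match is not None:
        return match.group("state")
    return None

print(state_of("i am hungry"))
print(state_of("hello there"))
```

The comparison with True in the attempt above can never succeed, because a match object is not equal to True even though it counts as true in an if statement.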

Well that's just it. I have to use a template because I can't predict the sentence I am going to match. And so my code has to be generic; to work for anything. The user is supposed to be able to add their own rules to the text file for me to match with the input.

Then you can try simple regex substitution like this

>>> import re
>>> template_re = re.compile(r"<\w+>")
>>> def template_helper(match):
...  return "(?P{0}\w+)".format(match.group(0))
... 
>>> template = "i am <state>"
>>> regex = template_re.sub(template_helper, template)
>>> print repr(regex)
'i am (?P<state>\\w+)'
>>> regex = re.compile(regex)
>>> match = regex.match("i am hungry")
>>> match.group("state")
'hungry'

This could be a starting point.

Thanks :) I THINK that's done what I am trying to do, but I need to continue writing the program to make sure.

EDIT: What about when there are multiple <> placeholder things? Like: you <w1> <w2> me

And, it doesn't work when say, the input is 'i suppose i am hungry' It only matches the exact case of 'i am hungry'. So how could I, in clearer terms, make it change from: input = rule, to: input in rule

Use match = myregex.search(sentence) instead of match = myregex.match(sentence). Regular expressions take some time to get used to: don't hesitate to explore the re module and the examples in tutorials. What about 'i am Nostradamus'?
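A short sketch of the difference, assuming the same "i am <state>" rule as before (Python 3 syntax):

```python
import re

regex = re.compile(r"i am (?P<state>\w+)")
sentence = "i suppose i am hungry"

# match() only succeeds if the pattern matches at the very start of the string.
at_start = regex.match(sentence)
# search() scans the string for the first position where the pattern matches.
anywhere = regex.search(sentence)

print(at_start)
print(anywhere.group("state"))
```

Here at_start is None because the sentence begins with "i suppose", while anywhere captures "hungry".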

Thanks that's done it. And what about with multiple missing words?

well

>>> template = "can you <w1> <w2> <w3> something"
>>> regex = template_re.sub(template_helper, template)
>>> print regex
can you (?P<w1>\w+) (?P<w2>\w+) (?P<w3>\w+) something
>>> regex = re.compile(regex)
>>> match = regex.match("can you help me with something")
>>> match.group("w2")
'me'

Actually that might work because I only have to take into account up to 4 missing words, not an infinite amount like I was thinking.
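The captured words can also be collected into the dictionary mentioned in the original question. This is only a sketch (Python 3 syntax): it relies on groupdict(), which the thread itself does not use, and the variable names are illustrative:

```python
import re

rule_re = re.compile(r"can you (?P<w1>\w+) (?P<w2>\w+) (?P<w3>\w+) something")
response = "Why should I <w1> <w2> <w3> something?"

match = rule_re.search("can you help me with something")
words = {}
reply = None
if match is not None:
    # groupdict() maps each named group to the text it captured.
    words = match.groupdict()
    reply = response
    for name, word in words.items():
        # Put each captured word back in place of its <name> placeholder.
        reply = reply.replace("<" + name + ">", word)

print(words)
print(reply)
```

Note that the keys here are the bare group names (w1, w2, w3) without the angle brackets, so the brackets are added back when substituting into the response.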

You can add some features, for example allow arbitrary white space

spaces = re.compile(r"\s+")
template = "can you <w1> <w2> <w3> something"
template = spaces.sub(r"\s+", template)
# the rest as before

this allows the user to enter

can   you help         me with     something
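One caveat if this is ever run on Python 3.7 or later rather than the Python 2 used in this thread: there, a replacement string such as r"\s+" is rejected by sub() as a bad escape. A possible workaround, sketched below, is to pass a function as the replacement, since its return value is inserted literally:

```python
import re

spaces = re.compile(r"\s+")
pattern_text = r"can you (?P<w1>\w+) (?P<w2>\w+) (?P<w3>\w+) something"

# The function's return value is used as-is, so the backslash in \s+
# is not processed as an escape sequence of the replacement template.
pattern_text = spaces.sub(lambda m: r"\s+", pattern_text)
regex = re.compile(pattern_text)

match = regex.match("can   you help         me with     something")
print(match.group("w1"), match.group("w2"), match.group("w3"))
```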

I had a problem but I fixed it so dw about this post ;)

Notice that you can extract the placeholder names while substituting in the template:

from functools import partial
from pprint import pprint
import re
spaces_re = re.compile(r"\s+")
template_re = re.compile(r"<\w+>")

def template_helper(words, match):
    word = match.group(0)
    words.append(word[1:-1])
    return "(?P{0}\w+)".format(match.group(0))

def compile_template(template):
    template = spaces_re.sub(r"\s+", template)
    words = list()
    regex = template_re.sub(partial(template_helper, words), template)
    regex = re.compile(regex)
    return regex, words

everything = list()

with open('rules.txt', 'rU') as rules:
    for line in rules:
        cutstart, cutend = line.split("|", 1)
        cutstart, cutend = cutstart.rstrip(), cutend.lstrip()
        regex, words = compile_template(cutstart)
        everything.append((cutstart, cutend, words, regex))

pprint(everything)

""" my output -->
[('i am <state>',
  'How long have you been <state>?\n',
  ['state'],
  <_sre.SRE_Pattern object at 0x7fb012eadcf0>),
 ('you <w1> me',
  'What makes you think I <w1> you?\n',
  ['w1'],
  <_sre.SRE_Pattern object at 0x7fb012eb55e0>),
 ('you <w1> <w2> me',
  'What makes you think I <w1> <w2> you?\n',
  ['w1', 'w2'],
  <_sre.SRE_Pattern object at 0x7fb012eb7730>)]
"""

OK, I've found a problem. To match the input against a rule, the input needs to be in lower case, so I would just use .lower(). However, when the variable is used in the response, the original letter case of the word must be kept intact. How can I do this?

You could use x, y = match.span("w2") to get the position of the word w2 in the matched string, then retrieve the initial word with word = initial_input[x:y]

Hmmm, that works but when the input is eg. 'my name is James.' the output (for a custom rule) should be 'Hi there James!'. Instead, I get 'Hi there James.!' I'm guessing I should strip the ends of punctuation... doing now...

EDIT: Yes that seems to work...
EDIT2: Nope it's stuffed it up :(

Wait, so here's the part of my code where I add the variable to my dictionary.

if match is not None:
                for item in cutstart.split():
                    if item.startswith("<") and item.endswith(">"):
                        worda = item.strip("<>")
                        word1 = match.group(worda)
                        dict[item] = word1

Whereabouts would I put the match.span code?

Well, I understood that you convert input to lowercase, so your code should look like

lower_input = initial_input.lower()
match = regex.search(lower_input)
if match is not None:
    x, y = match.span(worda)
    word1 = initial_input[x:y]

Otherwise it would be better to extract the group names with regex while compiling the template as I wrote above (instead of splitting cutstart, ...)
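A different route, not the one suggested above, is to leave the input untouched and compile the rule with the re.IGNORECASE flag, so the captured group already has the user's original case. A minimal sketch with an assumed "my name is <name>" rule (Python 3 syntax):

```python
import re

# IGNORECASE lets the lower-case rule text match input typed in any case.
regex = re.compile(r"my name is (?P<name>\w+)", re.IGNORECASE)
initial_input = "My name is James."

match = regex.search(initial_input)
name = None
if match is not None:
    # The group comes straight from the original string, so case is preserved.
    name = match.group("name")

print(name)
```

Because \w+ does not match the trailing full stop, the captured name here is James without the punctuation.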

To include punctuation in the cutstart part, use cutstart = cutstart.replace(".",r"\.")
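A more general version of the same idea is to escape every literal part of the rule, not only full stops. The sketch below is an assumption about how that could be done rather than code from this thread; template_to_regex is a made-up name, and it relies on re.split() keeping the placeholders when the pattern has a capturing group (Python 3 syntax):

```python
import re

placeholder_re = re.compile(r"(<\w+>)")

def template_to_regex(template):
    """Turn a rule such as 'my name is <name>.' into a compiled regex."""
    parts = []
    # Splitting on a capturing group keeps the placeholders in the result list.
    for piece in placeholder_re.split(template):
        if placeholder_re.match(piece):
            # A placeholder like <name> becomes the named group (?P<name>\w+).
            parts.append("(?P{0}\\w+)".format(piece))
        else:
            # re.escape() backslash-escapes regex metacharacters such as . or ?
            parts.append(re.escape(piece))
    return re.compile("".join(parts))

regex = template_to_regex("my name is <name>.")
match = regex.search("my name is james.")
print(match.group("name"))
```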

Thanks that solved the problem I was getting.

Ok, I have 2 problems now. The first one is when I input,
'Do you like Python, like me?'
The output is 'What makes you think I like Python lik you?'
Note that though you can't see it, there are 2 spaces after 'Python'
When it should be 'What makes you think I like Python like you?'

In fact, when the input is
'Do you like Python like me?'
The output is 'What makes you think I like Python like you?'

So it's obviously a problem with the comma.

I'll talk about the second when this one gets solved.

Part of my code is like this:

lower_input = input_words.lower()
for letter in string.punctuation:
    lower_input = lower_input.replace(letter, "")
template = cutstart
if "<" and ">" in cutstart:
    regex = template_re.sub(template_helper, template)
    regex = re.compile(regex)
    match = regex.search(lower_input)

    if match is not None:
        for item in cutstart.split():
            if item.startswith("<") and item.endswith(">"):
                worda = item.strip("<>")
                x, y = match.span(worda)
                word1 = (input_words[x:y]).strip(string.punctuation)
                dict[item] = word1
        cutend = multiwordReplace(cutend, dict)
        if "<" in cutend:
            continue
        else:
            input_words = raw_input(cutend + "> ")

First of all, I would not use dict as a variable name, as it hides the built-in type dict.

Also, why don't you post the whole code? It would make things easier.

import re
import string
template_re = re.compile(r"<\w+>")
def template_helper(match):
    return "(?P{0}\w+)".format(match.group(0))

def multiwordReplace(cutend, dict):
    for key in dict:
        cutend = cutend.replace(key, dict[key])
    return cutend

dict = {}
input_words = raw_input("What would you like to talk about?\n" + "> ")
while input_words !="":
    lower_input = input_words.lower()
    for letter in string.punctuation:
        lower_input = lower_input.replace(letter, "")
    rules = open("rules.txt", "rU")
    go_on = True
    for line in rules:
        cutstart = line[:(line.index("|"))].rstrip()
        cutend = line[((line.index("|"))+1):].lstrip()
        
        template = cutstart
        if "<" and ">" in cutstart:    
            regex = template_re.sub(template_helper, template)
            regex = re.compile(regex)
            match = regex.search(lower_input)
            
            
            if match is not None:
                for item in cutstart.split():
                    if item.startswith("<") and item.endswith(">"):
                        worda = item.strip("<>")
                        x, y = match.span(worda)
                        word1 = (input_words[x:y]).strip(string.punctuation)
                        dict[item] = word1
                cutend = multiwordReplace(cutend, dict)
                if "<" in cutend:
                    continue
                else:    
                    input_words = raw_input(cutend + "> ")

                
                go_on = False
                
            if match is None:
                continue
        elif "<" and ">" not in cutstart:

            if cutstart in lower_input:
                go_on = False
                if "<" in cutend:
                    continue
                else:    
                    input_words = raw_input(cutend + "\n> ")
            else:
                continue
    if go_on:
        input_words = raw_input("Please go on.\n> ")

Rules:

i feel | How often do you feel that way?
i am <state> | How long have you been <state>?
you <w1> me | What makes you think I <w1> you?
you <w1> <w2> me | What makes you think I <w1> <w2> you?
you <w1> <w2> <w3> me | What makes you think I <w1> <w2> <w3> you?
you <w1> <w2> <w3> <w4> me | What makes you think I <w1> <w2> <w3> <w4> you?
go away | I hope I have helped you!

Ok, so I changed the code around a bit and now this happens:
When I attempt to match the input to a rule, if some form of punctuation is part of the variable eg. 'don't' then it will refuse to match it. If I try:
Do you? like Dont, like me?
It works fine, returning:
What makes you think I like Dont like you?
However if I try:
Do you? like Don't, like me?
It fails to match the sentence, instead printing me the output I have set for when the sentence matches no rule.

It is as if the <state> variables do not allow for inside punctuation or something...

So, how can I allow it to still match the sentence to a rule successfully, when there is some form of punctuation in the word?

Note that I cannot just remove the punctuation entirely from the input because then when I specify the x, y variables, one sentence will have a different number of characters to the other, and so indexing like so will not work properly (it will produce the output I mentioned 2 posts ago)

If you want to allow "don't", you can replace \w+ with (?:\w|')+ in template_helper(). Experiment with regular expressions; you could install kodos for this, see http://kodos.sourceforge.net/. I think you're asking too much too fast from your program. Think about your program structure and algorithm. Think about a robust way to handle punctuation.
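A small sketch of letting a placeholder match a word that contains an apostrophe. It uses the character class [\w']+, which accepts the same text as an alternation of \w and an apostrophe; the sentence is the already stripped and lower-cased input from the post above (Python 3 syntax):

```python
import re

sentence = "do you like don't like me"

# \w+ stops at the apostrophe, so the three-word rule cannot match.
plain = re.compile(r"you (?P<w1>\w+) (?P<w2>\w+) (?P<w3>\w+) me")
# [\w']+ accepts one or more word characters or apostrophes.
tolerant = re.compile(r"you (?P<w1>[\w']+) (?P<w2>[\w']+) (?P<w3>[\w']+) me")

plain_match = plain.search(sentence)
tolerant_match = tolerant.search(sentence)

print(plain_match)
print(tolerant_match.group("w2"))
```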

Thanks! That worked. Now, is there a way I could allow for ALL middle punctuation eg. commas?

EDIT: I did it manually

return "(?P{0}(?:\w|[!#$%&'()*+,-./:;<=>?@[\]^_`|~])+)".format(match.group(0))

But for some reason when I put {} in it always gives me an error. Well, hopefully those won't be needed.
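The error with {} most likely comes from str.format(), which treats { and } as the start and end of a replacement field; literal braces have to be doubled as {{ and }}. A sketch that sidesteps the issue by using plain concatenation, and that builds the punctuation class from string.punctuation instead of typing it by hand (PUNCT_CLASS is an illustrative name, Python 3 syntax):

```python
import re
import string

# re.escape() backslash-escapes the characters so they are safe inside [...].
PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"

template_re = re.compile(r"<\w+>")

def template_helper(match):
    # match.group(0) is the placeholder with its angle brackets, e.g. "<w1>".
    # Concatenation avoids str.format(), where a literal { or } would
    # have to be written as {{ or }}.
    return "(?P" + match.group(0) + "(?:\\w|" + PUNCT_CLASS + ")+)"

regex = re.compile(template_re.sub(template_helper, "you <w1> me"))
match = regex.search("you {really} me")
print(match.group("w1"))

# If format() is kept, doubled braces produce single literal braces.
doubled = "{{literal}} {0}".format("x")
print(doubled)
```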

Ok I am so close to finishing! But I am suddenly getting an error I wasn't getting before. I figured I did something to the code that stopped it working, so I used ctrl+z to undo everything I did today (because yesterday it was working). Strangely enough, I still get the error.

What happens is my program is looped to accept input right? Well, the first input the user does works perfectly. However, the next time the user inputs something, I always get this error:

Traceback (most recent call last):
  File "C:\Python26\Projects\Week 5\Question 5", line 28, in <module>
    cutstart = line[:(line.index("|"))].rstrip()
ValueError: substring not found

I don't really get how it can't find it the second time if it can indeed find it the first time. And also why I'm getting this error with the code from yesterday (I believe).

Anyway, I am pretty desperate atm and would absolutely LOVE help in getting rid of this error.

Here is my code: (Lots of comments soz)

import re
import string
template_re = re.compile(r"<\w+>")
def template_helper(match):
    return "(?P{0}(?:\w|[!#$%&'()*+,-./:;<=>?@[\]^_`|~])+)".format(match.group(0))

def multiwordReplace(cutend, dicto):
    for key in dicto:
        cutend = cutend.replace(key, dicto[key])
    return cutend

dicto = {}
input_words = raw_input("Hello, my name is Eliza. What would you like to talk about?\n> ")
while input_words !="":
    input_a = []
    dicti = {}
    for item in input_words.split():
        item = item.strip(string.punctuation)
        input_a.append(item)
    input_b = ' '.join(input_a)
    lower_input = input_b.lower()
#Input_b is the input with punctuation removed from sides of words.
#Lower_input  is input_b with all letters lowercase    
    rules = open("rules.txt", "rU")
    go_on = True    
#When go_on is True, unless it is changed to False by a match, it will print Please Go On    
    for line in rules:
        cutstart = line[:(line.index("|"))].rstrip()
#The line which will be used to match the input to the rule
        cutend = line[((line.index("|"))+1):].lstrip()
#The output line associated with that rule        
        template = cutstart
#The rule line used for matching  
        regex = template_re.sub(template_helper, template)
        regex = re.compile(regex)
        match = regex.search(lower_input)        
#Match with input            
#If the match is successful        
        if match is not None:            
#Make backup copies of the dictionary and rule            
            cutend1 = cutend
            dicti = dicto
            for item in cutstart.split():                
#Scan for the words that are variables in the rule                
                if item.startswith("<") and item.endswith(">"):
                        worda = item.strip("<>")                        
#Substitute that word for its match in the input                        
                        x, y = match.span(worda)
                        word1 = (input_b[x:y]).strip(string.punctuation)                        
#Add the item to the dictionary. If there turns out to be no match, the backup can still be used.                        
                        dicto[item] = word1                        
#Replace the words in the backup of the output with the ones in the dictionary.                        
            cutend1 = multiwordReplace(cutend1, dicto)                
#If <> is still present, this means a variable has not been defined and so this rule cannot be used            
            if "<" in cutend1:                
#Restore to backup copy of dictionary                
                dicto = dicti                
#Continue trying lines                
                continue
            else:                
#If no <> are present, this means all variables have been defined and so the replacing can safely be used on the actual output                
                cutend = multiwordReplace(cutend, dicto)
                input_words = raw_input(cutend + "> ")
                go_on = False
#If the input does not match the particular line being tested, move to the next one                
        if match is None:
            continue
            
    if go_on:
        input_words = raw_input("Please go on.\n> ")

And here is my rules.txt:

i feel | How often do you feel that way?
i am <state> | How long have you been <state>?
you <w1> me | What makes you think I <w1> you?
you <w1> <w2> me | What makes you think I <w1> <w2> you?
you <w1> <w2> <w3> me | What makes you think I <w1> <w2> <w3> you?
you <w1> <w2> <w3> <w4> me | What makes you think I <w1> <w2> <w3> <w4> you?
go away | I hope I have helped you!

I'm thinking maybe it's an indentation problem but I don't know :(
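Two things in the posted code may be worth checking; both are guesses that the thread does not confirm. First, the traceback only says that some line read from rules.txt contains no | character, and a blank or trailing line in the file would do that. Because the for loop keeps going through the remaining rules after raw_input() returns (there is no break), such a line would only be reached after the second input has been typed. Second, dicti = dicto does not make a backup: both names refer to the same dictionary. The sketch below (Python 3 syntax, with the made-up helper parse_rules) shows one way to guard against both:

```python
def parse_rules(lines):
    """Yield (pattern_text, response) pairs from an iterable of rule lines."""
    for line in lines:
        # partition() never raises: sep is empty when "|" is missing.
        cutstart, sep, cutend = line.partition("|")
        if not sep:
            continue  # blank or malformed line, so there is nothing to match
        yield cutstart.strip(), cutend.strip()

sample = [
    "i feel | How often do you feel that way?\n",
    "\n",
    "go away | I hope I have helped you!\n",
]
rules = list(parse_rules(sample))
print(rules)

# Assignment only binds a second name to the same dict;
# dict(...) creates an independent shallow copy that works as a backup.
saved = {"<w1>": "like"}
alias = saved
snapshot = dict(saved)
saved["<w2>"] = "python"
print(alias)
print(snapshot)
```

Reading the rules once into a list before the input loop would also avoid reopening rules.txt for every sentence.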
