Dear all,

Maybe someone can help me find a solution to the following problem:
I have a list of patterns (each of length 5), represented as tuples in a list, e.g.
patterns = [('w1','X1','w1','Y1','w1'), ('w2','w2','X2','w2','Y2'), ('w2','X2','w2','Y2','w2')]
I want to go through all sentences in a text file (one sentence per line) and extract all occurrences of these patterns in each sentence. The fixed words (w1, w2) in each pattern have to match exactly; only the elements X1, X2, Y1, etc. are placeholders, because what I want to know is which words occur in these positions.
For each line in the file, I can check whether each element of the pattern is in it. But how do I deal with the placeholders X and Y? I can't think of anything to solve this :/ Can anyone help me or point me in the right direction?

Thank you in advance!


#tuples with patterns, the unknown element is an empty string ''
patterns = [('w1','','w3','','w5'), ('w7','w8','','w10',''), ('w8','','w10','','w12')]

sent1: w1 A w3 B w5 w6 w7 w8
sent2: w1 w2 w3 w4 w5 w6 w7 w8 C w10 D w12

#extracted patterns with new words instead of empty strings
extracted_patterns =

Does this give you a hint?

>>> pattern
('w1', '', 'w3', '', 'w5')
>>> line
'w1 A w3 B w5 w6 w7 w8'
>>> for index in range(len(pattern)):
	if pattern[index] in ['', line.split(' ')[index]]:
		print line.split(' ')[index]


Here is the solution I came up with:

>>> lines
['w1 A w3 B w5 w6 w7 w8', 'w1 w2 w3 w4 w5 w6 w7 w8 C w10 D w12']
>>> patterns
[('w1', '', 'w3', '', 'w5'), ('w7', 'w8', '', 'w10', ''), ('w8', '', 'w10', '', 'w12')]
>>> def get_patterns():
	extract = []
	for pattern in patterns:
		for line in lines:
			index = 0
			iter_line = line.split(' ')
			while (len(iter_line) - index) >= len(pattern):
				temp = ''  # candidate match for the window starting at index
				for item in iter_line[index:(len(pattern) + index)]:
					# offset of item inside the window; index() only finds the first occurrence
					i = (iter_line.index(item) - index)
					if pattern[i] not in ['',item]:
						break  # a fixed word differs, so this window cannot match
					temp += ' ' + item
				temp = temp.strip()
				if len(temp.split(' ')) == len(pattern) and temp not in extract:
					extract.append(temp)
				index += 1
	return extract

>>> get_patterns()
['w1 A w3 B w5', 'w1 w2 w3 w4 w5', 'w7 w8 C w10 D', 'w8 C w10 D w12']

I could not figure out why you did not include 'w1 w2 w3 w4 w5' in your example output, unless you only wanted one result per pattern? If that's the case, this logic can be rearranged somewhat to give you a similar answer... granted, there is probably a better way to get this done.
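As a rough sketch of the "one result per pattern" rearrangement mentioned above: stop scanning as soon as a pattern has matched once. The function names `window_matches` and `first_match_per_pattern` are made up for this illustration, and the code is written in Python 3 syntax rather than the Python 2 used in the session above.

```python
def window_matches(pattern, window):
    """True if every fixed word matches; '' in the pattern accepts any word."""
    return all(p in ('', w) for p, w in zip(pattern, window))

def first_match_per_pattern(patterns, lines):
    """Return at most one match (the first one found) for each pattern."""
    results = []
    for pattern in patterns:
        found = None
        for line in lines:
            words = line.split()
            # slide a window of len(pattern) words over the line
            for start in range(len(words) - len(pattern) + 1):
                window = words[start:start + len(pattern)]
                if window_matches(pattern, window):
                    found = ' '.join(window)
                    break
            if found is not None:
                break  # this pattern is done, skip the remaining lines
        if found is not None:
            results.append(found)
    return results

patterns = [('w1', '', 'w3', '', 'w5'),
            ('w7', 'w8', '', 'w10', ''),
            ('w8', '', 'w10', '', 'w12')]
lines = ['w1 A w3 B w5 w6 w7 w8',
         'w1 w2 w3 w4 w5 w6 w7 w8 C w10 D w12']
print(first_match_per_pattern(patterns, lines))
# ['w1 A w3 B w5', 'w7 w8 C w10 D', 'w8 C w10 D w12']
```

Because it pairs pattern and window positions with zip instead of looking words up by value, this version does not depend on each word being unique in the line.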

Dear lukerobi, thank you very much for the detailed answer!!! Your solution works :) If I find a way to do this more efficiently in the future, I will post my solution here.
You were right about the 'w1 w2 w3 w4 w5' pattern: it should indeed have been in the output of my example.

The notation "not in ['', item]" (a list with two elements that you can use to check whether an element is one or the other, right?) was new to me :)
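To spell out that membership test from the solution above: `x in ['', word]` is true when `x` equals either list element, so it behaves like `x == '' or x == word`. A tiny sketch (the helper name `slot_matches` is only for illustration, Python 3 syntax):

```python
def slot_matches(pattern_item, word):
    """True if pattern_item is the placeholder '' or is equal to word."""
    return pattern_item in ['', word]

print(slot_matches('', 'A'))      # True: the placeholder accepts any word
print(slot_matches('w3', 'w3'))   # True: the fixed word is identical
print(slot_matches('w3', 'w4'))   # False: the fixed word differs
```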

Thanks again,

You are very welcome... I am glad to help :)

Hello again,
today I finally succeeded in adjusting the program suggested earlier. I wanted to make several changes because:
- the program didn't work when the same word occurred more than once in a sentence (index() always returned the first occurrence of that word)
- when a pattern successfully matched, the next iteration started with an element that was already part of the match. That is, given
pattern = [('w1','','w3','','w5')]
sentence =
when the first 5 elements of the sentence were extracted as a candidate pattern, the program continued with w2 (comparing it to 'w1','','w3', etc.). Instead, I wanted to continue with element w6.
- the program compared the 5 elements of a pattern to 5 consecutive elements in a sentence. Now, it compares the first element of a pattern to each successive element of a sentence until they match.
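The first point in the list above can be reproduced in isolation. This is a small sketch in Python 3 syntax with a made-up sentence in which 'w1' repeats: `list.index` reports the first position for every repeat, while `enumerate` yields the real position of each word.

```python
words = 'w1 A w1 B w1'.split()

# list.index() returns the position of the FIRST equal element
print([words.index(w) for w in words])        # [0, 1, 0, 3, 0]

# enumerate() pairs every word with its actual position
print([pos for pos, w in enumerate(words)])   # [0, 1, 2, 3, 4]
```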

Here is the code :)
Please let me know if I can improve it any further!

patterntuples = [('w1', '', 'w3', '', 'w5')]
sentences = ['ww w1 A w3 B w5 w6 w7 w8 w9 w10 w1 w2 w3 w4 w5 w17']

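def get_patterns(patterntuples,sentences):
    extracted = []
    for pattern in patterntuples:
        for sent in sentences:
            index = 0 #starting position in the sentence
            splittedline = sent.split(' ')
            while (len(splittedline) - index) >= len(pattern):
                temp = []
                for nr, word in enumerate(splittedline[index:(len(pattern) + index)]):
                    if pattern[nr] in ['',word]:
                        temp.append(word) #word fits this position of the pattern
                if len(temp) == len(pattern):
                    #every position matched: store the match as a space-joined string, as in the earlier version
                    extracted.append(' '.join(temp))
                    index = index + len(pattern) #continue after the matched words
                else:
                    index = index + 1 #no match here, try the next starting position
    return extracted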
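Since the original goal was to find out which words occur in the placeholder positions, one possible refinement is to return only those words. The following is a minimal sketch in Python 3 syntax, not a definitive version; the function name `extract_fillers` and the (pattern, fillers) result format are assumptions made for this example. It keeps the non-overlapping scan described above.

```python
def extract_fillers(patterns, sentences):
    """Collect the words that fill the '' slots of each matching window."""
    results = []
    for pattern in patterns:
        size = len(pattern)
        for sent in sentences:
            words = sent.split()
            start = 0
            while start + size <= len(words):
                window = words[start:start + size]
                # '' accepts any word, every other element must be identical
                if all(p in ('', w) for p, w in zip(pattern, window)):
                    fillers = tuple(w for p, w in zip(pattern, window) if p == '')
                    results.append((pattern, fillers))
                    start += size   # continue after the matched words
                else:
                    start += 1      # try the next starting position
    return results

patterntuples = [('w1', '', 'w3', '', 'w5')]
sentences = ['ww w1 A w3 B w5 w6 w7 w8 w9 w10 w1 w2 w3 w4 w5 w17']
for pattern, fillers in extract_fillers(patterntuples, sentences):
    print(pattern, fillers)
```

For the sample sentence this yields the filler pairs ('A', 'B') and ('w2', 'w4').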

Have a nice day,