Hello all,

How are you? Hope you are well. Just a quick question. I have a file which contains abstracts (short texts), and I am looking for certain keywords and their frequency. The keywords come from another file. I was thinking of reading the keyword file first and then scanning the abstract file. However, when I find a keyword, I would also like to get the word before it and the word after it. Any thoughts on how I can do that?

example

Keyword_file looks like this:

George
myself

abstract_file looks like this:

hello, my name is george. How are you? Today
i am not feeling very well. I consider myself to be
sick.

So I want to find the words 'george' and 'myself', as well as their neighbours: 'is' and 'how' around 'george', and 'consider' and 'to' around 'myself'.

Any suggestions?


You can iterate over the words, or build a list of them, with a regular expression (the re module):

import re

abstract = """hello, my name is george. How are you? Today
i am not feeling very well. I consider myself to be
sick."""

word_pattern = re.compile(r"\w+")

print list(word_pattern.findall(abstract))

""" my output -->
['hello', 'my', 'name', 'is', 'george', 'How', 'are', 'you', 'Today', 'i', 'am', 'not', 'feeling', 'very', 'well', 'I', 'consider', 'myself', 'to', 'be', 'sick']
"""

I tried the list, and I think a solution can be found along those lines. However, I do not know how to search for the words before and after a keyword. For example:

import re

# read the abstract file and split it into lowercase words
filename='abex.txt'
wordlist=re.split(r'\s+', file(filename).read().lower())
print 'words in text:', len(wordlist)
print wordlist
print ""

# read the keyword file the same way
filename2='singleheads.txt'
wordlist2=re.split(r'\s+', file(filename2).read().lower())
print 'words in the single head file:', len(wordlist2)
print wordlist2

# print every keyword that also occurs in the text
for word in wordlist2:
    if word in wordlist:
        print word

As you can see, this code prints only the common words, which is great. But I cannot work out how to make it fetch the words before and after each keyword. Is there a function that can do that?

You can use enumerate. Suppose you have two lists, keywords and word_list:

keyword_set = set(keywords) # better use a set
for i, w in enumerate(word_list):
    if w in keyword_set:
        word_before = word_list[i-1] if i > 0 else ''
        word_after = word_list[i+1] if i+1 < len(word_list) else ''
        print("%s <%s> %s" % (word_before, w, word_after))
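
For what it's worth, here is the same idea as a small self-contained sketch that can be run as-is. The function name keyword_contexts and the sample data are made up for illustration. It also lowercases both the text and the keywords, which is an assumption on my part, since your keyword file has 'George' while the text has 'george'.

```python
import re

def keyword_contexts(text, keywords):
    """Return (before, keyword, after) tuples for every keyword hit in text.

    Illustrative helper: the name and the lowercasing are assumptions.
    """
    words = re.findall(r"\w+", text.lower())
    keyword_set = set(k.lower() for k in keywords)
    hits = []
    for i, w in enumerate(words):
        if w in keyword_set:
            # fall back to an empty string at either end of the text
            before = words[i - 1] if i > 0 else ''
            after = words[i + 1] if i + 1 < len(words) else ''
            hits.append((before, w, after))
    return hits

abstract = """hello, my name is george. How are you? Today
i am not feeling very well. I consider myself to be
sick."""

for before, w, after in keyword_contexts(abstract, ["George", "myself"]):
    print("%s <%s> %s" % (before, w, after))
```

Collecting the hits in a list instead of printing them directly also makes it easy to count them afterwards if you need frequencies.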

Thanks! It worked great! But here is another problem. Take this code:

filename='abex.txt'
wordlist=re.split(r'\s+', file(filename).read().lower())  # convert everything to lowercase
filename2='singleheads.txt'
wordlist2=re.split(r'\s+', file(filename2).read().lower())

punctuation=re.compile(r'[.?!,":;]')   #remove the punctuation
for word in wordlist:
    word=punctuation.sub("",word) 

keyword_set = set(wordlist2)
for i,w in enumerate(wordlist): # enumerate yields (index, word) pairs
    if w in keyword_set:
        before_word = wordlist[i-1] if i > 0 else ''
        after_word = wordlist[i+1] if i+1 < len(wordlist) else ''
        print "%s <%s> %s" % (before_word,w,after_word)

It gives nice results! But words that have a . ? ! or " attached to them are missed, since the keyword list contains only "clean" words. I tried to fix it with this:

punctuation=re.compile(r'[.?!,":;]')   #remove the punctuation
for word in wordlist:
    word=punctuation.sub("",word)

These three lines seem to work fine, but I cannot make the connection between the new (now clean) wordlist and the set. I keep banging my head against the wall but cannot find a way. Something tells me it is going to be very, very simple!

Forget about punctuation: build the words list as I did in my first post above, using the regex r'\w+'.
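
To spell out why the punctuation loop above had no effect: assigning to word inside the for loop only rebinds the loop variable, so the list itself is never touched. A minimal sketch, using a made-up sample sentence, that shows this and two ways to actually get clean words:

```python
import re

text = "I consider myself to be sick. Is that george?"
wordlist = re.split(r"\s+", text.lower())

punctuation = re.compile(r'[.?!,":;]')

# Rebinding the loop variable leaves the list unchanged.
for word in wordlist:
    word = punctuation.sub("", word)
unchanged = list(wordlist)  # still contains 'sick.' and 'george?'

# Option 1: build a new list from the cleaned words.
cleaned = [punctuation.sub("", word) for word in wordlist]

# Option 2: let the regex pick out the words directly.
words = re.findall(r"\w+", text.lower())

print(unchanged)
print(cleaned)
print(words)
```

Either cleaned or words can then be passed to enumerate in place of the original wordlist. The r'\w+' version is shorter and does not depend on listing every punctuation character by hand.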

I did, and it works fine! Thank you very much for your help :-) For the record, this is my final algorithm:

import re
terms='singleheads.txt'
wordlist=re.split(r'\s+', file(terms).read().lower())

abstract=open('abex.txt','r')
abstract2=abstract.read().lower() 
abstract3=str(abstract2)

word_pattern = re.compile(r"\w+")
doom=list(word_pattern.findall(abstract3))
print doom
print ""

keyword_set = set(wordlist)
for i,w in enumerate(doom): # enumerate yields (index, word) pairs
    if w in keyword_set:
        before_word = doom[i-1] if i > 0 else ''
        after_word = doom[i+1] if i+1 < len(doom) else ''
        print "%s <%s> %s" % (before_word,w,after_word)
        sephiroth=open('staib.txt','a')
        sephiroth.write("%s <%s> %s\n" % (before_word,w,after_word))
        sephiroth.close()

thanks for the help!

Nice, but you should not open 'staib.txt' for each iteration. Open it before the loop starts and close it after the loop. Opening a file is an expensive system call.
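
Something along these lines, as a sketch: the output file is opened once before the loop and closed automatically when the with block ends. The helper name write_contexts and the sample words are only for illustration, and a temporary file stands in for 'staib.txt'. The 'a' (append) mode is kept from the code above.

```python
import os
import tempfile

def write_contexts(words, keyword_set, out_path):
    """Append one 'before <keyword> after' line per keyword hit (illustrative helper)."""
    # Open the output file once, outside the loop.
    with open(out_path, 'a') as out:
        for i, w in enumerate(words):
            if w in keyword_set:
                before = words[i - 1] if i > 0 else ''
                after = words[i + 1] if i + 1 < len(words) else ''
                out.write("%s <%s> %s\n" % (before, w, after))

# Sample data; the temporary path replaces the real output file here.
doom = ['my', 'name', 'is', 'george', 'how', 'are', 'you']
out_path = os.path.join(tempfile.mkdtemp(), 'staib.txt')
write_contexts(doom, set(['george']), out_path)

with open(out_path) as f:
    result = f.read()
print(result)
```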

ok :-) thanks for the tip :-))

print list(word_pattern.findall(abstract))

Just a tip: re.findall already returns a list, so there is no need to wrap it in list().

import re

text = '''\
hello, my name is george. How are you? Today
i am not feeling very well. I consider myself to be
sick.
'''

word_pattern = re.findall(r'\w+', text)
print word_pattern

""" Out-->
['hello', 'my', 'name', 'is', 'george', 'How', 'are', 'you', 'Today', 'i', 'am', 'not', 'feeling', 'very', 'well', 'I', 'consider', 'myself', 'to', 'be', 'sick']
"""

thanks snippsat :-)
