Split on "study", or use a normal split on whitespace, which will also work: the number.txt filename will be the last element (index -1), and the study's description will be everything except the last element.
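A minimal sketch of the whitespace split described above. The sample line is an assumption, modelled on the entries that appear later in this thread; the real spans.txt may differ.

```python
# Hypothetical sample line (assumed format: description words, then <number>.txt).
line = "prospective cross-sectional study 21225178.txt\n"

parts = line.split()        # split on any run of whitespace, dropping the newline
filename = parts[-1]        # the <number>.txt name is the last element
description = parts[:-1]    # everything before it is the study's description

print(filename)             # 21225178.txt
print(description)          # ['prospective', 'cross-sectional', 'study']
```

Splitting on whitespace avoids depending on the word "study", which not every description contains (for example "retrospective cohort").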
woooee
Nearly a Posting Maven
2,454 posts since Dec 2006
Reputation Points: 777
Solved Threads: 714
I suggest that you start by reading the data and producing a dictionary like this one:
{
    "21225114.txt" : [('cross-sectional', 'study')],
    "21225178.txt" : [('cross-sectional', 'study'), ('prospective', 'cross-sectional', 'study')],
    "21225558.txt" : [('retrospective', 'cohort'), ('retrospective', 'cohort', 'study'), ('cohort', 'study')]
}
This part is easy. Then think about an algorithm to choose the best tuple for each txt file.
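One possible way to choose, sketched under the assumption that "best" means the description with the most words. The helper name choose_best is hypothetical, and the data is the example dictionary from this post.

```python
def choose_best(descriptions):
    """Return the longest tuple; on a tie, max() keeps the first one it sees."""
    return max(descriptions, key=len)

# Example data copied from the dictionary shown in the post.
spans = {
    "21225114.txt": [("cross-sectional", "study")],
    "21225178.txt": [("cross-sectional", "study"),
                     ("prospective", "cross-sectional", "study")],
    "21225558.txt": [("retrospective", "cohort"),
                     ("retrospective", "cohort", "study"),
                     ("cohort", "study")],
}

best = {name: choose_best(descs) for name, descs in spans.items()}
for name in sorted(best):
    print(name, " ".join(best[name]))
```

Length is only a rough stand-in for specificity; two different descriptions of equal length are not distinguished by this rule.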
Gribouillis
Posting Maven
2,786 posts since Jul 2008
Reputation Points: 1,044
Solved Threads: 691
Ok, this is what I did and it works just fine! :D Thank you for replying with the dictionary approach, but I started making it more complex and got lost. Then it finally struck me that I could use the length of the sentence to my advantage ^_^
last_line = ""   # <number>.txt filename seen on the previous line
last_len = 0     # number of words in the best line so far
best = ""        # best line, i.e. the most specific study for each abstract
for line in open('spans.txt', 'r'):    # open the file of interest
    study = line.split()
    if last_line == study[-1]:         # same <number>.txt file as the previous line
        if len(study) > last_len:      # its length is bigger than the best one before
            best = study               # assign the new one as best
            last_len = len(study)
    else:
        if best:                       # skip the empty placeholder on the very first line
            print(best)                # new file: print the best line of the previous one
        best = study
        last_line = study[-1]
        last_len = len(study)
print(best)                            # print the best line of the last file
thanks for the help guys :)
All right. Here is the code to create the dictionary:
from collections import defaultdict

def create_dict(filename):
    # maps each <number>.txt name to the list of its description tuples
    D = defaultdict(list)
    with open(filename) as fin:
        for line in fin:
            L = line.strip().split()
            if not L:   # skip blank lines
                continue
            D[L[-1]].append(tuple(L[:-1]))
    return D
A problem with your algorithm arises when a file has 2 descriptions, say "retrospective cohort" and "cohort study". Neither of them is included in the other, and your algorithm will select the first because the word "retrospective" is longer than "study". How can you be sure that this doesn't happen?
Edit: sorry, it will select the first because it comes first in the list, since the lists ["retrospective", "cohort"] and ["cohort", "study"] have the same length. Don't you think it would be better in this case to generate a "retrospective cohort study" description?
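A rough sketch of how such a combined description could be generated, assuming two descriptions may be joined whenever the end of one equals the start of the other. The function name merge_overlap is hypothetical and this is only one way to do it.

```python
def merge_overlap(a, b):
    """Join tuples a and b if a suffix of a equals a prefix of b, else return None."""
    # Try the longest possible overlap first, then shorter ones.
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
    return None

first = ("retrospective", "cohort")
second = ("cohort", "study")

merged = merge_overlap(first, second)
print(merged)   # ('retrospective', 'cohort', 'study')
```

The order of the arguments matters: merge_overlap(second, first) returns None here because "study" does not match "retrospective", so both orders would have to be tried for each pair.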
Gribouillis