Hello,

I have a txt file which contains data in this format:

cross-sectional study 21225114.txt
prospective cross-sectional study 21225178.txt
cross-sectional study 21225178.txt
retrospective cohort 21225558.txt
retrospective cohort study 21225558.txt
cohort study 21225558.txt

This shows which type of study each of the txt files contains. The problem is that some files have more than one study type, where one is a subtype of the other, e.g.:

prospective cross-sectional study 21225178.txt
cross-sectional study 21225178.txt

So I want only one study type per txt file, and it has to be the most specific one, like this:

prospective cross-sectional study 21225178.txt
retrospective cohort study 21225558.txt

Any thoughts on how I can do that? I am completely stuck :( I would have done it with line.split()[something], but the number of columns keeps changing because the more specific studies have more words.

Split on "study", or use a normal split on space(s), which will also work: the number.txt will always be at index -1, and the study's description will be everything except the last element.
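A minimal sketch of the whitespace split described above. It uses the whitespace variant because not every line in the sample contains the word "study" (e.g. "retrospective cohort 21225558.txt"). The function name split_line is only illustrative and does not come from the thread.

```python
def split_line(line):
    """Split 'description ... number.txt' into (description, filename)."""
    # rsplit with maxsplit=1 cuts only at the last run of whitespace,
    # so the description keeps however many words it has.
    description, filename = line.strip().rsplit(None, 1)
    return description, filename

print(split_line("prospective cross-sectional study 21225178.txt"))
# ('prospective cross-sectional study', '21225178.txt')
```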

Edited 5 Years Ago by woooee: n/a

I see. However, how am I supposed to keep the longest study type for each file? Since I have two or three types for the same file, how can I say:

"if you have seen this file more than once, then keep the more specific study type"?

I suggest that you start by reading the data and producing a dictionary like this one

{
  "21225114.txt" : [('cross-sectional', 'study')],
  "21225178.txt" : [('cross-sectional', 'study'), ('prospective', 'cross-sectional', 'study')],
  "21225558.txt" : [('retrospective', 'cohort'), ('retrospective', 'cohort', 'study'), ('cohort', 'study')]
}

This part is easy. Then think about an algorithm to choose the best tuple for each txt file.
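One possible choice rule, sketched under the assumption that "most specific" simply means "most words". The helper name pick_most_specific is hypothetical, and the dictionary is the example one from above. Note that max() returns the first of several equally long tuples, so ties are not resolved in any meaningful way.

```python
def pick_most_specific(descriptions):
    """Return the description with the most words (the first one wins a tie)."""
    return max(descriptions, key=len)

studies = {
    "21225114.txt": [("cross-sectional", "study")],
    "21225178.txt": [("cross-sectional", "study"),
                     ("prospective", "cross-sectional", "study")],
    "21225558.txt": [("retrospective", "cohort"),
                     ("retrospective", "cohort", "study"),
                     ("cohort", "study")],
}

for name in sorted(studies):
    print(" ".join(pick_most_specific(studies[name])), name)
# cross-sectional study 21225114.txt
# prospective cross-sectional study 21225178.txt
# retrospective cohort study 21225558.txt
```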

Edited 5 Years Ago by Gribouillis: n/a

OK, this is what I did and it works just fine! :D Thank you for replying with the dictionary approach, but I started making it more complex and got lost. Then it finally struck me that I could use the length of the line to my advantage ^_^

# Assumes all lines for the same <number>.txt file are adjacent in spans.txt
last_file = None        # file name (last column) of the group being read
last_len = 0            # word count of the best line seen so far for that file
best = None             # most specific (longest) line for the current file
with open('spans.txt') as fin:              # open the file of interest
    for line in fin:
        study = line.split()
        if not study:                       # skip blank lines
            continue
        if study[-1] == last_file:          # same <number>.txt as the previous line
            if len(study) > last_len:       # longer than the best so far,
                best = study                # so it becomes the new best
                last_len = len(study)
        else:
            if best is not None:            # a new file starts: print the previous best
                print(' '.join(best))
            best = study
            last_file = study[-1]
            last_len = len(study)
if best is not None:
    print(' '.join(best))                   # print the best line of the last file

Thanks for the help, guys :)

All right. Here is the code to create the dictionary:

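from collections import defaultdict

def create_dict(filename):
    """Map each file name to the list of study descriptions found for it."""
    D = defaultdict(list)
    with open(filename) as fin:
        for line in fin:
            L = line.split()
            if not L:                # skip blank lines
                continue
            # the last word is the file name, the rest is the description
            D[L[-1]].append(tuple(L[:-1]))
    return D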

A problem with your algorithm arises when a file has two descriptions, say "retrospective cohort" and "cohort study". Neither of them is included in the other, and your algorithm will select the first because the word "retrospective" is longer than "study". How can you be sure that this doesn't happen?
Edit: sorry, it will select the first because it comes first in the list, since the lists ["retrospective", "cohort"] and ["cohort", "study"] have the same length. Don't you think it would be better in this case to generate a "retrospective cohort study" description?
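Generating the combined description could look roughly like this. It is only a sketch: merge_overlap is a hypothetical helper, and it assumes the two descriptions should be joined only where the end of one repeats the start of the other.

```python
def merge_overlap(a, b):
    """Join word lists a and b where the end of a repeats the start of b.

    Returns None when the two lists share no words at the seam.
    """
    # Try the longest possible overlap first, then shorter ones.
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
    return None

print(merge_overlap(["retrospective", "cohort"], ["cohort", "study"]))
# ['retrospective', 'cohort', 'study']
```

Since the order matters, a caller would probably try both merge_overlap(a, b) and merge_overlap(b, a) before falling back to the longer description.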

Edited 5 Years Ago by Gribouillis: n/a

Thanks, and you are right. Whenever a file has more than one line of the same length, it will choose the first one. It was a risk I was willing to take :)

However, after implementing this algorithm, I am going to normalize the study types, so that, for example, [retrospective cohort] becomes [retrospective cohort study].

Thank you for the code, though. I will try it and let you know. Perhaps it would be easier your way :)
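The normalization step mentioned above was not spelled out in the thread. One guess at it, assuming it only means "append the word study when it is missing" (the function name normalize is made up for illustration):

```python
def normalize(description):
    """Append 'study' to a description that does not already end with it."""
    words = description.split()
    if words and words[-1] != "study":
        words.append("study")
    return " ".join(words)

print(normalize("retrospective cohort"))        # retrospective cohort study
print(normalize("retrospective cohort study"))  # retrospective cohort study
```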
