
reading lines and store values

Hello,

I have a txt file which contains data in this format:

cross-sectional study 21225114.txt
prospective cross-sectional study 21225178.txt
cross-sectional study 21225178.txt
retrospective cohort 21225558.txt
retrospective cohort study 21225558.txt
cohort study 21225558.txt

This shows which type of study each txt file describes. The problem is that some files have more than one study type, where one is a subtype of the other, e.g.:

prospective cross-sectional study 21225178.txt
cross-sectional study 21225178.txt


Thus, I want only one study type per txt file, and it has to be the most specific one, like this:

prospective cross-sectional study 21225178.txt
retrospective cohort study 21225558.txt

Any thoughts on how I can do that? I am completely stuck :( :( I would have done it with line.split()[something], but the number of columns keeps changing because the more specific study types have more words.

doomas10
Newbie Poster
21 posts since Jul 2010
Reputation Points: 10
Solved Threads: 0
 

Split on "study", or do a normal split on whitespace: the number.txt file name will be the last element (index -1), and the study's description will be everything except the last element.
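A minimal sketch of the whitespace split, assuming every line ends with the file name as its last whitespace-separated token. The helper name `split_line` is made up for illustration and is not from the thread.

```python
def split_line(line):
    """Split one line into (description, filename).

    Assumes the file name is the last whitespace-separated token;
    a line with fewer than two tokens raises ValueError.
    """
    description, filename = line.strip().rsplit(None, 1)
    return description, filename


print(split_line("prospective cross-sectional study 21225178.txt"))
```

Splitting from the right means the number of words in the description does not matter.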

woooee
Nearly a Posting Maven
2,454 posts since Dec 2006
Reputation Points: 777
Solved Threads: 714
 

I see. However, how am I supposed to keep the longest study type for each file? Since I have two or three types for the same file, how can I say:

if you have seen this file more than once, then keep the more specific study type?

doomas10
Newbie Poster
21 posts since Jul 2010
Reputation Points: 10
Solved Threads: 0
 

I suggest that you start by reading the data and producing a dictionary like this one

{
  "21225114.txt" : [('cross-sectional', 'study')],
  "21225178.txt" : [('cross-sectional', 'study'), ('prospective', 'cross-sectional', 'study')],
  "21225558.txt" : [('retrospective', 'cohort'), ('retrospective', 'cohort', 'study'), ('cohort', 'study')]
}

This part is easy. Then think about an algorithm to choose the best tuple for each txt file.
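One possible selection rule, as a sketch only: treat the tuple with the most words as the most specific one. That is an assumption based on the "keep the longest study type" idea in this thread, and the function name `most_specific` is hypothetical.

```python
def most_specific(descriptions):
    """Return the tuple with the most words; on a tie, the first one wins."""
    return max(descriptions, key=len)


studies = {
    "21225114.txt": [("cross-sectional", "study")],
    "21225178.txt": [("cross-sectional", "study"),
                     ("prospective", "cross-sectional", "study")],
    "21225558.txt": [("retrospective", "cohort"),
                     ("retrospective", "cohort", "study"),
                     ("cohort", "study")],
}

for filename in sorted(studies):
    print(" ".join(most_specific(studies[filename])), filename)
```

Word count is only a proxy for specificity: two different descriptions of equal length are not distinguished, and the first one listed is returned.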

Gribouillis
Posting Maven
Moderator
2,786 posts since Jul 2008
Reputation Points: 1,044
Solved Threads: 691
 

OK, this is what I did, and it works just fine! :D Thank you for replying with the dictionary approach, but I started making it more complex and I got lost. Then it finally struck me that I could use the length of the line to my advantage ^_^

# Assumes that all lines for the same txt file are next to each other.
last_line = None   # file name (last token) of the previous line
last_len = 0       # number of words in the best line so far for that file
best = None        # most specific (longest) line for the current file
with open('spans.txt') as spans:           # open the file of interest
    for line in spans:
        study = line.split()
        if not study:                      # skip blank lines
            continue
        if study[-1] == last_line:         # same <number>.txt file as the previous line
            if len(study) > last_len:      # longer than the best so far: keep it
                best = study
                last_len = len(study)
        else:                              # a new txt file starts here
            if best is not None:           # nothing to print before the first line
                print(best)                # print the best line of the previous file
            best = study
            last_line = study[-1]
            last_len = len(study)
if best is not None:
    print(best)                            # print the best line of the last file


Thanks for the help, guys :)

doomas10
Newbie Poster
21 posts since Jul 2010
Reputation Points: 10
Solved Threads: 0
 


All right. Here is the code to create the dictionary

from collections import defaultdict

def create_dict(filename):
    D = defaultdict(list)
    with open(filename) as fin:
        for line in fin:
            L = line.strip().split()
            D[L[-1]].append(tuple(L[:-1]))
    return D

A problem with your algorithm arises when a file has 2 descriptions, say "retrospective cohort" and "cohort study". Neither is included in the other, and your algorithm will select the first because the word "retrospective" is longer than "study". How can you be sure that this doesn't happen?
Edit: sorry, it will select the first because it comes first in the list, since the lists ["retrospective", "cohort"] and ["cohort", "study"] have the same length. Don't you think it would be better in this case to generate a "retrospective cohort study" description?
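A rough sketch of how such a combined description could be generated, assuming the two descriptions overlap where the end of one repeats the start of the other. The function name `merge_overlap` is hypothetical and the approach is only one option, not something settled in this thread.

```python
def merge_overlap(first, second):
    """Join two word tuples where the end of `first` repeats the start of `second`.

    Returns the merged tuple, or None when there is no such overlap.
    """
    # Try the largest possible overlap first, then shrink it.
    for size in range(min(len(first), len(second)), 0, -1):
        if first[-size:] == second[:size]:
            return first + second[size:]
    return None


print(merge_overlap(("retrospective", "cohort"), ("cohort", "study")))
```

The order of the arguments matters, so a caller would have to try both orders and fall back to another rule when neither overlaps.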

Gribouillis
Posting Maven
Moderator
2,786 posts since Jul 2008
Reputation Points: 1,044
Solved Threads: 691
 

Thanks, and you are right. Whenever a file has more than one line of the same length, it will choose the first one. It was a risk I was willing to take :)

However, after implementing this algorithm, I am going to normalize the study types, so that, for example, [retrospective cohort] becomes [retrospective cohort study].

Thank you for the code, though. I will try it and let you know. Perhaps your way could be easier :)
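The normalization step mentioned above might look roughly like this. It is a sketch that assumes the only normalization needed is appending a missing trailing "study"; the name `normalize` is made up for illustration.

```python
def normalize(description):
    """Append 'study' to a description that does not already end with it."""
    words = description.split()
    if words and words[-1] != "study":
        words.append("study")
    return " ".join(words)


print(normalize("retrospective cohort"))
```

Descriptions that already end in "study" pass through unchanged.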

doomas10
Newbie Poster
21 posts since Jul 2010
Reputation Points: 10
Solved Threads: 0
 

This question has already been solved
