extract column from text file

sofia85

Hi,

I'm a beginner at Python and I'm trying to extract a specific column from a txt file.

In the file I want to extract the entire column pph2_prob (i.e. column 16, counting from 0). But I want all the values from that column without the header pph2_prob.

How do I accomplish that?

This is a part of what the file looks like (it contains 2 rows and 20 col):

#o_snp_id o_acc o_pos o_aa1 o_aa2 snp_id acc pos aa1 aa2 nt1 nt2 prediction based_on effect pph2_class pph2_prob pph2_FPR pph2_TPR pph2_FDR
BSND_M1I Q8WZ55 1 M I BSND_M1I Q8WZ55 1 M I G T probably damaging alignment deleterious 0.999 0.00692 0.111 0.0222
BSND_M1K Q8WZ55 1 M K BSND_M1K Q8WZ55 1 M K T A probably damaging alignment deleterious 0.999 0.00692 0.111 0.0222


Best,
Sofia

richieking

Sofia, can you please post the data in a more organized form? Working with what you have shown here is virtually impossible.

sofia85

nt1 nt2 pred_effect pph2_class pph2_prob pph2_FPR pph2_TPR pph2_FDR
G T prob_damaging deleterious 0.999 0.00692 0.111 0.0222
T A prob_damaging deleterious 0.999 0.00692 0.111 0.0222
A T prob_damaging deleterious 0.997 0.0208 0.332 0.0665

I have shortened the original file (which contains more columns and more rows). But the general problem is how to extract the data underneath the column pph2_prob, without the header pph2_prob.

Best,
Sofia

sofia85

As you can see now, in the small table there are 4 rows and 8 columns.

pyTony

nt1  nt2   pred_effect   pph2_class   pph2_prob  pph2_FPR    pph2_TPR   pph2_FDR
G    T	 prob_damaging	 deleterious	0.999	   0.00692	0.111	0.0222
T    A	 prob_damaging	 deleterious	0.999	   0.00692	0.111	0.0222
A    T	 prob_damaging	 deleterious	0.997	    0.0208	0.332	0.0665

I have shortened the original file (which contains more columns and more rows). But the general problem is how to extract the data underneath the column pph2_prob, without the header pph2_prob.

Best,
Sofia

You must use code tags for data as well, to preserve the white space.

pyTony

Your header does not align exactly with the data columns. I split on tabs when a line contains tabs, so that a value made of several words stays in one column; otherwise I split on any white space.
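
To illustrate why the choice of separator matters, here is a minimal sketch. The sample line is hypothetical (modelled on the data in this thread) and assumes the real file is tab-separated with a two-word value in one field:

```python
# Hypothetical sample line: tab-separated, with a two-word value in one field.
line = "G\tT\tprobably damaging\tdeleterious\t0.999\n"

# Splitting on tabs keeps each field whole.
by_tab = line.rstrip("\n").split("\t")

# Splitting with no argument breaks on every run of white space,
# so 'probably damaging' turns into two separate fields.
by_space = line.split()

print(by_tab)    # 5 fields
print(by_space)  # 6 fields, every later column is shifted by one
```

If the header and the data rows are split in different ways, the index found in the header can point at the wrong data column, so it is worth printing a split row once to check.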

to_extract = 'pph2_prob'
# smaller sample data without tabs and original post data
for fn in ('genetic.txt', 'genetic2.txt'):
    print('Extracting from %r' % fn)
    with open(fn) as data:
        # analyze header
        header = next(data)
        # first post had tab separation, second not. Adapt to the situation
        header = [h.strip() for h in header.split('\t' if '\t' in header else None)]
        print(header)
        ind = header.index(to_extract)
        print('Extracting value %i from each column' % ind)
        # tab separation if exist, otherwise all white space splits
        pph2_prob = [float(line.split('\t' if '\t' in line else None)[ind]) for line in data]
        print(pph2_prob)
        print('')
        
"""Output:
Extracting from 'genetic.txt'
['nt1', 'nt2', 'pred_effect', 'pph2_class', 'pph2_prob', 'pph2_FPR', 'pph2_TPR', 'pph2_FDR']
Extracting value 4 from each column
[0.00692, 0.00692, 0.0208]

Extracting from 'genetic2.txt'
['#o_snp_id', 'o_acc', 'o_pos', 'o_aa1', 'o_aa2', 'snp_id', 'acc', 'pos', 'aa1', 'aa2', 'nt1', 'nt2', 'prediction', 'based_on', 'effect', 'pph2_class', 'pph2_prob', 'pph2_FPR', 'pph2_TPR', 'pph2_FDR']
Extracting value 16 from each column
[0.999, 0.999]
"""
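
The same header-lookup idea can also be sketched with the standard csv module. This is only a sketch under assumptions: the function name extract_column is made up for illustration, and it assumes the file is strictly tab-separated (header and data alike) and that every value in the chosen column is numeric.

```python
import csv


def extract_column(path, name, delimiter="\t"):
    """Return the values under the header `name` as floats, header excluded."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        # The first row is the header; use it only to find the column index.
        header = [h.strip() for h in next(reader)]
        index = header.index(name)  # raises ValueError if the column is missing
        # Skip empty rows; float() tolerates spaces around the number.
        return [float(row[index]) for row in reader if row]


# Example call (file name taken from the post above):
# values = extract_column('genetic2.txt', 'pph2_prob')
```

Because the header row is consumed by next(reader) before the list is built, the header text never ends up in the result.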

sofia85

Hi,
Thank you for your help, but it still doesn't work. Are you using both of the files I posted in this thread in the code? Also, do I put my directory in the for loop "for fn in ('genetic.txt', 'genetic2.txt'):" instead of genetic.txt, or just the file name? Right now I get an error message saying "with open(fn) as data: IOError: [Errno 21] Is a directory: '/'". What am I doing wrong?

Best,
Anna

pyTony

If the files are in the current working directory (normally the directory you run the script from), the file names suffice. Otherwise you must use the full path, or change to the directory with os.chdir before the loop.
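
A minimal sketch of the full-path variant follows. The directory and file name are placeholders, not taken from the thread. It also shows one possible cause of the "Is a directory: '/'" error, which is only a guess without seeing the actual code: a single path in parentheses without a trailing comma is a plain string, so the loop visits it character by character and the first "file name" is '/'.

```python
import os

# Hypothetical directory and file name; substitute your own.
data_dir = os.path.expanduser("~/data")
paths = (os.path.join(data_dir, "genetic.txt"),)  # trailing comma makes a 1-tuple

for fn in paths:
    if os.path.isfile(fn):
        print("Found %r" % fn)
    else:
        print("No such file: %r" % fn)

# Without the trailing comma the parentheses only group the expression,
# so this is a string, and iterating over it yields single characters.
not_a_tuple = ("/genetic.txt")
print(list(not_a_tuple)[0])  # prints /
```

Checking with os.path.isfile before opening gives a clearer message than the IOError raised by open.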

pyguy62

Also, I tried this with a file someone else posted here a couple of days ago. It had a different format, but I think the idea is the same; I just broke it down differently because of the format:

File:
115     139-28-4313     1056.30

135     706-02-6945      -99.06

143   595-74-5767     4289.07

155     972-87-1379     3300.26
#codes
def extract_col(file, col):
    """Return column number col (counting from 0) of a white-space separated file."""
    columns = []
    with open(file) as f:
        for line in f:
            # split() with no argument splits on any run of white space
            # and also drops the trailing newline
            fields = line.split()
            if fields:  # skip blank lines
                columns.append(fields[col])
    return columns


def extract_second(file):
    # the second column has index 1
    return extract_col(file, 1)
#results
>>>extract_second('help.txt')
['139-28-4313', '706-02-6945', '595-74-5767', '972-87-1379']
>>>
>>>extract_col('help.txt',0)
['115', '135', '143', '155']

Not sure if that helps at all; I hope it does. The version below worked fine with the sample of yours that I had, but it could be much cleaner. I was just messing around with it.

def extract_col(file,col):
    column=[]
    with open(file) as f:
        for line in f:
            chars=[]
            line=line.split('\t')
            for char in line:
                if char not in ['',' ']:
                    chars.append(char)
            chars=' '.join(chars)
            chars=chars.split()
            column.append(chars[col])
    return column

output

extract_col('help2.txt',4)
['pph2_prob', '0.999', '0.999', '0.997']
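
Since the result above still contains the header, here is a small sketch of a variant that drops it and converts the values to numbers. The name extract_values is made up for illustration, and the sketch assumes white-space separated data with no multi-word values and a header on the first line.

```python
def extract_values(path, col):
    """Return column `col` (counting from 0) as floats, skipping the header line."""
    with open(path) as f:
        next(f, None)  # discard the header line (no error on an empty file)
        # line.strip() is empty for blank lines, which are skipped
        return [float(line.split()[col]) for line in f if line.strip()]


# Example call (file name taken from the post above):
# values = extract_values('help2.txt', 4)
```

Slicing the earlier result with [1:] would also remove the header, but the values would remain strings.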

sofia85

Thank you so much. I managed to solve my problem!

Question Answered as of 2 Years Ago by pyTony, richieking and pyguy62

pyguy62

Thank you so much. I managed to solve my problem!

Can you show us your solution?
