
extract column from text file

 

Hi,

I'm a beginner at python and I'm trying to extract a specific column from a txt file.

In the file I want to extract the entire column pph2_prob (i.e. column 16), but I want only the values from that column, without the header pph2_prob.

How do I accomplish that?

This is part of what the file looks like (the excerpt shows the header plus 2 data rows, with 20 columns):

#o_snp_id o_acc o_pos o_aa1 o_aa2 snp_id acc pos aa1 aa2 nt1 nt2 prediction based_on effect pph2_class pph2_prob pph2_FPR pph2_TPR pph2_FDR
BSND_M1I Q8WZ55 1 M I BSND_M1I Q8WZ55 1 M I G T probably damaging alignment deleterious 0.999 0.00692 0.111 0.0222
BSND_M1K Q8WZ55 1 M K BSND_M1K Q8WZ55 1 M K T A probably damaging alignment deleterious 0.999 0.00692 0.111 0.0222


Best,
Sofia

 

Sofia, can you please post the data in a more organized form? Working with what you have shown here is virtually impossible.

 

nt1 nt2 pred_effect pph2_class pph2_prob pph2_FPR pph2_TPR pph2_FDR
G T prob_damaging deleterious 0.999 0.00692 0.111 0.0222
T A prob_damaging deleterious 0.999 0.00692 0.111 0.0222
A T prob_damaging deleterious 0.997 0.0208 0.332 0.0665

I have shortened the original file (which contains more columns and more rows). The general problem is how to extract the data underneath the column pph2_prob, without the header pph2_prob.

Best,
Sofia

 

As you can see now, the small table has 4 rows (including the header) and 8 columns.

 
You must use code tags for data as well, to preserve white space.

 

Your header does not align exactly with the data columns. I split on tabs when a line contains tabs, so that multi-word fields containing spaces stay in one column; otherwise I split on any white space.

to_extract = 'pph2_prob'
# smaller sample data without tabs, and the original post data
for fn in ('genetic.txt', 'genetic2.txt'):
    print('Extracting from %r' % fn)
    with open(fn) as data:
        # analyze header
        header = next(data)
        # first post had tab separation, second not; adapt to the situation
        header = [h.strip() for h in header.split('\t' if '\t' in header else None)]
        print(header)
        ind = header.index(to_extract)
        print('Extracting value %i from each column' % ind)
        # tab separation if it exists, otherwise split on any white space;
        # blank lines are skipped so float() never sees an empty row
        pph2_prob = [float(line.split('\t' if '\t' in line else None)[ind])
                     for line in data if line.strip()]
        print(pph2_prob)
        print('')
        
"""Output:
Extracting from 'genetic.txt'
['nt1', 'nt2', 'pred_effect', 'pph2_class', 'pph2_prob', 'pph2_FPR', 'pph2_TPR', 'pph2_FDR']
Extracting value 4 from each column
[0.00692, 0.00692, 0.0208]

Extracting from 'genetic2.txt'
['#o_snp_id', 'o_acc', 'o_pos', 'o_aa1', 'o_aa2', 'snp_id', 'acc', 'pos', 'aa1', 'aa2', 'nt1', 'nt2', 'prediction', 'based_on', 'effect', 'pph2_class', 'pph2_prob', 'pph2_FPR', 'pph2_TPR', 'pph2_FDR']
Extracting value 16 from each column
[0.999, 0.999]
"""
 

Hi,
thank you for your help, but it still doesn't work. Are you using both of the files I posted in this thread in the code? Also, do I put my directory in the for loop "for fn in ('genetic.txt', 'genetic2.txt'):" instead of genetic.txt, or just the file name? Right now I get an error message saying "with open(fn) as data: IOError: [Errno 21] Is a directory: '/'". What am I doing wrong?

Best,
Anna

 

If the files are in the same directory as the script, the file names alone suffice. Otherwise you must use the full path to each file, or change to the data directory with os.chdir before the loop.
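
As a rough sketch of the full-path option: the directory below is a made-up placeholder and data_path is a hypothetical helper, so adjust both to your setup. The error "Is a directory" typically appears when open() is given a directory such as '/' instead of a file, and os.path.isfile catches that before opening.

```python
import os


def data_path(directory, filename):
    """Join a directory and a file name into one path (hypothetical helper)."""
    return os.path.join(directory, filename)


# Hypothetical directory; replace it with wherever the data files live.
data_dir = os.path.expanduser("~/data")

for name in ("genetic.txt", "genetic2.txt"):
    path = data_path(data_dir, name)
    if os.path.isfile(path):
        print("found", path)
    else:
        # Also reached when the path names a directory rather than a file.
        print("missing or not a regular file:", path)
```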

 

Also, I tried this with a file someone else posted here a couple of days ago. It had a different format, but I think the idea is the same; I just broke each line down differently because of the format:

File:
115     139-28-4313     1056.30
135     706-02-6945      -99.06
143     595-74-5767     4289.07
155     972-87-1379     3300.26
#codes
def extract_col(file, col):
    """Return column number col (0-based) of a space-separated file."""
    columns = []
    with open(file) as f:
        for line in f:
            # split() with no argument splits on any run of white space
            # and also drops the trailing newline
            fields = line.split()
            if fields:  # skip blank lines
                columns.append(fields[col])
    return columns


def extract_second(file):
    """Return the second column (index 1)."""
    return extract_col(file, 1)
#results
>>>extract_second('help.txt')
['139-28-4313', '706-02-6945', '595-74-5767', '972-87-1379']
>>>
>>>extract_col('help.txt',0)
['115', '135', '143', '155']

Not sure if that helps at all; I hope it does. The version below worked fine with the data you posted, but it could be much cleaner. I was just messing around with it.

def extract_col(file,col):
    column=[]
    with open(file) as f:
        for line in f:
            chars=[]
            line=line.split('\t')
            for char in line:
                if char not in ['',' ']:
                    chars.append(char)
            chars=' '.join(chars)
            chars=chars.split()
            column.append(chars[col])
    return column

output

extract_col('help2.txt',4)
['pph2_prob', '0.999', '0.999', '0.997']
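
Note that the output above still contains the header 'pph2_prob', which the original question wanted to leave out. One possible way to drop it, sketched here under the assumption that the first line of the file is always the header and the fields are separated by white space (extract_col_values is a hypothetical name, and the sample is the shortened table from this thread written to a temporary file):

```python
import os
import tempfile


def extract_col_values(path, col):
    """Return column `col` (0-based) as floats, skipping the header line."""
    values = []
    with open(path) as f:
        next(f, None)  # discard the header line (assumed to be the first line)
        for line in f:
            fields = line.split()
            if fields:  # ignore blank lines
                values.append(float(fields[col]))
    return values


# Shortened table from the thread, written to a temporary file for the demo.
sample = (
    "nt1 nt2 pred_effect pph2_class pph2_prob pph2_FPR pph2_TPR pph2_FDR\n"
    "G T prob_damaging deleterious 0.999 0.00692 0.111 0.0222\n"
    "T A prob_damaging deleterious 0.999 0.00692 0.111 0.0222\n"
    "A T prob_damaging deleterious 0.997 0.0208 0.332 0.0665\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)

result = extract_col_values(tmp.name, 4)
os.remove(tmp.name)
print(result)
```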
 

Thank you so much. I managed to solve my problem!

Question Answered as of 2 Years Ago by pyTony, richieking and pyguy62
 

Thank you so much. I managed to solve my problem!

can you show us your solution?
