Hi,

I'm a beginner at Python and I'm trying to extract a specific column from a txt file.

In the file I want to extract the entire column pph2_prob (i.e. column 16). But I want to get all the values from that column without the headline pph2_prob.

How do I accomplish that?

This is part of what the file looks like (this excerpt contains 2 data rows and 20 columns):

#o_snp_id o_acc o_pos o_aa1 o_aa2 snp_id acc pos aa1 aa2 nt1 nt2 prediction based_on effect pph2_class pph2_prob pph2_FPR pph2_TPR pph2_FDR
BSND_M1I Q8WZ55 1 M I BSND_M1I Q8WZ55 1 M I G T probably damaging alignment deleterious 0.999 0.00692 0.111 0.0222
BSND_M1K Q8WZ55 1 M K BSND_M1K Q8WZ55 1 M K T A probably damaging alignment deleterious 0.999 0.00692 0.111 0.0222


Best,
Sofia


Sofia, can you please show the data in an organized form? Working with what you have posted here is virtually impossible.

nt1 nt2 pred_effect pph2_class pph2_prob pph2_FPR pph2_TPR pph2_FDR
G T prob_damaging deleterious 0.999 0.00692 0.111 0.0222
T A prob_damaging deleterious 0.999 0.00692 0.111 0.0222
A T prob_damaging deleterious 0.997 0.0208 0.332 0.0665

I have shortened down the original file (which contains both more columns and rows). But the general problem is how to extract the data underneath column pph2_prob, without the header pph2_prob.

Best,
Sofia

As you can see now, the small table has 4 rows and 8 columns.

nt1  nt2   pred_effect   pph2_class   pph2_prob  pph2_FPR    pph2_TPR   pph2_FDR
G    T	 prob_damaging	 deleterious	0.999	   0.00692	0.111	0.0222
T    A	 prob_damaging	 deleterious	0.999	   0.00692	0.111	0.0222
A    T	 prob_damaging	 deleterious	0.997	    0.0208	0.332	0.0665


You must use code tags for data as well, to preserve whitespace.

Your header does not align exactly with the data columns. I split on tabs when a line contains tabs, so that a column can hold multi-word values containing spaces; otherwise I split on any whitespace.

to_extract = 'pph2_prob'
# smaller sample data without tabs and original post data
for fn in ('genetic.txt', 'genetic2.txt'):
    print('Extracting from %r' % fn)
    with open(fn) as data:
        # analyze header
        header = next(data)
        # first post had tab separation, second not. Adapt to situation
        header = [h.strip() for h in header.split('\t' if '\t' in header else None)]
        print(header)
        ind = header.index(to_extract)
        print('Extracting value %i from each column' % ind)
        # tab separation if exist, otherwise all white space splits
        pph2_prob = [float(line.split('\t' if '\t' in line else None)[ind]) for line in data]
        print(pph2_prob)
        print('')
        
"""Output:
Extracting from 'genetic.txt'
['nt1', 'nt2', 'pred_effect', 'pph2_class', 'pph2_prob', 'pph2_FPR', 'pph2_TPR', 'pph2_FDR']
Extracting value 4 from each column
[0.00692, 0.00692, 0.0208]

Extracting from 'genetic2.txt'
['#o_snp_id', 'o_acc', 'o_pos', 'o_aa1', 'o_aa2', 'snp_id', 'acc', 'pos', 'aa1', 'aa2', 'nt1', 'nt2', 'prediction', 'based_on', 'effect', 'pph2_class', 'pph2_prob', 'pph2_FPR', 'pph2_TPR', 'pph2_FDR']
Extracting value 16 from each column
[0.999, 0.999]
"""
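
To illustrate why the choice of split character matters, here is a minimal sketch. The sample line is made up for illustration (modeled on the data in the first post, not copied from the real file). Splitting on any whitespace breaks a multi-word value such as "probably damaging" into two fields and shifts every later index, whereas splitting on tabs keeps it whole.

```python
# Hypothetical tab-separated row, modeled on the data in the first post.
line = "G\tT\tprobably damaging\tdeleterious\t0.999\t0.00692\n"

# split() with no argument splits on any run of whitespace.
by_whitespace = line.split()

# Splitting on tabs only keeps multi-word values in one field.
by_tab = line.rstrip("\n").split("\t")

print(by_whitespace[4])  # 'deleterious': shifted by the extra word
print(by_tab[4])         # '0.999'
```

So if the index looked up in the header and the index in the data rows come from different splitting rules, the extracted column can be off by one or more positions.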

Hi,
thank you for your help, but it still doesn't work. Are you using both of the files I posted in this thread in the code? Also, do I put my directory in the for loop "for fn in ('genetic.txt', 'genetic2.txt'):" instead of genetic.txt, or just the file name? Right now I get an error message saying "with open(fn) as data: IOError: [Errno 21] Is a directory: '/'". What am I doing wrong?

Best,
Anna

If the files are in the same directory as the script, the file names suffice; otherwise you must use the full path, or change to that directory with os.chdir before the loop.
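
A minimal sketch of both options follows. The directory and file here are placeholders created only so the example runs; they are not paths from this thread. Note that open() needs the path of a file, not of a directory, which is what "Errno 21" is complaining about.

```python
import os
import tempfile

# Placeholder directory and file so the example is runnable.
data_dir = tempfile.mkdtemp()
full_path = os.path.join(data_dir, 'genetic.txt')
with open(full_path, 'w') as out:
    out.write('pph2_prob\n0.999\n')

# Option 1: pass the full path of the file to open().
with open(full_path) as data:
    header = next(data).strip()

# Option 2: change into the directory, then use the bare file name.
os.chdir(data_dir)
with open('genetic.txt') as data:
    first_line = next(data).strip()

print(header, first_line)
```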

Also, I tried this with a file someone else posted here a couple of days ago. It had a different format, but I feel the idea is the same; the way I broke it down is just different because of the format:

File:
115     139-28-4313     1056.30
135     706-02-6945      -99.06
143   595-74-5767     4289.07
155     972-87-1379     3300.26
#codes
def extract_second(file):
    col_2 = []
    with open(file) as f:
        for line in f:
            # split() with no argument splits on any run of whitespace
            # and also drops the trailing newline
            fields = line.split()
            if fields:  # skip blank lines
                col_2.append(fields[1])
    return col_2



def extract_col(file, col):
    columns = []
    with open(file) as f:
        for line in f:
            # same idea, but for an arbitrary zero-based column index
            fields = line.split()
            if fields:  # skip blank lines
                columns.append(fields[col])
    return columns
#results
>>> extract_second('help.txt')
['139-28-4313', '706-02-6945', '595-74-5767', '972-87-1379']
>>> extract_col('help.txt', 0)
['115', '135', '143', '155']

Not sure if that helps at all; I hope it does. This also worked fine with the data of yours that I had, but it could be much cleaner. I was just messing around with it.

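def extract_col(file,col):
    column=[]
    with open(file) as f:
        for line in f:
            chars=[]
            # split on tabs first, dropping empty or single-space fields
            line=line.split('\t')
            for char in line:
                if char not in ['',' ']:
                    chars.append(char)
            # re-join and split on any whitespace to strip padding and the newline
            chars=' '.join(chars)
            chars=chars.split()
            column.append(chars[col])
    return column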

output

extract_col('help2.txt',4)
['pph2_prob', '0.999', '0.999', '0.997']
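
Since the question asked for the values without the headline, here is a small sketch of one way to finish this off: drop the first element and convert the rest to floats. The list below is just the output shown above, pasted in.

```python
# The list returned by extract_col('help2.txt', 4) above.
column = ['pph2_prob', '0.999', '0.999', '0.997']

# Skip the header (first element) and convert the remaining strings to numbers.
values = [float(v) for v in column[1:]]
print(values)  # [0.999, 0.999, 0.997]
```

This assumes the header is always the first line of the file and every remaining entry is numeric; float() raises a ValueError otherwise.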

Thank you so much. I managed to solve my problem!


Can you show us your solution?
