Hi all,

How can I create a CSV file with a header? I have a text file made up of many blocks, each starting with "//" and ending with "//". I have attached a sample file.
I want to use the first column of this text as the CSV header and append the associated values under it. If any of the headers is missing in a block, create it and append.
For example, from the text below I want a CSV with AC, ID, FA, OS, SF, BS and GE as the header and their values under that header. Could anyone help me with this? I have tried doing this with the code at the end, but I am not getting exactly what I want.

//
AC T876837378768
XX
ID T876837378768
XX
DT 16.09.1996 (created); ewi.
CO Copyright (C), Biobase GmbH.
XX
FA MNG345
XX
OS human, Homo sapiens
OC eukaryota; animalia; metazoa; chordata; vertebrata; tetrapoda; mammalia; eutheria; primates
XX
SF similar to MNG;
XX
FF induced by interferon-alpha (15-30'), inhibited by 2-AP;
XX
BS R02116; AAF$CONS; Quality: 6.
BS R03064; HS$GBP_02; Quality: 6; GBP, G000264; human, Homo sapiens.
XX
DR TRANSPATH: MO000026034.
XX
RN [1]; RE0000446.
RX PUBMED: 1901265.
RA Decker T., Lew D. J., Mirkowitch J., Darnell J. E.
RT Cytoplasmic activation of GAF, an IFN-gamma-regulated DNA-binding factor
RL EMBO J. 10:927-932 (1991).
RN [2]; RE0001471.
RX PUBMED: 1833631.
RA Decker T., Lew D. J., Darnell J. E.
RT Two distinct alpha-interferon-dependent signal transduction pathways may contribute to activation of transcription of the guanylate-binding protein gene
RL Mol. Cell. Biol. 11:5147-5153 (1991).
XX
//

tfid,fa,os,ge,osm,ins,inm = "","","","","","",""
for line in f1 :
    r1 = line.split()
    if line.startswith("ID"):
        tfid = r1[1]
        #print a
    if line.startswith("FA"):
        fa = r1[1]
        #print b
    if line.startswith("OS")and line.endswith("sapiens\n"):
        os = " ".join(r1[1:])
        #print os
    if line.startswith("GE"):
        ge = " ".join(r1[1:3])
        #print ge
    if line.startswith("OS")and line.endswith("Mammalia\n"):
        osm  = r1[1]
         #print c
    if line.startswith("IN") and line.endswith("sapiens.\n"):
        ins ="\t".join(r1[1:3])
        #print g
    if line.startswith("IN") and line.endswith("Mammalia.\n"):
        inm = "\t".join(r1[1:])
    if line.startswith("//"):
        tftable = os+"\t"+tfid+"\t"+fa+"\t"+"\t"+ge+"\t"+osm+"\t"+ins+"\t"+inm+"\n"
        
        
        #tfid,fa,os,ge,osm,ins,inm = "","","","","","",""

Can you provide a sample file?

Your sample doesn't even have the GE tag.

I changed the second BS tag to GE.

f_in = open('blocks.txt').read()
f_out = open('output.csv', 'w')
f_out.write('AC\tID\tFA\tOS\tSF\tBS\tGE\n')

blocks = [x for x in f_in.split('//') if x]
for item in blocks:
    infos = [x for x in item.split('\n') if x and x != 'XX']
    for field in infos:
        if field.startswith('AC'):
            f_out.write('%s\t' % field[3:])
        elif field.startswith('ID'):
            f_out.write('%s\t' % field[3:])
        elif field.startswith('FA'):
            f_out.write('%s\t' % field[3:])
        elif field.startswith('OS'):
            f_out.write('%s\t' % field[3:])
        elif field.startswith('SF'):
            f_out.write('%s\t' % field[3:])
        elif field.startswith('BS'):
            f_out.write('%s\t' % field[3:])
        elif field.startswith('GE'):
            f_out.write('%s\t\n' % field[3:])

f_out.close()

Edited 6 Years Ago by Beat_Slayer: n/a

Here is my preprocessing routine, which puts the data in a dict. From there it is simple to write the data to the file opened as of, using of.write().

filename='biodata.txt'
datadict= dict()

with open('data.csv','w') as of:
    data = ((ind,textline.strip().split(' ',1))
            for ind,block in enumerate(open(filename).read().split('//'))
            for textline in block.split('XX')
            if ' ' in textline)
    for ind,(key,info) in data:
        datadict[ind,key]=info.splitlines()
    for d,value in  datadict.items():
        # print() with a single argument works in both Python 2 and Python 3
        print("datadict%s = %s" % (list(d),''.join(value)))
    ## outputting to of here
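
One way the grouping step could look before any output is written, as a rough sketch rather than the poster's actual code. It assumes Python 3, that a block separator is a line containing only "//", and that "XX" lines are spacers. The function name parse_blocks and the shortened sample values (T00001, T00002) are made up for illustration.

```python
def parse_blocks(lines):
    """Group tagged lines into one dict per block (tag -> list of values)."""
    records, record = [], {}
    for line in lines:
        line = line.strip()
        if line == '//':
            # A separator line closes the current block, if it has any data.
            if record:
                records.append(record)
            record = {}
            continue
        tag, _, value = line.partition(' ')
        if tag == 'XX' or not value:
            continue  # skip spacer lines and lines without a value
        record.setdefault(tag, []).append(value.strip())
    if record:
        records.append(record)  # file did not end with a separator
    return records


# Hypothetical, shortened sample in the same layout as the posted file.
sample = """//
AC T00001
XX
BS R02116; Quality: 6.
BS R03064; Quality: 6.
//
AC T00002
XX
GE G000264
//
"""
records = parse_blocks(sample.splitlines())
```

Each record keeps repeated tags such as BS as a list, and a tag that is absent from a block is simply missing from that block's dict, so the output step can substitute an empty cell for it.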

Python usually has ways of replacing a bunch of if/elif/else statements.

    for field in infos:
        if field.startswith('AC'):
            f_out.write('%s\t' % field[3:])
        elif field.startswith('ID'):
            f_out.write('%s\t' % field[3:])
        elif field.startswith('FA'):
            f_out.write('%s\t' % field[3:])
        elif field.startswith('OS'):
            f_out.write('%s\t' % field[3:])
        elif field.startswith('SF'):
            f_out.write('%s\t' % field[3:])
        elif field.startswith('BS'):
            f_out.write('%s\t' % field[3:])
        elif field.startswith('GE'):
            f_out.write('%s\t\n' % field[3:])

    ## ---------  replace with  ----------
    for field in infos:
        test_2 = field[0:2]
        if test_2 in ["AC", "ID", "FA", "OS", "SF", "BS"]:
            f_out.write('%s\t' % field[3:])
        elif test_2 == "GE":
            ## GE is the last column, so it also ends the row
            f_out.write('%s\t\n' % field[3:])

Edited 6 Years Ago by woooee: n/a

Hi, some of the blocks do not have a GE tag.

It is not giving the expected output. It shows only the headers and the first row, and the header does not match the entries in the block.

I believe the problem lies in the input file.

It works here with the sample file you provided.

Can you provide more data and info?

Cheers

Hi, please find attached a sample file with more data in it. The actual file looks exactly like this, with more than 5000 entries.

Thanks.


It works with your file again. :)

f_in = open('blocks.txt').read()
f_out = open('output.csv', 'w')

f_out.write('AC\tID\tFA\tOS\tSF\tBS\tGE\n')

blocks = [x for x in f_in.split('//') if x]

for item in blocks:
    infos = [x for x in item.split('\n') if x and x != 'XX']
    AC = ''
    ID = ''
    FA = ''
    OS = ''
    SF = ''
    BS = ''
    GE = ''
    for field in infos:
        if field.startswith('AC'):
            AC += ' ' + field[3:]
        elif field.startswith('ID'):
            ID += ' ' + field[3:]
        elif field.startswith('FA'):
            FA += ' ' + field[3:]
        elif field.startswith('OS'):
            OS += ' ' + field[3:]
        elif field.startswith('SF'):
            SF += ' ' + field[3:]
        elif field.startswith('BS'):
            BS += ' ' + field[3:]
        elif field.startswith('GE'):
            GE += ' ' + field[3:]

    f_out.write('%s\t%s\t%s\t%s\t%s\t%s\t%s\n' % (AC, ID, FA, OS, SF, BS, GE))
    
f_out.close()
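
The seven separate variables could also be collapsed into one dict keyed by tag. The following is only a sketch of that idea, assuming Python 3 and the same line layout (a two-letter tag, a space, then the value). The helper name block_to_row is made up for illustration.

```python
TAGS = ['AC', 'ID', 'FA', 'OS', 'SF', 'BS', 'GE']


def block_to_row(block):
    """Collect the wanted tags of one block into a tab-separated row."""
    values = dict((tag, []) for tag in TAGS)
    for field in block.split('\n'):
        tag = field[:2]
        if tag in values:
            values[tag].append(field[3:].strip())
    # A tag that never appeared contributes an empty cell, so columns stay aligned.
    return '\t'.join(' '.join(values[tag]) for tag in TAGS)


# One block taken from the sample in the question (SF and GE left out on purpose).
block = ("\nAC T876837378768\nXX\nID T876837378768\nXX\nFA MNG345\nXX\n"
         "OS human, Homo sapiens\nXX\n"
         "BS R02116; AAF$CONS; Quality: 6.\n"
         "BS R03064; HS$GBP_02; Quality: 6.\n")
row = block_to_row(block)
```

Joining the collected values with ' '.join also avoids the leading space that the += ' ' + ... pattern puts in front of every cell.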

Cheers and Happy coding

Edited 6 Years Ago by Beat_Slayer: n/a

Hiiii...

Perfect!!!! It's working.

Saving the output as a .csv file does not give proper output, but saving it as a .txt file works well.

Thanks a lot for your help!
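
A likely reason for the .csv versus .txt difference is that the script writes tab-separated values, while many programs expect commas in a file named .csv. If a real comma-separated file is wanted, the standard csv module can do the quoting. This is a small sketch assuming Python 3; the helper name to_csv_text is made up, and the row reuses values from the sample block.

```python
import csv
import io

header = ['AC', 'ID', 'FA', 'OS', 'SF', 'BS', 'GE']
row = ['T876837378768', 'T876837378768', 'MNG345', 'human, Homo sapiens',
       'similar to MNG;', 'R02116; AAF$CONS; Quality: 6.', '']


def to_csv_text(rows):
    """Render rows as comma-separated text, quoting cells that contain commas."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    writer.writerows(rows)
    return buf.getvalue()


csv_text = to_csv_text([header, row])
```

To write to disk instead, pass a file opened with open('output.csv', 'w', newline='') to csv.writer. Alternatively, keep the tabs with csv.writer(f, delimiter='\t') and open the result as tab-delimited text.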
