
Parsing a text file in multiple lines

 
 

Hello all, I want to convert a file in an EMBL-like format to FASTA. I cannot use BioPerl because the file is not in complete EMBL format, so please suggest how to get this done.

ID   013789-0068 
PS   TBD 
OO   huringiensis 
OS   ringiensis 
OX 
SI   68 
RA 
RL   2010. OKAYAMA UNIVERSITY,JAPAN LAMB CO LTD 
FT   source          1..1176 
MT 
AC   67106 
SV 
CT 
PN   013789 
PT   PROTEIN PRODUCTION METHOD, FUSION PROTEIN, AND ANTISERUM 
PA   AMA UNIVERSITY,JAPAN LAMB CO LTD. 
PI   HAYAKAWA TORU (JP) SAKAI, HIROSHI, HAYAKAWA, TORU 
P8 
P4   10013789 
P5   0 
PC   International Classification: \nUS Classification: \nEuropean Classification: C12N15/62; C07K14/47A25 
PR   80199166; 
PE   199166 
AN   09JP63603 
KC   1 
P1   ng the DNA into a host bacterium to transform the host bacterium; and (c) causing the expression of the fusion protein in the transformed host bacterium.; The method may further comprise a step of removing the peptide chain (B) from the fusion protein. \n \n 
P7 
P9   112 
PO 
PM   10013789; 
PB   10013789 
PQ   10013789; 
EM   esentative 
W1   PRT 
D1   0204 
D2   0217 
D3   0730 
D4   0801 
D5   0204 
HL   [L[P9_GQ;0;3,WO2010013789,45,67]] [L[PM_PN_GQNUC;0;12,WO2010013789]] [L[PQ_PN_GQNUC;0;12,WO2010013789]] 
CC   mer C1-1-f FH   Key             Location/Qualifiers Copyright (c)Inc. 2011 
LS   Application 
L2   Publ. Of int. appl. w4 
 
  MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQI 
  EQLINQRIEEFARNQAISRLEGLSNLYQIYAESFREWEADPTNPALREEMRIQFNDMNSALTTAIPLLAVQNYQVPLLSV 
  YVQAANLHLSVLRDVSVFGQRWGFDAATINSRYNDLTRLIGNYTDYAVRWYNTGLERVWGPDSRDWVRYNQFRRELTLTV 
  LDIVALFSNYDSRRYPIRTVSQLTREIYTNPVLENFDGSFRGMAQRIEQNIRQPHLMDILNSITIYTDVHRGFNYWSGHQ 
  ITASPVGFSGPEFAFPLFGNAGNAAPPVLVSLTGLGIFRTLSSPLYRRIILGSGPNNQELFVLDGTEFSFASLTTNLPST 
  IYRQRGTVDSLDVIPPQDNSVPPRAGFSHRLSHVTMLSQAAGAVYTLRAPTFSWQHRSAEFNNIIPSSQITQIPLTKSTN 
  LGSGTSVVKGPGFTGGDILRRTSPGQISTLRVNITAPLSQRYRVRIRYASTTNLQFHTSIDGRPINQGNFSATMSSGSNL 
  QSGSFRTVGFTTPFNFSNGSSVFTLSAHVFNSGNEVYIDRIEFVPAEVTFEAEYDLERAQKAVNELFTSSNQIGLKTDVT 
  DYHIDQVSNLVECLSDEFCLDEKQELSEKVKHAKRLSDERNLLQDPNFRGINRQLDRGWRGSTDITIQGGDDVFKENYVT 
  LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC 
  APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR 
 
// 
 
ID   0223489-0068 
PS   TBD 
OO   huringiensis 
OS   ringiensis 
OX 
SI   68 
RA 
RL   2010. OKAYAMA UNIVERSITY,JAPAN LAMB CO LTD 
FT   source          1..1176 
MT 
AC   67106 
SV 
CT 
PN   013789 
PT   PRN METHOD, FUSION PROTEIN, AND ANTISERUM 
PA   AMERSITY,JAMB CO LTD. 
PI   HAYAKAWA TORU (JP) SAKAI, HIROSHI, HAYAKAWA, TORU 
P8 
P4   10013789 
P5   0 
PC   International Classification: \nUS Classification: \nEuropean Classification: C12N15/62; C07K14/47A25 
PR   80199166; 
PE   199166 
AN   09JP63603 
KC   1 
P1   ng the DNA into a host bacterium to transform the host bacterium; and (c) causing the expression of the fusion protein in the transformed host bacterium.; The method may further comprise a step of removing the peptide chain (B) from the fusion protein. \n \n 
P7 
P9   112 
PO 
PM   10013789; 
PB   10013789 
PQ   10013789; 
EM   esentative 
W1   PRT 
D1   0204 
D2   0217 
D3   0730 
D4   0801 
D5   0204 
HL   [L[P9_GQ;0;3,WO2010013789,45,67]] [L[PM_PN_GQNUC;0;12,WO2010013789]] [L[PQ_PN_GQNUC;0;12,WO2010013789]] 
CC   mer C1-1-f FH   Key             Location/Qualifiers Copyright (c)Inc. 2011 
LS   Application 
L2   Publ. Of int. appl. w4 
 
  VLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQI 
  EQLINQRIEEFARNQAISRLEGLSNLYQIYAESFREWEADPTNPALREEMRIQFNDMNSALTTAIPLLAVQNYQVPLLSV

The output should be in FASTA format: a header line built from the ID, PT and PA lines, followed by the sequence. The "//" lines (two slashes) separate one EMBL record from the next.

>013789-0068 ;  PROTEIN PRODUCTION METHOD, FUSION PROTEIN, AND ANTISERUM PA ;   AMA UNIVERSITY,JAPAN LAMB CO LTD. 
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQI 
  EQLINQRIEEFARNQAISRLEGLSNLYQIYAESFREWEADPTNPALREEMRIQFNDMNSALTTAIPLLAVQNYQVPLLSV 
  YVQAANLHLSVLRDVSVFGQRWGFDAATINSRYNDLTRLIGNYTDYAVRWYNTGLERVWGPDSRDWVRYNQFRRELTLTV 
  LDIVALFSNYDSRRYPIRTVSQLTREIYTNPVLENFDGSFRGMAQRIEQNIRQPHLMDILNSITIYTDVHRGFNYWSGHQ 
  ITASPVGFSGPEFAFPLFGNAGNAAPPVLVSLTGLGIFRTLSSPLYRRIILGSGPNNQELFVLDGTEFSFASLTTNLPST 
  IYRQRGTVDSLDVIPPQDNSVPPRAGFSHRLSHVTMLSQAAGAVYTLRAPTFSWQHRSAEFNNIIPSSQITQIPLTKSTN 
  LGSGTSVVKGPGFTGGDILRRTSPGQISTLRVNITAPLSQRYRVRIRYASTTNLQFHTSIDGRPINQGNFSATMSSGSNL 
  QSGSFRTVGFTTPFNFSNGSSVFTLSAHVFNSGNEVYIDRIEFVPAEVTFEAEYDLERAQKAVNELFTSSNQIGLKTDVT 
  DYHIDQVSNLVECLSDEFCLDEKQELSEKVKHAKRLSDERNLLQDPNFRGINRQLDRGWRGSTDITIQGGDDVFKENYVT 
  LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC 
  APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR 
 
>0223489-0068 ; PRN METHOD, FUSION PROTEIN, AND ANTISERUM PA  ; AMERSITY,JAMB CO LTD. 
VLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIMNSALTTAIPLLAVQREEMRIQLE 
  EQLINQRIEEFARNQAISRLEGLSNLYQIYAESFREWEADPTNPALREEMRIQFNDMNSALTTAIPLLAVQNYQVPLLSV 
  LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC 
  APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR
 
 

You can use sed or awk.
awk is perhaps more readable if you're new to shell scripting.

Write one regular expression (RE) each for the ID, PT, PA and sequence lines, and print the output in the action body of each rule.
Something like:

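awk '
    /^ID /     { id = $2 }                                   # record identifier
    /^PT /     { pt = $0; sub(/^PT +/, "", pt) }             # title, tag stripped
    /^PA /     { pa = $0; sub(/^PA +/, "", pa); print ">" id " ; " pt " ; " pa }   # header (PA comes after PT)
    /^ +[A-Z]/ { sub(/^ +/, ""); print }                     # indented sequence lines
' input.txt    # input.txt is a placeholder file name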

OR

Start your processing with an RE that matches the first line of a record, /^ID / in this case, I think. Then, in the body, use getline to keep reading input and printing it in the required format until you reach the end of the sequence.
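A rough sketch of this getline variant, in case it helps. The function name embl_to_fasta_getline and the trim helper are made up for illustration, and the sketch assumes that only sequence lines are indented and that each record ends at a "//" line or at end of input. It collects the sequence first and prints the header plus sequence once the record is complete.

```shell
# embl_to_fasta_getline: hypothetical helper name; reads the EMBL-like text on stdin.
embl_to_fasta_getline() {
    awk '
        # Strip leading and trailing blanks from a string.
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }

        # A record starts at its ID line; from there we read lines by hand.
        /^ID / {
            id = $2; pt = ""; pa = ""; seq = ""
            # Stop at the "//" separator or when input runs out.
            while ((getline line) > 0 && line !~ /^\/\//) {
                if (line ~ /^PT /)          pt = trim(substr(line, 3))
                else if (line ~ /^PA /)     pa = trim(substr(line, 3))
                else if (line ~ /^ +[A-Z]/) seq = seq trim(line) "\n"
            }
            # Header line, then the collected sequence lines.
            printf ">%s ; %s ; %s\n%s", id, pt, pa, seq
        }
    '
}
```

It would be called as embl_to_fasta_getline < input.txt > output.fasta, where both file names are placeholders.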

The first approach is preferable to the second, because the second one does by hand the line-by-line reading that awk would already do for you.

PS: getline (written without parentheses) may not be available in very old awks, but it comes with gawk and nawk for sure.
