A protein entry in the Swiss-Prot database looks something like this:

ID ATF6A_HUMAN
AC P18850; O15139; Q5VW62; Q6IPB5; Q9UEC9;
DT 01-NOV-1990, integrated into UniProtKB/Swiss-Prot.
DE AltName: Full=Activating transcription factor 6 alpha;
DE Short=ATF6-alpha;
OS Homo sapiens (Human).
RN [10]
RP REVIEW.
RX MEDLINE=21376119; PubMed=11483355; DOI=10.1016/S0378-1119(01)00551-0;
RA Hai T., Hartman M.G.;
RT "The molecular biology and nomenclature of the activating
RT transcription factor/cAMP responsive element binding family of
RT transcription factors: activating transcription factor proteins and
RT homeostasis.";
RL Gene 273:1-11(2001).
DR NextBio; 43639; -.
DR PMAP-CutDB; P18850; -.
DR ArrayExpress; P18850; -.
DR Bgee; P18850; -.
DR CleanEx; HS_ATF6; -.
DR GermOnline; ENSG00000118217; Homo sapiens.
DR Pfam; PF00170; bZIP_1; 1.
DR PROSITE; PS50217; BZIP; 1.
DR PROSITE; PS00036; BZIP_BASIC; 1.
PE 1: Evidence at protein level;
KW Activator; Complete proteome; DNA-binding; Endoplasmic reticulum;
KW Glycoprotein; Membrane; Nucleus; Phosphoprotein; Polymorphism;
KW Signal-anchor; Transcription; Transcription regulation; Transmembrane;
KW Unfolded protein response.
//

I want to parse the protein database and extract the protein names and Pfam numbers of all entries that are human transcription factors and are also DNA-binding, ignoring all other information.
My protocol is as follows:
Find the "ID" line.
Find the "OS" line and check whether it contains the word "Human".
Check whether there is a Pfam number in the "DR" lines.
Check whether the terms "transcription" and "DNA-binding" both appear in the "KW" lines.
If all three conditions are met, print the result like this:
$ID, $Pfam_number1, $Pfam_number2,... (if more Pfam numbers exist)

It should be fairly easy, but I am not sure how to write a script for this, and it has puzzled me for days. Can anyone please help?
I don't need a full script; an outline of the main structure would be very helpful.

Thanks.

Hi Jacqueline,
I wasn't sure whether you wanted to check the ID line for 'Human' as well; anyway, I suppose you'll be able to sort that out.
Here's a script that should do what you want. The output will need a little cleaning, though.

Hope this helps.
Reiner

#!perl -w
use strict;
my $entry=<<ENTRY;
ID ATF6A_HUMAN
AC P18850; O15139; Q5VW62; Q6IPB5; Q9UEC9;
DT 01-NOV-1990, integrated into UniProtKB/Swiss-Prot.
DE AltName: Full=Activating transcription factor 6 alpha;
DE Short=ATF6-alpha;
OS Homo sapiens (Human).
RN [10]
RP REVIEW.
RX MEDLINE=21376119; PubMed=11483355; DOI=10.1016/S0378-1119(01)00551-0;
RA Hai T., Hartman M.G.;
RT "The molecular biology and nomenclature of the activating
RT transcription factor/cAMP responsive element binding family of
RT transcription factors: activating transcription factor proteins and
RT homeostasis.";
RL Gene 273:1-11(2001).
DR NextBio; 43639; -.
DR PMAP-CutDB; P18850; -.
DR ArrayExpress; P18850; -.
DR Bgee; P18850; -.
DR CleanEx; HS_ATF6; -.
DR GermOnline; ENSG00000118217; Homo sapiens.
DR Pfam; PF00170; bZIP_1; 1.
DR PROSITE; PS50217; BZIP; 1.
DR PROSITE; PS00036; BZIP_BASIC; 1.
PE 1: Evidence at protein level;
KW Activator; Complete proteome; DNA-binding; Endoplasmic reticulum;
KW Glycoprotein; Membrane; Nucleus; Phosphoprotein; Polymorphism;
KW Signal-anchor; Transcription; Transcription regulation; Transmembrane;
KW Unfolded protein response.
ENTRY
my $DEBUG=0;
my @array=split /\n/, $entry;
#print scalar @array;
my @searchTerms=qw/transcription DNA-binding/;
my ($id)=grep /^ID\b/, @array;#assuming there's just one ID line in the record
my ($os)=grep /^OS\b/, @array;#anchored so that e.g. "DR PROSITE" lines cannot match; assuming just one OS line
if ($id=~/HUMAN/i or $os=~/HUMAN/i){
	print "$id : $os is human.\n"  if $DEBUG;
	
	#get DR lines
	my @dr=grep /^DR\b/, @array;
	#search for pfam
	my @pfams=();
	foreach my $dr (@dr){
		if ($dr=~/^DR\s+Pfam;/i){
			push @pfams, $dr;
		}
	}
	if(scalar @pfams >0){ #next step if we found some
		print "found pfams @pfams \n"  if $DEBUG;
		#get the kw lines
		my @kw=grep /^KW\b/,@array;
		my %found=(); #search terms seen so far
		foreach my $kw (@kw){
			foreach my $term (@searchTerms){
				if ($kw=~/\Q$term\E/i){
					$found{$term}=1;
					print "found term $term\n" if $DEBUG;
				}
			}
		}
		#check that we found every search term (transcription AND DNA-binding)
		if(scalar(keys %found) == scalar @searchTerms){
			#yes, thus print output line
			print "$id @pfams \n";
		}
	}
}
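
The script above checks a single hard-coded entry. As a rough sketch of how the same three checks might be run over a whole flat file, the version below reads one entry at a time by setting Perl's input record separator $/ to the "//" terminator line. The names pfam_line_for_entry and scan_file are made up for this illustration, and the sketch assumes Unix line endings and the two-letter line codes shown in the entry above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes one Swiss-Prot entry as a string and returns
# "ID, Pfam1, Pfam2,..." or undef if the entry does not meet all conditions.
sub pfam_line_for_entry {
    my ($entry) = @_;
    my @lines = split /\n/, $entry;

    # Line codes sit at the start of a line, so every match is anchored with ^
    my ($id) = map { /^ID\s+(\S+)/ ? $1 : () } @lines;
    return unless defined $id;

    # Condition 1: an OS line mentions "human"
    return unless grep { /^OS\b.*\bhuman\b/i } @lines;

    # Condition 2: at least one Pfam cross-reference among the DR lines
    my @pfams = map { /^DR\s+Pfam;\s*(PF\d+)/ ? $1 : () } @lines;
    return unless @pfams;

    # Condition 3: both keywords appear somewhere in the KW lines
    my $kw = join ' ', grep { /^KW\b/ } @lines;
    return unless $kw =~ /transcription/i && $kw =~ /DNA-binding/i;

    return join ', ', $id, @pfams;
}

# Hypothetical helper: reads a flat file entry by entry and collects the hits.
sub scan_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/ = "//\n";    # each read now returns one whole entry
    my @hits;
    while (my $entry = <$fh>) {
        my $line = pfam_line_for_entry($entry);
        push @hits, $line if defined $line;
    }
    close $fh;
    return @hits;
}

# Usage: pass the path to the flat file as the first argument
if (@ARGV) {
    print "$_\n" for scan_file($ARGV[0]);
}
```

For the example entry this should give "ATF6A_HUMAN, PF00170". If the file has Windows line endings, the record separator would need adjusting.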

Edited 7 Years Ago by Reiner Dieden: n/a
