FT CDS complement(join ( 14006...14068, 19351..20068))
FT /locus_tag= TP01_0004”
FT /note=”go function: nutrient reservoir activity [goid
FT 0045889]

The above text is being read as a string, and I would like to use regexes as follows:

/^FT \s CDS \s \ / complement[0-9]/ # search line 1
/^FT \s .* \ / locus _tag = (.*)/ # search line 2
/^FT ‘s / \t///#/ ‘ .* \ / note (.*) / #search line 3


It seems not to be working well. Kindly help.

Hi,
Can you clearly explain what you want to do?

Hi Prakash,
What I want to do is:

a) The file with the above info is opened and read as strings.
b) I need a regex that can capture this block of
lines,

FT CDS complement(join ( 14006...14068, 19351..20068))
FT /locus_tag= TP01_0004”
FT /note=”go function: nutrient reservoir activity [goid
FT 0045889]

and copy the last two statements to a new file.
Thanks.

First I would really like to see the output (If any).

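Second, for the line
FT CDS complement(join ( 14006...14068, 19351..20068))
I would recommend:
1. /^FT\s+CDS\s+complement\(join\([0-9.,]+\)\)$/

Basically, we want to escape each ( and ) so that they match literally. Adding + to \s means "match one or more" whitespace characters (in case you have more than one space). Note that spaces typed inside a Perl regex are literal unless the /x modifier is used, so there are none around the \s+. Likewise, the + after the character class means "match at least one character from the class, possibly more"; the class has to include the dot and the comma as well as the digits, because the coordinates are ranges separated by dots and commas. This assumes the real line has no spaces inside the parentheses. I'm not sure whether the line should contain only that, but if so, add a $ at the end to signify the end of the string/line.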
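To make the escaping and quantifier points concrete, here is a minimal sketch. It assumes the coordinates inside join(...) consist only of digits, dots, and commas (hence the class [0-9.,]) and that there are no spaces inside the parentheses. The helper name is_cds_join is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the line is a CDS feature line whose
# location has the form complement(join(...)), otherwise 0.
sub is_cds_join {
    my ($line) = @_;
    # \( and \) match literal parentheses; \s+ matches one or more
    # whitespace characters; [0-9.,]+ matches the coordinate list.
    return $line =~ /^FT\s+CDS\s+complement\(join\([0-9.,]+\)\)$/ ? 1 : 0;
}

for my $line (
    'FT CDS complement(join(18028..18116,19351..20668))',
    'FT CDS complement(20951..21967)',
    'FT mRNA complement(join(<18028..18116,19351..>20668))',
) {
    printf "%-8s %s\n", (is_cds_join($line) ? 'match' : 'no match'), $line;
}
```

Only the first sample line should be reported as a match: the second has no join(...), and the third is an mRNA feature rather than a CDS.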

If that gets you in the right direction let me know if you still can't get the other two.

Hope that helped.

Hi Prakash and Onaclov2000,

Thanks for your answers. I am still having trouble solving my problem. I have posted the structure of the file I am dealing with and the code I have come up with so far, using Onaclov2000's regex, which captures only the first line in the file.

The structure of the text file is as follows:

FT CDS complement(7216..17805)
FT /locus_tag="TP01_0003"
FT /codon_start=1
FT /protein_id="XP_765530.1"
FT /db_xref="GI:71031777"
FT /db_xref="GeneID:3502673"
FT gene complement(<7216..>17805)
FT /locus_tag="TP01_0003"
FT /db_xref="GeneID:3502673"
FT mRNA complement(<7216..>17805)
FT /locus_tag="TP01_0003"
FT /product="hypothetical telomeric SfiI fragment 20 protein
FT 3"
FT /transcript_id="XM_760437.1"
FT /db_xref="GI:71031776"
FT /db_xref="GeneID:3502673"
FT CDS complement(join(18028..18116,19351..20668))
FT /locus_tag="TP01_0004"
FT /note="go_function: nutrient reservoir activity [goid
FT 0045735]"

FT /codon_start=1
FT /protein_id="XP_765531.1"
FT /db_xref="GI:71031779"
FT /db_xref="GeneID:3503550"
FT gene complement(<18028..>20668)
FT /locus_tag="TP01_0004"
FT /db_xref="GeneID:3503550"
FT mRNA complement(join(<18028..18116,19351..>20668))
FT /locus_tag="TP01_0004"
FT /product="hypothetical protein"
FT /transcript_id="XM_760438.1"
FT /db_xref="GI:71031778"
FT /db_xref="GeneID:3503550"
FT CDS complement(20951..21967)
FT /locus_tag="TP01_0005"
FT /codon_start=1
FT /protein_id="XP_765532.1"
FT /db_xref="GI:71031781"
FT /db_xref="GeneID:3503551"
FT gene complement(<20951..>21967)
FT /locus_tag="TP01_0005"
FT /db_xref="GeneID:3503551"
FT mRNA complement(<20951..>21967)
FT /locus_tag="TP01_0005"
FT /product="hypothetical protein"
FT /transcript_id="XM_760439.1"
FT /db_xref="GI:71031780"
FT /db_xref="GeneID:3503551"


This is my code

#!/usr/bin/perl
$file = 'Muguga.embl ';

open (F, $file) || die ("Could not open $file!");

while ($line = <F>)
{
($field1,$field2,$field3,$field4) = split( "\t" , $line);

print "$field1 $field2 $field3 $field4 \n";
my $string = (FT CDS complement(join(18028..18116,19351..20668))); # string to be searched

if ($string = ~ m/^FT \s+ CDS \s+ complement\(join\([0-9]+\)\)$/)
#search for the first line highlighted in bold
{
print 'match'
} else{
print 'no match';
}
}
close (F);


My wish is to be able to search for the lines that are in bold and print them out.

I would be grateful if you could help with my code, though I am still a Perl newbie.

Thanks.

First things first: looking at your code, you're only searching the string you provide:
my $string = (FT CDS complement(join(18028..18116,19351..20668))); # string to be searched

So obviously you will only ever see that one and print it out. Try it with the input data instead.

Second, from what I understand of your code, you're splitting each line up, so you won't find the data you want unless you look at the whole line, or modify the regex to look only at the last field (for example).
So I would recommend changing to this (for the appropriate stuff):
while ($line = <F>)
{
print $line; # $line already ends with a newline

# note: the match operator is =~ with no space between = and ~
if ($line =~ m/^FT\s+CDS\s+complement\(join\([0-9.,]+\)\)$/)
# search for the first line highlighted in bold
{
print "match\n";
} else {
print "no match\n";
}

# insert additional regexes here

}
close (F);

Third, once you've done that, try writing a regex for the second line you're trying to get. If you have problems, post it and I'll gladly take a look at your regex and make some suggestions.

Fourth, I would recommend putting a counter in, so that when you print "match" you print the line number as well (it makes troubleshooting easier).

Fifth (wow, I talk a lot, sorry): you can turn that one regex into something that will match each line you're looking for, but IMHO it'll make for a messy-looking regex. Anything you put in parentheses in a regex is captured and returned in variables, so you can extract the pieces you want. Still, I'm for cleaner, easier-to-follow code, so I would use a separate regex for each kind of line, unless the lines are really similar, and these don't look like it.
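As a rough sketch of the last two suggestions (one regex per kind of line, capture groups, and a line number with each match), something like the following could work. The helper name scan_cds_join and the record fields are made up for illustration. It assumes a /note value that wraps onto the next line ends at the first line finishing with a double quote, and it reads from an in-memory sample rather than the real Muguga.embl file. Perl's built-in $. variable supplies the current line number, so no separate counter is needed.

```perl
use strict;
use warnings;

# Hypothetical helper: read EMBL feature-table lines from a filehandle and
# return one hash per "CDS complement(join(...))" feature.
sub scan_cds_join {
    my ($fh) = @_;
    my (@records, $rec, $in_note);

    while (my $line = <$fh>) {
        chomp $line;

        if ($in_note) {
            # Continuation of a /note value that wrapped onto this line.
            (my $rest = $line) =~ s/^FT\s+//;
            $in_note = 0 if $rest =~ s/"$//;
            $rec->{note} .= " $rest";
        }
        elsif ($line =~ /^FT\s+CDS\s+(complement\(join\([0-9.,]+\)\))$/) {
            # Regex 1: the CDS line; $. is the current input line number.
            $rec = { line_no => $., location => $1 };
            push @records, $rec;
        }
        elsif ($line =~ /^FT\s+[A-Za-z]\w*\s+\S/) {
            # Any other feature key (gene, mRNA, plain CDS) ends the record.
            undef $rec;
        }
        elsif ($rec && $line =~ m{^FT\s+/locus_tag="([^"]*)"}) {
            # Regex 2: the locus_tag qualifier.
            $rec->{locus_tag} = $1;
        }
        elsif ($rec && $line =~ m{^FT\s+/note="([^"]*)("?)$}) {
            # Regex 3: the note qualifier; no closing quote means it wraps.
            $rec->{note} = $1;
            $in_note = $2 eq '' ? 1 : 0;
        }
    }
    return @records;
}

my $sample = <<'EMBL';
FT gene complement(<7216..>17805)
FT /locus_tag="TP01_0003"
FT CDS complement(join(18028..18116,19351..20668))
FT /locus_tag="TP01_0004"
FT /note="go_function: nutrient reservoir activity [goid
FT 0045735]"
FT /codon_start=1
FT gene complement(<18028..>20668)
FT /locus_tag="TP01_0004"
EMBL

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
for my $r (scan_cds_join($fh)) {
    printf "line %d: %s: %s\n",
        $r->{line_no}, $r->{locus_tag} // '?', $r->{note} // '';
}
close $fh;
```

To copy the results to a new file instead of the screen, open a second filehandle for writing and print to that. Keeping the three patterns separate makes it easy to see which one failed when a line is not picked up.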

Onaclov.

$string =~ /FT\s*CDS\s*\w+\(join\(.+\)\)/;

Try this; it isn't tested.

Thanks a lot, Wickedxter. The code worked perfectly.
God bless you.
