Hi everybody. Here's an interesting problem to solve. I have a text file like this (also attached):

>first
TTCCCAAAAAAGACCTACTAAGTCAAGCGGATGCGTTTTGTGTCTTATGG
AAAGTCCCTGACGGATACGAGGCTTTGGGTGATTCGGTACGAATGATTCG
GTTACCAGAACTTACCGAAGAAGAAATGGGACGAACCGAGGTTTCTCGTT
CGTGTGCTAATCCTACATTCAAACATCGATTTCGATCAGAGTTTGTTTTT
CATGAAGAACAGACATTCGTATTACGTGTTTACGATGAAGATTTGAGGTA
>firsta
TTCCCAAAAAAGACCTACTAAGTCAAGCGGATGCGTTTTGTGTCTTATGG
AAAGTCCCTGACGGATACGAGGCTTTGG----------------------
-----------------AAGAAGAAATGGGACGAACCGAGGTTTCTCGTT
CGTGTGCTAATCCTACATTCAAACATCGATTTCGATCAGAGTTT------
CATGAAGAACAGACATTCGTATTACGTGTTTACGATGAAGATTTGAGGTA

Both >first and >firsta contain the same characters, except for the parts replaced by hyphens. Is it possible to write a Perl script that extracts, for each line, the text that comes after >firsta and before the start of the hyphens? Also, would it be possible to extract the unmatched text from >first?
Please note that both >first and >firsta are in the same text file, and the other similar text files I am using might contain more lines like these.
Thanks a lot in advance.

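Try this code:

# Read the whole input file (named on the command line) into $cont.
open my $txt, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!";
read $txt, my $cont, -s $txt;
close $txt;

# Capture the lines under ">first" and the lines under ">firsta".
my ($first)  = $cont =~ m#\Q>first\E$(.*?)\Q>firsta\E$#sm;
my ($firsta) = $cont =~ m#\Q>firsta\E$(.*)$#sm;

# Store each non-empty line under a zero-padded line number ("0001", "0002", ...).
my (%first, %firsta);
for (my $i=1; $first=~m#(.+)#mg; $i++){
	$i=sprintf("%04d", $i);
	$first{$i}=$1;
}
for (my $i=1; $firsta=~m#(.+)#mg; $i++){
	$i=sprintf("%04d", $i);
	$firsta{$i}=$1;
}

# Print every line of >first that differs from the same line of >firsta.
foreach (sort keys %firsta){
	if ($first{$_} ne $firsta{$_}){
		print "$first{$_} unmatched in line ", ($_ + 0), "\n";
	}
}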

Thanks and regards,
yuvanbala

Hi, thanks, but it doesn't let me specify my file name and path. Also, as I am a new programmer, comments would help me. Thanks a lot.

Pass your file name as a command-line argument, as below:

perl filename.pl example.txt

The script then reads $ARGV[0] as the input file.
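
For a new programmer, a minimal sketch of how the file name travels from the command line into the script may help. It is not part of the script above: slurp is a hypothetical helper name, and the usage message is only a suggested wording.

```perl
use strict;
use warnings;

# Read a whole file into one string (slurp is a hypothetical helper name).
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open '$path': $!\n";
    local $/;                  # no record separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

# @ARGV holds whatever was typed after the script name,
# so $ARGV[0] is the first argument (here, the input file).
my $file = $ARGV[0];
if (defined $file) {
    my $cont = slurp($file);
    print "Read ", length($cont), " characters from $file\n";
}
else {
    print STDERR "Usage: perl $0 <input-file>\n";
}
```

Running it as perl filename.pl C:\data\example.txt (any path works, relative or absolute) puts that path in $ARGV[0].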

Hi, thanks. This one gives only the unmatched part, though. One more thing: I have other files that do not contain >first and >firsta, so do I need to change the script every time?
Thanks a lot.

Another bug is that it extracts the whole line, not only the unmatched part. My problem was like this:
first, the script should extract the matched part between >first and >firsta; then it should extract the unmatched part from >first and >firsta.
Thanks.

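Hi, you already said that '-' is the only difference, so I have fixed the script accordingly. It now reads:

# Read the whole input file (named on the command line) into $cont.
open my $txt, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!";
read $txt, my $cont, -s $txt;
close $txt;

# Capture the lines under ">first" and the lines under ">firsta".
my ($first)  = $cont =~ m#\Q>first\E$(.*?)\Q>firsta\E$#sm;
my ($firsta) = $cont =~ m#\Q>firsta\E$(.*)$#sm;

# Store each non-empty line under a zero-padded line number ("0001", "0002", ...).
my (%first, %firsta, @matched, @unmatched);
for (my $i=1; $first=~m#(.+)#mg; $i++){
	$i=sprintf("%04d", $i);
	$first{$i}=$1;
}
for (my $i=1; $firsta=~m#(.+)#mg; $i++){
	$i=sprintf("%04d", $i);
	$firsta{$i}=$1;
}

foreach (sort keys %firsta){
	if ($first{$_} eq $firsta{$_}){
		#print ("$first{$_} matched in line ",  ($_ + 0), "\n");
		push (@matched, ($_ + 0));
	}
	else{
		# Strip the hyphens from the >firsta line, then delete what is left
		# from the >first line, so that only the unmatched text remains.
		# This assumes the remaining >firsta text is one contiguous run,
		# i.e. the hyphens sit at the start or end of the line.
		$firsta{$_}=~s#-##g;
		$first{$_}=~s#\Q$firsta{$_}\E##g if length $firsta{$_};
		#print ("$first{$_} unmatched in line ",  ($_ + 0), "\n");
		push (@unmatched, ($_ + 0));
	}
}
print "List of Matched lines\n", '=' x 25, "\n";
print "Matched line: $_\n" for @matched;

print "\n\nList of Unmatched lines\n", '=' x 25, "\n";
foreach (@unmatched){
	print "Line $_:";
	$_=sprintf("%04d", $_);
	print "\t$first{$_}\n";
}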

Thanks and regards,
yuvanbala

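Try this code. You no longer need to specify >first and >firsta; the script detects the header lines automatically:

# Read the whole input file (named on the command line) into $cont.
open my $txt, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!";
read $txt, my $cont, -s $txt;
close $txt;

# $first:  the lines between the first header (a line starting with ">") and the next ">".
# $firsta: everything after the second header, up to the end of the file.
# This assumes the file holds exactly two records.
my ($first)  = $cont =~ m#\A^>.*?$(.*?)(?=>)#sm;
my ($firsta) = $cont =~ m#.+?^>.*?$(.*?)\Z#sm;

# Store each non-empty line under a zero-padded line number ("0001", "0002", ...).
my (%first, %firsta, @matched, @unmatched);
for (my $i=1; $first=~m#(.+)#mg; $i++){
	$i=sprintf("%04d", $i);
	$first{$i}=$1;
}
for (my $i=1; $firsta=~m#(.+)#mg; $i++){
	$i=sprintf("%04d", $i);
	$firsta{$i}=$1;
}

foreach (sort keys %firsta){
	if ($first{$_} eq $firsta{$_}){
		#print ("$first{$_} matched in line ",  ($_ + 0), "\n");
		push (@matched, ($_ + 0));
	}
	else{
		# Strip the hyphens from the second record's line, then delete what
		# is left from the first record's line, so that only the unmatched
		# text remains. This assumes the remaining text is one contiguous
		# run, i.e. the hyphens sit at the start or end of the line.
		$firsta{$_}=~s#-##g;
		$first{$_}=~s#\Q$firsta{$_}\E##g if length $firsta{$_};
		#print ("$first{$_} unmatched in line ",  ($_ + 0), "\n");
		push (@unmatched, ($_ + 0));
	}
}
print "List of Matched lines\n", '=' x 25, "\n";
print "Matched line: $_\n" for @matched;

print "\n\nList of Unmatched lines\n", '=' x 25, "\n";
foreach (@unmatched){
	print "Line $_:";
	$_=sprintf("%04d", $_);
	print "\t$first{$_}\n";
}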

Thanks and regards,
yuvanbala
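
A possible variation, offered only as a sketch and not as the script posted above: instead of deleting the hyphen-free text from the full line, it copies the characters of the full line that sit at the hyphen positions, so hyphens in the middle of a line are handled too. It assumes the records come in pairs (the full sequence first, its hyphenated copy second) and that paired lines have the same length. The names parse_records, gap_text and compare_records are hypothetical.

```perl
use strict;
use warnings;

# Split FASTA-style text into records: ( [header, [line, line, ...]], ... ).
sub parse_records {
    my ($text) = @_;
    my @records;
    for my $line (split /\r?\n/, $text) {
        next unless length $line;                 # skip blank lines
        if ($line =~ /^>(.*)/) {
            push @records, [ $1, [] ];            # a header starts a new record
        }
        elsif (@records) {
            push @{ $records[-1][1] }, $line;     # sequence line of current record
        }
    }
    return @records;
}

# Return the characters of $ref that sit where $gapped has '-'.
sub gap_text {
    my ($ref, $gapped) = @_;
    my $out = '';
    while ($gapped =~ /-+/g) {
        last if $-[0] > length $ref;              # $ref is shorter than $gapped
        # $-[0] and $+[0] are the start and end offsets of the last match.
        $out .= substr($ref, $-[0], $+[0] - $-[0]);
    }
    return $out;
}

# Compare two records line by line; returns a list of [line_no, kept, missing].
# Assumption: a line of the second record without hyphens equals the first record's line.
sub compare_records {
    my ($ref, $gapped) = @_;
    my @report;
    for my $i (0 .. $#{ $gapped->[1] }) {
        my $g = $gapped->[1][$i];
        my $r = defined $ref->[1][$i] ? $ref->[1][$i] : '';
        next unless $g =~ /-/;
        (my $kept = $g) =~ tr/-//d;               # second record's text without hyphens
        push @report, [ $i + 1, $kept, gap_text($r, $g) ];
    }
    return @report;
}

if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open '$ARGV[0]': $!\n";
    my $text = do { local $/; <$fh> };            # read the whole file at once
    close $fh;
    my @records = parse_records($text);
    # Take the records two at a time: full sequence, then hyphenated copy.
    while (my ($ref, $gapped) = splice(@records, 0, 2)) {
        last unless $gapped;                      # odd record left over
        print "$ref->[0] vs $gapped->[0]\n";
        printf "  line %d: kept %s, missing %s\n", @$_
            for compare_records($ref, $gapped);
    }
}
```

Because the headers are read from the file, nothing needs to be changed for files whose records are not named >first and >firsta, as long as the pairing assumption holds.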

Hi, thanks, it works fine. I will mark this as solved so you get proper credit for it.
It would be great if you could tell me which book(s) you followed to learn Perl. Since I am new, it would help me. Cheers.
