Hi all,

I have some data with a lot of columns, but I only want to write some of the columns out to a different file.
I found a similar question on the forum: http://www.daniweb.com/software-development/perl/threads/377421. I used that to try to solve my problem, but it was not successful.
I can output the data as:

num  name              product               start_posi       end_posi
1    [gene=KIQ_00005] [protein=hypothetical [location=complement(<1..423)]

but I cannot separate the numbers in [location], and I cannot remove the [] from the data.
Below is my data:

>lcl|AGQQ01000001.1_cdsid_EHE85001.1 [gene=KIQ_00005] [protein=hypothetical protein] [partial=3'] [protein_id=EHE85001.1] [location=complement(<1..423)]
MSITTHVQALTTALNAIDNHLASMLDHGVTPDQYKAIEPDLIALEHTINHHATIAAQTTALAERTNAAHT
IGSTHLIDYLTTTFGLSKARAHHRINLAHSLYPIPKPNSGSGNGGNGGNPDAGPDGGPGDDDSGDDDPDP
E
>lcl|AGQQ01000001.1_cdsid_EHE85002.1 [gene=KIQ_00010] [protein=hypothetical protein] [protein_id=EHE85002.1] [location=710..1225]
MKPGHFYCEHCGVAHFGAPALNLKLPDPVIEAQSRGHRVSTGSSVCQILNPKPKRHFIKSNIEIPIDGGK
KKLDYGGWVEVEHSDLMTYLNYRNVMKKVRIPGFLASKFPGLEDSYGTPVLLTVKHEDYYPHFQPLEESS
SMYQDFHKGISSREADLRINSWLLVTHEYMR
etc.

The output I hope for:

num   name           product              kind      start_posi       end_posi
1     gene=KIQ_00005 protein=hypothetical protein      1               423
2     gene=KIQ_00010 protein=hypothetical protein     710              1225
etc.

Could you help me to solve this problem? Thank you very much.

...I can output the data as:

num  name              product               start_posi       end_posi
1    [gene=KIQ_00005] [protein=hypothetical [location=complement(<1..423)]

but I cannot separate the numbers in [location], and I cannot remove the [] from the data...

To remove the square brackets from your text string you can do a regex substitution.

#!/usr/bin/perl
use strict;
use warnings;

my $name = '[gene=KIQ_00005]';
print "$name\n";

my $character_to_remove = '\['; # add required escape character before [
$name =~ s/$character_to_remove//; # $name now contains gene=KIQ_00005]
print "$name\n";

$character_to_remove = '\]'; # add required escape character before ]
$name =~ s/$character_to_remove//; # $name now contains gene=KIQ_00005
print "$name\n";
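
A shorter variant of the same idea, offered only as a sketch: the tr/// operator with the /d modifier deletes every listed character in one pass, so both brackets are removed at once and no escaping is needed. The helper name strip_brackets is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_brackets is a hypothetical helper name, not part of any module.
# tr/[]//d deletes every '[' and ']' found in the string.
sub strip_brackets {
    my ($text) = @_;
    $text =~ tr/[]//d;
    return $text;
}

print strip_brackets('[gene=KIQ_00005]'), "\n";
print strip_brackets('[protein=hypothetical protein]'), "\n";
```

Unlike a single s/// without /g, this also copes with a string that contains more than one pair of brackets.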

To extract two substrings of sequential digits from a string you can do a regex match using the /g option to get a list.

#!/usr/bin/perl
use strict;
use warnings;

my $location = '[location=complement(<1..423)]';

my ($start, $end) = $location =~ m/\d+/g; # Regex match makes list of substrings of sequential digits

print "Start position is $start and end position is $end\n";
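
One thing to keep in mind: m/\d+/g picks up every run of digits in the string, so it should only be applied to the location tag itself. As a sketch, the match can instead be anchored on "location=" and the ".." separator, which makes it safe to run against a whole header line. The helper name parse_location is hypothetical, and the pattern assumes a simple single "start..end" span.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_location is a hypothetical helper name used only for this sketch.
# It returns (start, end) for a simple "start..end" location, or an empty
# list when the text does not contain one.
sub parse_location {
    my ($text) = @_;
    # \D*? skips wrappers such as "complement(" and [<>]? skips the
    # partial-sequence markers that may precede a number.
    if ($text =~ m/location=\D*?[<>]?(\d+)\.\.[<>]?(\d+)/) {
        return ($1, $2);
    }
    return;
}

for my $tag ('[location=complement(<1..423)]', '[location=710..1225]') {
    my ($start, $end) = parse_location($tag);
    print "Start position is $start and end position is $end\n";
}
```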

Thank you very much for your help. It was very useful.

But I have one more problem:
With the sample DATA the script runs very well, but when I use the file 1406_01.txt as input I get an error.

Could you please show me how to solve that problem?

Below is the script I made using your advice:

#!/usr/bin/perl
use strict;
use warnings;

my @AoH; # Array of hash references, one per record
while (my $line = <DATA>) {
    chomp($line);
    # Fields in order: header, gene tag, protein_id tag, location tag, sequence
    my ($lcl, $gene, $protein_id, $location, $aa) = split /\s+/, $line;
    push @AoH, {
        gene       => $gene,
        protein_id => $protein_id,
        location   => $location,
        aa         => $aa,
    };
}
print "ID\tPro\tstart\tend\n";

foreach my $rec (@AoH) {
    my $name = $rec->{gene};
    $name =~ s/\[//; # $name now contains gene=KIQ_00005]
    $name =~ s/\]//; # $name now contains gene=KIQ_00005

    my $name1 = $rec->{protein_id};
    $name1 =~ s/\[//; # $name1 now contains protein_id=EHE85001.1]
    $name1 =~ s/\]//; # $name1 now contains protein_id=EHE85001.1

    my ($start, $end) = $rec->{location} =~ m/\d+/g; # list of substrings of sequential digits
    print "$name\t$name1\t$start\t$end\n";
}

__DATA__
>lcl|AGQQ01000001.1_cdsid_EHE85001.1 [gene=KIQ_00005] [protein_id=EHE85001.1] [location=complement(<1..423)] MSITTHVQALTTALNAIDNHLASMLDHGVTPDQYKAIEPDLIALEHTINHHATIAAQTTALAERTNAAHTIGSTHLIDYLTTTFGLSKARAHHRINLAHSLYPIPKPNSGSGNGGNGGNPDAGPDGGPGDDDSGDDDPDPE
>lcl|AGQQ01000001.1_cdsid_EHE85002.1 [gene=KIQ_00010] [protein_id=EHE85002.1] [location=710..1225] MSITTHVQALTTALNAIDNHLASMLDHGVTPDQYKAIEPDLIALEHTINHHATIAAQTTALAERTNAAHTIGSTHLIDYLTTTFGLSKARAHHRINLAHSLYPIPKPNSGSGNGGNGGNPDAGPDGGPGDDDSGDDDPDPE

Thank you very much, d5e5.
I used your advice to make the script. It runs very well on the sample data.
Now I am trying to run the script on 1406_01.txt, which I attached to this question earlier, and I get some errors.
I would like the output to look like this:

Id              product                  start     end
gene=KIQ_00005  protein_id=EHE85001.1     1        423
gene=KIQ_00010  protein_id=EHE85002.1    710       1225

Could you show me how to produce that output from the 1406_01.txt data?

Because location is not always in the same column I use regular expressions (regex) instead of split to extract the desired data. To understand what the regex means I recommend this link to a Regex Tutorial.

#!/usr/bin/perl
use strict;
use warnings;

my $filename = '14067_01.txt';

open my $fh, '<', $filename or die "Failed to open $filename: $!";

printf "%-16s%21s%7s%7s\n", qw(id product start end);
while (my $rec = <$fh>){
    next unless $rec =~ m/^>/; # Skip all lines other than the first line of record
    chomp($rec);
    # Because location is not always in the same column I use regular expressions (regex)
    # instead of split to extract the desired data.
    my ($name, $product, $loc) = $rec =~ m/\[(gene=.+?)\]\s.*\[(protein_id=.+)\]\s(\[.+\]?)/g;
    $loc = '' unless defined $loc;
    my ($start, $end) = $loc =~ m/\d+/g;
    # Replace missing values with empty strings so printf does not warn
    $_ //= '' for $name, $product, $start, $end;
    printf "%-16s%21s%7s%7s\n", ($name, $product, $start, $end);
}
close $fh;
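
Since the tags can appear in any order, another option is to collect every [key=value] pair on the header line into a hash and then look fields up by name. The following is only a sketch under that assumption. The helper name parse_tags is made up, and the header string is copied from the sample data in the first post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_tags is a hypothetical helper name for this sketch.
# In list context, m//g with two capture groups returns a flat list of
# key, value, key, value, ... which can be assigned straight to a hash.
sub parse_tags {
    my ($header) = @_;
    my %tag = $header =~ m/\[(\w+)=([^\]]*)\]/g;
    return %tag;
}

my $header = ">lcl|AGQQ01000001.1_cdsid_EHE85001.1 [gene=KIQ_00005] "
           . "[protein=hypothetical protein] [partial=3'] "
           . "[protein_id=EHE85001.1] [location=complement(<1..423)]";

my %tag = parse_tags($header);
my ($start, $end) = $tag{location} =~ m/\d+/g;
print join("\t", $tag{gene}, $tag{protein}, $tag{protein_id}, $start, $end), "\n";
```

With this approach, adding another column later only means printing one more hash key. No regex change is needed.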

Thanks d5e5! It runs very well!

Hi d5e5,
Thank you for your help and support. It runs very well. Could I please ask two more questions?

1. How can I extract more columns from the same line in your script?

my ($name, $product, $loc) = $rec =~ m/\[(gene=.+?)\]\s.*\[(protein_id=.+)\]\s(\[.+\]?)/g;

This is my attempt:

my ($name, $product, $loc) = $rec =~ m/\[(gene=.+?)\]\s.*\[(protein_id=.+)\]\s(\[.+\]?)/g;
my ($po) = $_ =~ m/\[(protein=.+?)\]/g;

2. I have 43 data files to analyse in my project. Is there a way to run the script on all 43 files at once?

Thank you very much for your help.

If the script reads each line into the $rec variable, then you won't find anything in the $_ variable. my ($po) = $rec =~ m/\[(protein=.+?)\]/g; should work OK.
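
To illustrate the point about $_, here is a small sketch. It reads from an in-memory string instead of a real file (an assumption made only to keep the example self-contained) and records what $_ holds while the loop variable $rec receives each line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The sample text is made up; a filehandle opened on a scalar reference
# behaves like a file for reading.
my $text = "first line\nsecond line\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @report;
while (my $rec = <$fh>) {
    chomp $rec;
    # $_ is only filled automatically by the bare form: while (<$fh>) { ... }
    my $default = defined($_) ? $_ : 'undef';
    push @report, "rec=$rec default=$default";
}
close $fh;

print "$_\n" for @report;
```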

If you want to read many files, one at a time, you can assign the list of input file names into the @ARGV array and then read from the empty diamond operator.

#!/usr/bin/perl
use strict;
use warnings;

# Assign any number of file names to Perl's special @ARGV array
@ARGV = qw(file1.txt file2.txt file3.txt); # Example with 3 file names

# Each file automatically opens and closes as script reads the contents
while(my $rec = <>){
    chomp($rec);
    print "$rec\n";
}

You can get the same result by using a glob pattern to put the list of desired files into @ARGV, which may be easier than typing 43 file names.

#!/usr/bin/perl
use strict;
use warnings;

@ARGV = <file?.txt>; # Glob list of files matching pattern. ? stands for any single character

# Each file automatically opens and closes as script reads the contents
while(my $rec = <>){
    chomp($rec);
    print "$rec\n";
}
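
One caveat with the diamond operator: if the glob matches nothing, @ARGV is empty and <> falls back to reading standard input. As a sketch of an alternative, each file can be opened explicitly, which also makes it easy to print the file name next to each row. The helper name rows_from_files and the '*.txt' pattern are assumptions for illustration. The patterns look up each tag by name rather than reusing the combined regex from the earlier script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# rows_from_files is a hypothetical helper name for this sketch.
# It returns one tab-separated row per header line found in the given files.
sub rows_from_files {
    my @files = @_;
    my @rows;
    for my $file (@files) {
        open my $fh, '<', $file or die "Failed to open $file: $!";
        while (my $rec = <$fh>) {
            next unless $rec =~ m/^>/;    # only header lines carry the tags
            my ($gene)        = $rec =~ m/\[(gene=[^\]]+)\]/;
            my ($product)     = $rec =~ m/\[(protein_id=[^\]]+)\]/;
            my ($start, $end) = $rec =~ m/\[location=[^\]]*?(\d+)\.\.\D*?(\d+)/;
            # Missing fields become empty strings so join does not warn
            push @rows, join "\t",
                map { defined($_) ? $_ : '' } ($file, $gene, $product, $start, $end);
        }
        close $fh;
    }
    return @rows;
}

# '*.txt' is an assumed pattern; adjust it to match the real file names.
print "$_\n" for rows_from_files(glob('*.txt'));
```

If no file matches the pattern, the loop simply does nothing and the script prints no rows.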

Thanks a lot d5e5! It works beautifully!
