Hi all,

I have some data with a lot of columns, but I only want to write some of those columns out to a different file.
I found a similar question in the forum: http://www.daniweb.com/software-development/perl/threads/377421. I tried to use it to solve my problem, but without success.
So far I can output the data like this:

num  name              product               star_posi        end_posi
1    [gene=KIQ_00005] [protein=hypothetical [location=complement(<1..423)]

but I cannot separate the numbers in [location], and I cannot remove the [] from the data.
Below is my data:

>lcl|AGQQ01000001.1_cdsid_EHE85001.1 [gene=KIQ_00005] [protein=hypothetical protein] [partial=3'] [protein_id=EHE85001.1] [location=complement(<1..423)]
MSITTHVQALTTALNAIDNHLASMLDHGVTPDQYKAIEPDLIALEHTINHHATIAAQTTALAERTNAAHT
IGSTHLIDYLTTTFGLSKARAHHRINLAHSLYPIPKPNSGSGNGGNGGNPDAGPDGGPGDDDSGDDDPDP
E
>lcl|AGQQ01000001.1_cdsid_EHE85002.1 [gene=KIQ_00010] [protein=hypothetical protein] [protein_id=EHE85002.1] [location=710..1225]
MKPGHFYCEHCGVAHFGAPALNLKLPDPVIEAQSRGHRVSTGSSVCQILNPKPKRHFIKSNIEIPIDGGK
KKLDYGGWVEVEHSDLMTYLNYRNVMKKVRIPGFLASKFPGLEDSYGTPVLLTVKHEDYYPHFQPLEESS
SMYQDFHKGISSREADLRINSWLLVTHEYMR
etc...................................................

This is the output I am hoping for:

num   name           product              kind      star_posi        end_posi
1     gene=KIQ_00005 protein=hypothetical protein      1               423
2     gene=KIQ_00010 protein=hypothetical protein     710              1225
etc...............................................

Could you help me solve this problem? Thank you very much.

Edited 4 Years Ago by biojet: n/a

Attachments
>lcl|AGQQ01000001.1_cdsid_EHE85001.1 [gene=KIQ_00005] [protein=hypothetical protein] [partial=3'] [protein_id=EHE85001.1] [location=complement(<1..423)]
MSITTHVQALTTALNAIDNHLASMLDHGVTPDQYKAIEPDLIALEHTINHHATIAAQTTALAERTNAAHT
IGSTHLIDYLTTTFGLSKARAHHRINLAHSLYPIPKPNSGSGNGGNGGNPDAGPDGGPGDDDSGDDDPDP
E
>lcl|AGQQ01000001.1_cdsid_EHE85002.1 [gene=KIQ_00010] [protein=hypothetical protein] [protein_id=EHE85002.1] [location=710..1225]
MKPGHFYCEHCGVAHFGAPALNLKLPDPVIEAQSRGHRVSTGSSVCQILNPKPKRHFIKSNIEIPIDGGK
KKLDYGGWVEVEHSDLMTYLNYRNVMKKVRIPGFLASKFPGLEDSYGTPVLLTVKHEDYYPHFQPLEESS
SMYQDFHKGISSREADLRINSWLLVTHEYMR
>lcl|AGQQ01000001.1_cdsid_EHE85003.1 [gene=KIQ_00015] [protein=hypothetical protein] [protein_id=EHE85003.1] [location=complement(1226..1885)]
MFHAVSFALMDSVNVLLIGIIVAIAALLPRKGKYGPIATLLVAGDWLGVFLLSILVMLVFDGLEDLVQGF
LDSIWFGVILLVTGIVSFVATLVSKTDSTRKLDGFLAPVKTPSWKTVGAGLILGIVQSATSVPFYAGLGY
LSVGNFSPEIRYGGLVVYATLALSLPIIVAILVGMVRKYPESPVGRLFELIGQNKERVTKWSGYLVSLVL
CIMGITSIL
>lcl|AGQQ01000001.1_cdsid_EHE85004.1 [gene=KIQ_00020] [protein=ribonuclease activity regulator protein RraA] [protein_id=EHE85004.1] [location=complement(2038..2544)]
MGMTQSAPEFIATADLVDIIGDNAQSCDTQFQNLGGATEFHGTITTVKCFQDNALLKSILSEDNPGGVLV
IDGDASVHTALVGDIIAGLGKDHGWSGVIVNGAIRDSAVIGTMTFGCKALGTNPRKSTKTGSGERDVVVS
IGGIDFIPGHYVYADSDGIIVTEAPIKQ
>lcl|AGQQ01000001.1_cdsid_EHE85005.1 [gene=KIQ_00025] [protein=RND superfamily drug exporter] [protein_id=EHE85005.1] [location=2791..5166]
MAKFLYKLGSTAYQKKWPFLAVWLVILIGITALAGLYAKPTSSSFSIPGLDSVTTMEKMQERFPGSDDAT
SAPTGSVVIQAPEGKTLTDPEVEAEINQMLDEVRATGVLKDADSVVDPVLAAQGVAAQMTPALEAQGVPA
EKIAADIESISPLSADETTGIISMTFDADSAMDVSAEDREKVTNILNEYDDGDLTVVYNGNVFGAAATSL
DMTSELIGLLVAAVVLIVTFGSFIAAGMPLISAIIGVGIGIMGIQLATAFTDSVNDMTPTLASMIGLAVG
IDYALFIVSRFRNELISQTGANDLEPKELAERLRTMPLAARAHAMGMAVGTAGSAVVFAGTTVLIALVAL
SIINIPFLTVMAIAAAITVAIAVLVALSFLPALLGLLGTRIFAARVPGPKVPDPEDEKPTMGLKWVRLVR
KMPVAYLLVGVVLLGAIAIPATNMRLAMPTDGTSTLGTAPRTAYDMTADAFGPGRNAPMIALIDATDVPE
EERPLVFGQAVEQFLNTDGVKNAQITQTTENFDTAQILITPEFDAIDERTSETLATLRADAETFADDTGA
TYGITGVTPIYDDISARLGDVLVPYVLIVLVLAFLVLLLVFRSIWVPLIAALGFGLSVLATFGATVAIFQ
EGAFGIIDDPQPLLSFLPIMLIGLVFGLAMDYQIFLVTRMREGFTKGKTAGNATSNGFKHGARVVTAAAL
IMVSVFAAFIAQDMAFIKTMGFALAVAVFFDAFVVRMMIIPATMFLLDDKAWWLPKWLDKILPNVDVEGE
GLSELHEARTEELKENVGVGA
>lcl|AGQQ01000001.1_cdsid_EHE85006.1 [gene=KIQ_00030] [protein=hypothetical protein] [protein_id=EHE85006.1] [location=complement(5113..5250)]
MNASGPSIKAISAAARDNAVRVAAFFVSLSPDTYIFLQFLGASLM
>lcl|AGQQ01000001.1_cdsid_EHE85007.1 [gene=KIQ_00035] [protein=TETR-family transcriptional regulator] [protein_id=EHE85007.1] [location=5153..5734]
MSGLRETKKAATRTALSRAAAEIALMEGPEAFTVAAIAAAAGVSPRTFHNYFPSREDALVQFVVIRVQEL
TDQLYEFPTSVPPRDAIEQLVINQLRDGDDAMDSFSAMFRIGEILENLDPIKCVIDKERLIAPLLEFMVE
RDKDLDKFDAATLIHLHAAAIATSLHTFYQAPEPRDIEDGVALIRRACAWIKK
>lcl|AGQQ01000001.1_cdsid_EHE85008.1 [gene=KIQ_00040] [protein=corynomycolyl transferase] [partial=5'] [protein_id=EHE85008.1] [location=complement(5793..>6155)]
AMSNTCTHNLKAATDQMGIDNINYDFRPTGTHAWDYWNEALHRFFPLMMQGFGLDGGPIPIYNPNGVSSS
ESSSELSSDVSLGTVIGSVAGSSGSSEGSSVREFLAGSSGSSQSTGSFYE

...So far I can output the data like this:

num  name              product               star_posi        end_posi
1    [gene=KIQ_00005] [protein=hypothetical [location=complement(<1..423)]

but I cannot separate the numbers in [location], and I cannot remove the [] from the data...

To remove the square brackets from your text string you can do a regex substitution.

#!/usr/bin/perl
use strict;
use warnings;

my $name = '[gene=KIQ_00005]';
print "$name\n";

my $character_to_remove = '\['; # [ is a regex metacharacter, so escape it
$name =~ s/$character_to_remove//; # $name now contains gene=KIQ_00005]
print "$name\n";

$character_to_remove = '\]'; # escape ] as well
$name =~ s/$character_to_remove//; # $name now contains gene=KIQ_00005
print "$name\n";
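As a shorter alternative, and only as a sketch: s/// without the /g option removes just the first match, which is why the script above needs one substitution per bracket. The tr/// operator with the /d option deletes every listed character in one pass. The helper name strip_brackets is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the string with every [ and ] deleted.
sub strip_brackets {
    my ($text) = @_;
    (my $clean = $text) =~ tr/[]//d;    # tr with /d deletes all listed characters
    return $clean;
}

print strip_brackets('[gene=KIQ_00005]'), "\n";    # prints gene=KIQ_00005
```

Inside tr/// the brackets are ordinary characters, so no escaping is needed.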

To extract two substrings of sequential digits from a string you can do a regex match using the /g option to get a list.

#!/usr/bin/perl
use strict;
use warnings;

my $location = '[location=complement(<1..423)]';

my ($start, $end) = $location =~ m/\d+/g; # In list context, /g returns every run of digits

print "Start position is $start and end position is $end\n";
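If the strand matters too, the same idea can be extended. The following is a minimal sketch, assuming each location holds a single start..end range, optionally wrapped in complement(...) and optionally marked with < or > as in the sample data. The name parse_location and the +/- strand labels are my own choices, not part of the data format.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a [location=...] tag into strand, start and end.
# Returns an empty list when the text holds no simple start..end range.
sub parse_location {
    my ($tag) = @_;
    my ($start, $end) = $tag =~ m/<?(\d+)\.\.>?(\d+)/ or return;
    my $strand = $tag =~ m/complement/ ? '-' : '+';
    return ($strand, $start, $end);
}

for my $tag ('[location=complement(<1..423)]', '[location=710..1225]') {
    my ($strand, $start, $end) = parse_location($tag);
    print "$strand\t$start\t$end\n";
}
```

A location made of several ranges would yield only its first range with this pattern.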

Thank you very much for your help. It was very good.

But I have one more problem:
With the selected records in DATA, the script runs very well. But when I use the file 1406_01.txt as input, I get errors.

Could you please show me how to solve that problem?

Below is the script I wrote using your advice:

#!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;
     
    my @AoH;#Array of hash references
    while(my $line = <DATA>){
    chomp($line);
    my ($lcl, $id, $filename, $size, $aa) = split/\s+/, $line;
    push @AoH, {id => $id,
    filename => $filename,
    size => $size,
    aa   =>$aa};
    }
    print "ID\tPro\tstar\tend\n";
     
    foreach(@AoH)
    {
    my $name = $$_{filename};
    my $character_to_remove = '\['; #add required escape character before [
    $name =~ s/$character_to_remove//;# $name now contains gene=KIQ_00005]
    $character_to_remove = '\]'; #add required escape character before ]
    $name =~ s/$character_to_remove//;# $name now contains gene=KIQ_00005
     
     
    my $name1 = $$_{size};
    # $character_to_remove holds only '\]' at this point, so remove both brackets explicitly
    $name1 =~ s/[\[\]]//g;# $name1 now contains protein_id=EHE85001.1
     
    my $location = $$_{aa};
    my ($start, $end) = $location =~ m/\d+/g;#Regex match makes list of substrings of sequential digits
    print "$name\t$name1\t$start\t$end\n";
    }
     
    __DATA__
    >lcl|AGQQ01000001.1_cdsid_EHE85001.1 [gene=KIQ_00005] [protein_id=EHE85001.1] [location=complement(<1..423)] MSITTHVQALTTALNAIDNHLASMLDHGVTPDQYKAIEPDLIALEHTINHHATIAAQTTALAERTNAAHTIGSTHLIDYLTTTFGLSKARAHHRINLAHSLYPIPKPNSGSGNGGNGGNPDAGPDGGPGDDDSGDDDPDPE
    >lcl|AGQQ01000001.1_cdsid_EHE85002.1 [gene=KIQ_00010] [protein_id=EHE85002.1] [location=710..1225] MSITTHVQALTTALNAIDNHLASMLDHGVTPDQYKAIEPDLIALEHTINHHATIAAQTTALAERTNAAHTIGSTHLIDYLTTTFGLSKARAHHRINLAHSLYPIPKPNSGSGNGGNGGNPDAGPDGGPGDDDSGDDDPDPE


Thank you very much, d5e5.
I used your advice to write the script. It runs very well on the selected records.
Now I am trying to run the script on the 1406_01.txt file that I attached to this question earlier, and I get some errors.
I would like the output to look like this:

Id              product                  start     end    
gene=KIQ_00005  protein_id=EHE85001.1     1        423
gene=KIQ_00010  protein_id=EHE85002.1    710       1225

Could you show me how to produce that output from the 1406_01.txt data?

Because the location is not always in the same column, I use regular expressions (regex) instead of split to extract the desired data. To understand what the regex means, I recommend reading a regex tutorial.

#!/usr/bin/perl
use strict;
use warnings;

my $filename = '14067_01.txt';

open my $fh, '<', $filename or die "Failed to open $filename: $!";

printf "%-16s%21s%7s%7s\n", qw(id product start end);
while (my $rec = <$fh>) {
    next unless $rec =~ m/^>/;    # Skip everything except the header line of each record
    chomp($rec);
    # The location is not always in the same column, so capture each field by
    # its label with a regex instead of splitting on whitespace.
    my ($name, $product, $loc) =
        $rec =~ m/\[(gene=.+?)\]\s.*\[(protein_id=.+?)\]\s.*(\[location=.+\])/;
    next unless defined $loc;     # Skip headers that lack any of the three fields
    my ($start, $end) = $loc =~ m/\d+/g;
    printf "%-16s%21s%7s%7s\n", $name, $product, $start, $end;
}
close $fh;

Edited 4 Years Ago by d5e5: n/a

Thanks d5e5! It runs very well!


Hi d5e5,
Thank you for your help and support. It runs very well. Could I ask two more questions?

1. How can I extract more columns with this line of your script?

my ($name, $product, $loc) = $rec =~ m/\[(gene=.+?)\]\s.*\[(protein_id=.+)\]\s(\[.+\]?)/g;

This is my attempt:

my ($name, $product, $loc) = $rec =~ m/\[(gene=.+?)\]\s.*\[(protein_id=.+)\]\s(\[.+\]?)/g;
    my ($po) =  $_ =~ m/\[(protein=.+?)\]/g;

2. I have 43 data files to analyse in my project. Is there a way to process all 43 files in one run?

Thank you very much for your help.

If the script reads each line into the $rec variable, then you won't find anything in the $_ variable. my ($po) = $rec =~ m/\[(protein=.+?)\]/g; should work OK.
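If more and more columns are needed, one option is to capture every [key=value] tag of the header into a hash and then pick columns by name. This is a sketch rather than the approach used earlier in the thread, and parse_header is a hypothetical name. Note that the hash values do not include the "gene=" style prefix.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn every [key=value] tag of a header line into a hash.
sub parse_header {
    my ($rec) = @_;
    # In list context /g returns all (key, value) pairs, which fill the hash.
    my %field = $rec =~ m/\[(\w+)=([^\]]*)\]/g;
    return %field;
}

my $rec = q{>lcl|AGQQ01000001.1_cdsid_EHE85004.1 [gene=KIQ_00020] }
        . q{[protein=ribonuclease activity regulator protein RraA] }
        . q{[protein_id=EHE85004.1] [location=complement(2038..2544)]};

my %field = parse_header($rec);
my ($start, $end) = $field{location} =~ m/\d+/g;
print join("\t", @field{qw(gene protein protein_id)}, $start, $end), "\n";
```

Because the tags are looked up by name, it does not matter that optional tags such as [partial=3'] shift the position of the others.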

If you want to read many files, one at a time, you can assign the list of input file names into the @ARGV array and then read from the empty diamond operator.

#!/usr/bin/perl
use strict;
use warnings;

#Assign any number of file names to Perl's special @ARGV array
@ARGV = qw(file1.txt file2.txt file3.txt);#Example with 3 file names

#Each file automatically opens and closes as script reads the contents
while(my $rec = <>){
    chomp($rec);
    print "$rec\n";
}

You can get the same result by using a glob pattern to put the list of desired files into @ARGV, which may be easier than typing 43 file names.

#!/usr/bin/perl
use strict;
use warnings;

#Glob list of files matching pattern. ? matches exactly one character,
#so use a pattern like file*.txt if the file numbers can have two digits
@ARGV = <file?.txt>;

#Each file automatically opens and closes as script reads the contents
while(my $rec = <>){
    chomp($rec);
    print "$rec\n";
}
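Putting the pieces together, here is a hedged sketch that extracts the gene and location from the header lines of many files and also reports which file each row came from. It relies on Perl's special $ARGV variable, which holds the name of the file currently being read by the diamond operator. The helper name rows_from_files and the tab-separated output are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read the named files in turn and return one
# "file<TAB>gene<TAB>start<TAB>end" string per header line that has both tags.
sub rows_from_files {
    my @files = @_;
    return unless @files;      # with an empty @ARGV, <> would wait on STDIN
    local @ARGV = @files;      # the empty diamond operator reads from @ARGV
    my @rows;
    while (my $rec = <>) {
        next unless $rec =~ m/^>/;
        my ($gene)        = $rec =~ m/\[gene=([^\]]+)\]/           or next;
        my ($start, $end) = $rec =~ m/\[location=\D*(\d+)\D+(\d+)/ or next;
        push @rows, join("\t", $ARGV, $gene, $start, $end);   # $ARGV = current file name
    }
    return @rows;
}

# File names come from the command line here.
print "$_\n" for rows_from_files(@ARGV);
```

On a Unix shell the wildcard can be expanded on the command line (for example by passing *.txt as the argument); alternatively, pass glob('*.txt') to the helper inside the script.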

Edited 4 Years Ago by d5e5: Added glob example

Thanks a lot d5e5! It works beautifully!

