Hi all,

I have some data with a lot of columns, but I want to write only some of those columns to a different file.
I found a similar question on this forum at http://www.daniweb.com/software-development/perl/threads/377421 and tried to use it to solve my problem, but it was not successful.
So far I can output the data like this:

num  name              product               star_posi        end_posi
1    [gene=KIQ_00005] [protein=hypothetical [location=complement(<1..423)]

but I cannot separate the numbers in [location], and I cannot remove the [] from the data.
Below is my data:

>lcl|AGQQ01000001.1_cdsid_EHE85001.1 [gene=KIQ_00005] [protein=hypothetical protein] [partial=3'] [protein_id=EHE85001.1] [location=complement(<1..423)]
MSITTHVQALTTALNAIDNHLASMLDHGVTPDQYKAIEPDLIALEHTINHHATIAAQTTALAERTNAAHT
IGSTHLIDYLTTTFGLSKARAHHRINLAHSLYPIPKPNSGSGNGGNGGNPDAGPDGGPGDDDSGDDDPDP
E
>lcl|AGQQ01000001.1_cdsid_EHE85002.1 [gene=KIQ_00010] [protein=hypothetical protein] [protein_id=EHE85002.1] [location=710..1225]
MKPGHFYCEHCGVAHFGAPALNLKLPDPVIEAQSRGHRVSTGSSVCQILNPKPKRHFIKSNIEIPIDGGK
KKLDYGGWVEVEHSDLMTYLNYRNVMKKVRIPGFLASKFPGLEDSYGTPVLLTVKHEDYYPHFQPLEESS
SMYQDFHKGISSREADLRINSWLLVTHEYMR
etc...................................................

I hope to get this output:

num   name           product              kind      star_posi        end_posi
1     gene=KIQ_00005 protein=hypothetical protein      1               423
2     gene=KIQ_00010 protein=hypothetical protein     710              1225
etc...............................................

Could you help me to solve this problem? Thank you very much.

Attachments
>lcl|AGQQ01000001.1_cdsid_EHE85001.1 [gene=KIQ_00005] [protein=hypothetical protein] [partial=3'] [protein_id=EHE85001.1] [location=complement(<1..423)]
MSITTHVQALTTALNAIDNHLASMLDHGVTPDQYKAIEPDLIALEHTINHHATIAAQTTALAERTNAAHT
IGSTHLIDYLTTTFGLSKARAHHRINLAHSLYPIPKPNSGSGNGGNGGNPDAGPDGGPGDDDSGDDDPDP
E
>lcl|AGQQ01000001.1_cdsid_EHE85002.1 [gene=KIQ_00010] [protein=hypothetical protein] [protein_id=EHE85002.1] [location=710..1225]
MKPGHFYCEHCGVAHFGAPALNLKLPDPVIEAQSRGHRVSTGSSVCQILNPKPKRHFIKSNIEIPIDGGK
KKLDYGGWVEVEHSDLMTYLNYRNVMKKVRIPGFLASKFPGLEDSYGTPVLLTVKHEDYYPHFQPLEESS
SMYQDFHKGISSREADLRINSWLLVTHEYMR
>lcl|AGQQ01000001.1_cdsid_EHE85003.1 [gene=KIQ_00015] [protein=hypothetical protein] [protein_id=EHE85003.1] [location=complement(1226..1885)]
MFHAVSFALMDSVNVLLIGIIVAIAALLPRKGKYGPIATLLVAGDWLGVFLLSILVMLVFDGLEDLVQGF
LDSIWFGVILLVTGIVSFVATLVSKTDSTRKLDGFLAPVKTPSWKTVGAGLILGIVQSATSVPFYAGLGY
LSVGNFSPEIRYGGLVVYATLALSLPIIVAILVGMVRKYPESPVGRLFELIGQNKERVTKWSGYLVSLVL
CIMGITSIL
>lcl|AGQQ01000001.1_cdsid_EHE85004.1 [gene=KIQ_00020] [protein=ribonuclease activity regulator protein RraA] [protein_id=EHE85004.1] [location=complement(2038..2544)]
MGMTQSAPEFIATADLVDIIGDNAQSCDTQFQNLGGATEFHGTITTVKCFQDNALLKSILSEDNPGGVLV
IDGDASVHTALVGDIIAGLGKDHGWSGVIVNGAIRDSAVIGTMTFGCKALGTNPRKSTKTGSGERDVVVS
IGGIDFIPGHYVYADSDGIIVTEAPIKQ
>lcl|AGQQ01000001.1_cdsid_EHE85005.1 [gene=KIQ_00025] [protein=RND superfamily drug exporter] [protein_id=EHE85005.1] [location=2791..5166]
MAKFLYKLGSTAYQKKWPFLAVWLVILIGITALAGLYAKPTSSSFSIPGLDSVTTMEKMQERFPGSDDAT
SAPTGSVVIQAPEGKTLTDPEVEAEINQMLDEVRATGVLKDADSVVDPVLAAQGVAAQMTPALEAQGVPA
EKIAADIESISPLSADETTGIISMTFDADSAMDVSAEDREKVTNILNEYDDGDLTVVYNGNVFGAAATSL
DMTSELIGLLVAAVVLIVTFGSFIAAGMPLISAIIGVGIGIMGIQLATAFTDSVNDMTPTLASMIGLAVG
IDYALFIVSRFRNELISQTGANDLEPKELAERLRTMPLAARAHAMGMAVGTAGSAVVFAGTTVLIALVAL
SIINIPFLTVMAIAAAITVAIAVLVALSFLPALLGLLGTRIFAARVPGPKVPDPEDEKPTMGLKWVRLVR
KMPVAYLLVGVVLLGAIAIPATNMRLAMPTDGTSTLGTAPRTAYDMTADAFGPGRNAPMIALIDATDVPE
EERPLVFGQAVEQFLNTDGVKNAQITQTTENFDTAQILITPEFDAIDERTSETLATLRADAETFADDTGA
TYGITGVTPIYDDISARLGDVLVPYVLIVLVLAFLVLLLVFRSIWVPLIAALGFGLSVLATFGATVAIFQ
EGAFGIIDDPQPLLSFLPIMLIGLVFGLAMDYQIFLVTRMREGFTKGKTAGNATSNGFKHGARVVTAAAL
IMVSVFAAFIAQDMAFIKTMGFALAVAVFFDAFVVRMMIIPATMFLLDDKAWWLPKWLDKILPNVDVEGE
GLSELHEARTEELKENVGVGA
>lcl|AGQQ01000001.1_cdsid_EHE85006.1 [gene=KIQ_00030] [protein=hypothetical protein] [protein_id=EHE85006.1] [location=complement(5113..5250)]
MNASGPSIKAISAAARDNAVRVAAFFVSLSPDTYIFLQFLGASLM
>lcl|AGQQ01000001.1_cdsid_EHE85007.1 [gene=KIQ_00035] [protein=TETR-family transcriptional regulator] [protein_id=EHE85007.1] [location=5153..5734]
MSGLRETKKAATRTALSRAAAEIALMEGPEAFTVAAIAAAAGVSPRTFHNYFPSREDALVQFVVIRVQEL
TDQLYEFPTSVPPRDAIEQLVINQLRDGDDAMDSFSAMFRIGEILENLDPIKCVIDKERLIAPLLEFMVE
RDKDLDKFDAATLIHLHAAAIATSLHTFYQAPEPRDIEDGVALIRRACAWIKK
>lcl|AGQQ01000001.1_cdsid_EHE85008.1 [gene=KIQ_00040] [protein=corynomycolyl transferase] [partial=5'] [protein_id=EHE85008.1] [location=complement(5793..>6155)]
AMSNTCTHNLKAATDQMGIDNINYDFRPTGTHAWDYWNEALHRFFPLMMQGFGLDGGPIPIYNPNGVSSS
ESSSELSSDVSLGTVIGSVAGSSGSSEGSSVREFLAGSSGSSQSTGSFYE

...So far I can output the data like this:

num  name              product               star_posi        end_posi
1    [gene=KIQ_00005] [protein=hypothetical [location=complement(<1..423)]

but I cannot separate the numbers in [location], and I cannot remove the [] from the data...

To remove the square brackets from your text string you can do a regex substitution.

#!/usr/bin/perl
use strict; 
use warnings; 

my $name = '[gene=KIQ_00005]';
print "$name\n";

my $character_to_remove = '\['; #add required escape character before [
$name =~ s/$character_to_remove//;# $name now contains gene=KIQ_00005]
print "$name\n";

$character_to_remove = '\]'; #add required escape character before ]
$name =~ s/$character_to_remove//;# $name now contains gene=KIQ_00005
print "$name\n";
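
Each substitution above removes only the first matching bracket. As a minimal sketch, assuming every square bracket in a field should be removed, a character class with the /g modifier does the job in one step. The helper name strip_brackets is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the string with every [ and ] removed.
sub strip_brackets {
    my ($text) = @_;
    $text =~ s/[\[\]]//g;    # the character class matches either bracket; /g removes all of them
    return $text;
}

print strip_brackets('[gene=KIQ_00005]'), "\n";
print strip_brackets('[protein=hypothetical protein]'), "\n";
```

Because the function works on a copy, the original string stays untouched, which can be handy if the bracketed form is needed later.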

To extract two substrings of sequential digits from a string you can do a regex match using the /g option to get a list.

#!/usr/bin/perl
use strict; 
use warnings; 

my $location = '[location=complement(<1..423)]';

my ($start, $end) = $location =~ m/\d+/g;#Regex match makes list of substrings of sequential digits

print "Start position is $start and end position is $end";
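
Matching every run of digits works as long as the string holds nothing but the location. A slightly stricter sketch, assuming the tag always looks like [location=START..END] with optional complement( ), < and > markers as in the sample data, anchors the match on the tag itself and also reports the strand. parse_location is a hypothetical name, not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (start, end, strand), or an empty list when
# no location tag in the assumed format is found.
sub parse_location {
    my ($header) = @_;
    my ($body) = $header =~ m/\[location=([^\]]+)\]/ or return;
    my $strand = $body =~ m/^complement\(/ ? '-' : '+';
    my ($start, $end) = $body =~ m/[<>]?(\d+)\.\.[<>]?(\d+)/ or return;
    return ($start, $end, $strand);
}

for my $header (
    '[gene=KIQ_00005] [location=complement(<1..423)]',
    '[gene=KIQ_00010] [location=710..1225]',
) {
    my ($start, $end, $strand) = parse_location($header);
    print "$start\t$end\t$strand\n";
}
```

Anchoring on the tag means digits elsewhere in the line, such as those in the gene name, cannot be mistaken for positions.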

Thank you very much for your help. It was very good.

But I have one more problem:
With the sample lines under DATA the script runs very well, but when I use the file 1406_01.txt as input I get errors.

Could you please show me how to solve that problem?

Below is the script I wrote using your advice:

#!/usr/bin/perl;
    use strict;
    use warnings;
    use Data::Dumper;
     
    my @AoH;#Array of hash references
    while(my $line = <DATA>){
    chomp($line);
    my ($lcl, $id, $filename, $size, $aa) = split/\s+/, $line;
    push @AoH, {id => $id,
    filename => $filename,
    size => $size,
    aa   =>$aa};
    }
    print "ID\tPro\tstar\tend\n";
     
    foreach(@AoH)
    {
    my $name = $$_{filename};
    my $character_to_remove = '\['; #add required escape character before [
    $name =~ s/$character_to_remove//;# $name now contains gene=KIQ_00005]
    $character_to_remove = '\]'; #add required escape character before ]
    $name =~ s/$character_to_remove//;# $name now contains gene=KIQ_00005
     
     
    my $name1 = $$_{size};
    #my $character_to_remove1 = '\['; #add required escape character before [
    $name1 =~ s/$character_to_remove//;# $name now contains gene=KIQ_00005]
    #$character_to_remove1 = '\]'; #add required escape character before ]
    $name1 =~ s/$character_to_remove//;# $name now contains gene=KIQ_00005
     
    my $location = $$_{aa};
    my ($start, $end) = $location =~ m/\d+/g;#Regex match makes list of substrings of sequential digits
    print "$name\t$name1\t$start\t$end\n";
    }
     
    __DATA__
    >lcl|AGQQ01000001.1_cdsid_EHE85001.1 [gene=KIQ_00005] [protein_id=EHE85001.1] [location=complement(<1..423)] MSITTHVQALTTALNAIDNHLASMLDHGVTPDQYKAIEPDLIALEHTINHHATIAAQTTALAERTNAAHTIGSTHLIDYLTTTFGLSKARAHHRINLAHSLYPIPKPNSGSGNGGNGGNPDAGPDGGPGDDDSGDDDPDPE
    >lcl|AGQQ01000001.1_cdsid_EHE85002.1 [gene=KIQ_00010] [protein_id=EHE85002.1] [location=710..1225] MSITTHVQALTTALNAIDNHLASMLDHGVTPDQYKAIEPDLIALEHTINHHATIAAQTTALAERTNAAHTIGSTHLIDYLTTTFGLSKARAHHRINLAHSLYPIPKPNSGSGNGGNGGNPDAGPDGGPGDDDSGDDDPDPE

Thank you very much, d5e5.
I used your advice to write the script. It runs very well on the data I selected.
Now I am trying to run the script on 1406_01.txt, which was attached to my original question, and I get some errors.
I would like to get output like this:

Id              product                  start     end    
gene=KIQ_00005  protein_id=EHE85001.1     1        423
gene=KIQ_00010  protein_id=EHE85002.1    710       1225

Could you show me how I can produce that output from the 1406_01.txt data?

Because location is not always in the same column I use regular expressions (regex) instead of split to extract the desired data. To understand what the regex means I recommend this link to a Regex Tutorial.

#!/usr/bin/perl
use strict;
use warnings;

my $filename = '14067_01.txt';

open my $fh, '<', $filename or die "Failed to open $filename: $!";

# Header uses the same column widths as the data rows so the columns line up
printf "%-16s%21s%7s%7s\n", qw(id product start end);
while (my $rec = <$fh>){
    next unless $rec =~ m/^>/; # Skip all lines other than the first line of each record
    chomp($rec);
    # Because location is not always in the same column I use regular expressions (regex)
    # instead of split to extract the desired data.
    my ($name, $product, $loc) = $rec =~ m/\[(gene=.+?)\]\s.*\[(protein_id=.+)\]\s(\[.+\]?)/g;
    unless (defined $loc){ # No match: report the line and move on instead of printing undefined values
        warn "Could not parse line $.: $rec\n";
        next;
    }
    my ($start, $end) = $loc =~ m/\d+/g; # First two runs of digits are the start and end positions
    printf "%-16s%21s%7s%7s\n", ($name, $product, $start, $end);
}
close $fh;
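
Since each header is a series of [key=value] tags, another option is to collect all the tags into a hash and look fields up by name, so the order and number of tags no longer matter. The following is only a sketch under that assumption. parse_header is a hypothetical helper, the two sample lines are copied from the data posted in the question, and the // operator needs Perl 5.10 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turns one header line into a hash of its [key=value] tags.
sub parse_header {
    my ($line) = @_;
    my %tag = $line =~ m/\[(\w+)=([^\]]*)\]/g;    # each match contributes one key and one value
    return \%tag;
}

my @headers = (
    q{>lcl|AGQQ01000001.1_cdsid_EHE85001.1 [gene=KIQ_00005] [protein=hypothetical protein] [partial=3'] [protein_id=EHE85001.1] [location=complement(<1..423)]},
    q{>lcl|AGQQ01000001.1_cdsid_EHE85002.1 [gene=KIQ_00010] [protein=hypothetical protein] [protein_id=EHE85002.1] [location=710..1225]},
);

printf "%-12s%-24s%-14s%8s%8s\n", qw(gene protein protein_id start end);
for my $line (@headers) {
    my $tag = parse_header($line);
    # Pull the two position numbers out of the location value, if there is one
    my ($start, $end) = ($tag->{location} // '') =~ m/\d+/g;
    printf "%-12s%-24s%-14s%8s%8s\n",
        $tag->{gene}       // '-',
        $tag->{protein}    // '-',
        $tag->{protein_id} // '-',
        $start             // '-',
        $end               // '-';
}
```

With this layout, adding another column is a matter of printing one more hash key rather than extending the regular expression.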

Thanks d5e5! It runs very well!

Hi d5e5,
Thank you for your help and support. It runs very well. Could I ask you two more questions?

1. How can I extract more columns from the same line in your script?

my ($name, $product, $loc) = $rec =~ m/\[(gene=.+?)\]\s.*\[(protein_id=.+)\]\s(\[.+\]?)/g;

This is my attempt:

my ($name, $product, $loc) = $rec =~ m/\[(gene=.+?)\]\s.*\[(protein_id=.+)\]\s(\[.+\]?)/g;
    my ($po) =  $_ =~ m/\[(protein=.+?)\]/g;

2. I have 43 data files to analyse in my project. Is there a way to process all 43 files in one run?

Thank you very much for your help.

If the script reads each line into the $rec variable, then you won't find anything in the $_ variable. my ($po) = $rec =~ m/\[(protein=.+?)\]/g; should work OK.
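
To illustrate the point about $_: a match is applied to whatever variable the binding operator =~ names, and falls back to $_ only when there is no binding. A small sketch, using one header line from the attached data, with each extra column pulled out by its own match on $rec:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $rec = '>lcl|AGQQ01000001.1_cdsid_EHE85004.1 [gene=KIQ_00020] [protein=ribonuclease activity regulator protein RraA] [protein_id=EHE85004.1] [location=complement(2038..2544)]';

# Bind every match to $rec explicitly. A loop written as
# "while (my $rec = <$fh>)" never assigns to $_.
my ($gene)    = $rec =~ m/\[(gene=.+?)\]/;
my ($protein) = $rec =~ m/\[(protein=.+?)\]/;
my ($id)      = $rec =~ m/\[(protein_id=.+?)\]/;

print join("\t", $gene, $protein, $id), "\n";
```

Because "protein=" and "protein_id=" differ right after the word protein, the two patterns cannot pick up each other's tag.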

If you want to read many files, one at a time, you can assign the list of input file names into the @ARGV array and then read from the empty diamond operator.

#!/usr/bin/perl
use strict;
use warnings;

#Assign any number of file names to Perl's special @ARGV array
@ARGV = qw(file1.txt file2.txt file3.txt);#Example with 3 file names

#Each file automatically opens and closes as script reads the contents
while(my $rec = <>){
    chomp($rec);
    print "$rec\n";
}

You can get the same result by using a glob pattern to put the list of desired files into @ARGV, which may be easier than typing 43 file names.

#!/usr/bin/perl
use strict;
use warnings;

@ARGV = <file?.txt>; # Glob list of files matching the pattern. ? stands for any single character

#Each file automatically opens and closes as script reads the contents
while(my $rec = <>){
    chomp($rec);
    print "$rec\n";
}
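
When many files go through the diamond operator, it can help to know which file each record came from. Perl keeps the name of the file currently being read in $ARGV. The sketch below assumes the data files can be selected with a glob pattern. file*.txt is only a placeholder, and note that * matches any number of characters while ? matches exactly one, so file?.txt would miss a name like file10.txt. count_records is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: counts the header lines (those starting with '>') per file.
# Returns a hash reference mapping file name to record count.
sub count_records {
    my @files = @_;
    my %count;
    return \%count unless @files;    # with an empty @ARGV, <> would read STDIN instead
    local @ARGV = @files;
    while (my $rec = <>) {
        $count{$ARGV}++ if $rec =~ m/^>/;    # $ARGV holds the name of the file being read
    }
    return \%count;
}

my $count = count_records(glob 'file*.txt');    # placeholder pattern
print "$_\t$count->{$_}\n" for sort keys %$count;
```

The same loop body could hold the header parsing shown earlier in the thread, with $ARGV printed as an extra column to keep results from the 43 files apart.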

Thanks a lot d5e5! It works beautifully!
