Hello,

Can someone tell me the equivalent Perl script or command for the following Unix command:

sort -t"|" -k1,1 -T '/temp' input.txt > output.txt

Here, I want to specify a different physical directory for temporary sort file storage, like -T in the Unix shell command. In other words, how do I specify a different workspace directory for the Perl sort command?

Thanks!

As far as I know, Perl's built-in sort function does all its sorting in memory, and you cannot specify a temporary directory. If your input file is too large to sort in memory, you may consider using a module such as Sort::External. You'll find the docs mention a working_dir parameter.
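
For input that does fit in memory, a rough core-Perl counterpart of sort -t"|" -k1,1 could look like the sketch below. The helper name sort_by_first_field and the sample records are made up for illustration, and cmp compares raw character codes, so the order may differ from a locale-aware Unix sort.

```perl
use strict;
use warnings;

# Hypothetical helper: sort pipe-delimited records by their first field,
# entirely in memory. Each record is paired with its key once, so the
# split is not repeated on every comparison.
sub sort_by_first_field {
    my @records = @_;
    return map  { $_->[1] }                          # keep only the record
           sort { $a->[0] cmp $b->[0] }              # compare the keys as strings
           map  { [ (split /\|/, $_, 2)[0], $_ ] }   # [first field, record]
           @records;
}

# Made-up sample data
my @sorted = sort_by_first_field('b|2', 'a|3', 'c|1');
print "$_\n" for @sorted;
```

This only works while the whole file fits in RAM, which is why a module with a working directory is suggested for the large-file case.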

Thanks for the reply. Yes, I need to sort a very large file, around 5-10 GB. I have gone through the documentation for Sort::External, but I'm a bit confused about how to use it, since I need to sort the file on a unique key column (the 3rd column). Can you please help me write a Perl script that uses Sort::External and sorts on the 3rd column's field?

I haven't used Sort::External before. I tried altering my previous example of sorting to use Sort::External as follows and it runs OK for me. Of course my example sorts a very small amount of data. It should also work for your large files but I have no idea how long it will take to run.

#!/usr/bin/perl
use strict;
use warnings;
use Sort::External;

my $sortscheme = sub {
                    my @flds_a = split(/\|/, $Sort::External::a);
                    my @flds_b = split(/\|/, $Sort::External::b);
                    $flds_a[2] cmp $flds_b[2]; #compare key fields to sort
                    };

my $temp_directory = '/home/david/temp';

my $sortex = Sort::External->new(   sortsub         => $sortscheme,
                                    working_dir     => $temp_directory, );

while (<DATA>) {
    chomp;
    $sortex->feed($_);
}

$sortex->finish;

while ( defined( $_ = $sortex->fetch ) ) {
    print "$_\n";
}

__DATA__
1780438|20110709|0000007704000000000000004888|7704|48881|PE|08/12/2008 11:38:54|0|1000.00
1780437|20110708|0000007704000000000000004882|7704|48882|PE|08/12/2008 11:38:54|0|1000.00
1780436|20110707|0000007704000000000000004889|7704|48887|PE|08/11/2008 11:38:54|0|1000.00
1780435|20110703|0000007704000000000000004881|7704|48888|PE|08/12/2008 11:38:54|0|1000.00

If you find that the above takes too long to sort, you may want to have a look at Sort::External::Cookbook, which suggests transforming your data before sorting in order to avoid using the $sortscheme block of code. The GRT (Guttman-Rosler Transform) way described in the Cookbook looks harder to understand than using the sortsub parameter (as in the above script), but the GRT way is supposed to run faster.

If you have fixed-width data records, so that you know your third column always starts at a specified position, you can extract the column you want to sort by with the substr function instead of splitting the record and assigning the result to an array. By prepending the extracted data to each record you can sort the records as plain strings instead of having to provide a block of code for the sorting logic. The following should run faster for very large input files than the previous script that uses a sortsub.

#!/usr/bin/perl
use strict;
use warnings;
use Sort::External;

my $temp_directory = '/home/david/temp';

my $sortex = Sort::External->new(working_dir => $temp_directory, );

while (<DATA>) {
    chomp;
    #Encode by extracting the third column (assuming it starts at offset 17,
    #counting from 0, and is 28 characters wide) and concatenate it to the
    #start of the record.
    $sortex->feed(substr($_,17,28) . $_);
}

$sortex->finish;

while ( defined( $_ = $sortex->fetch ) ) {
    #Decode
    print substr($_, 28), "\n"; #Remove the extra copy of data from start of record
}

__DATA__
1780438|20110709|0000007704000000000000004888|7704|48881|PE|08/12/2008 11:38:54|0|1000.00
1780437|20110708|0000007704000000000000004882|7704|48882|PE|08/12/2008 11:38:54|0|1000.00
1780436|20110707|0000007704000000000000004889|7704|48887|PE|08/11/2008 11:38:54|0|1000.00
1780435|20110703|0000007704000000000000004881|7704|48888|PE|08/12/2008 11:38:54|0|1000.00
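
To double-check the offset and width hard-coded in a script like the one above, a small helper along these lines may be handy. The name third_column_span is made up for illustration, and the sketch assumes every record contains at least three '|' characters.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based offset and the width of the
# third pipe-delimited column. Assumes at least three '|' in the record.
sub third_column_span {
    my ($record) = @_;
    my $first  = index($record, '|');               # end of column 1
    my $second = index($record, '|', $first + 1);   # end of column 2
    my $third  = index($record, '|', $second + 1);  # end of column 3
    return ($second + 1, $third - $second - 1);
}

my $sample = '1780438|20110709|0000007704000000000000004888|7704|48881|PE|08/12/2008 11:38:54|0|1000.00';
my ($offset, $width) = third_column_span($sample);
print "offset=$offset width=$width\n";   # prints "offset=17 width=28"
```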

Thanks for this more efficient solution. My data records are fixed-width up to the second field, so the third column is sure to always start at position 17. However, the third column's width is not fixed; it can vary. So I guess we can use the following syntax here:
$sortex->feed(substr($_,17,index(substr($_,17),'|')) . $_);
instead of ---
$sortex->feed(substr($_,17,28) . $_);

For decode step ->
print substr($_, index(substr($_,17),'|'))), "\n";

Please confirm whether I'm heading in the right direction. Thanks again!

The encode step should work OK. The decode step will not work as written, though: it has an extra closing parenthesis, and once the key has been prepended to the record, index(substr($_,17),'|') no longer gives the length of the prefix. I think that for variable-width columns I would just do a split, put a delimiter between the sort prefix and the record for the encode step, and then remove everything from the start of the record up to the first occurrence of your delimiter for the decode step. The biggest gain in efficiency comes from eliminating the block of code that the sort would have to call, so how you extract the column to place at the start of the record doesn't affect the efficiency much, and there are several ways of doing it.
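
A minimal sketch of that delimiter idea follows. The names encode_record and decode_record are hypothetical, the sample records are made up, and the sketch assumes a NUL byte never occurs in the data. Perl's built-in sort stands in for Sort::External here so the example runs without the module; with the module, the encoded strings would be passed to feed and the fetched strings decoded.

```perl
use strict;
use warnings;

# Assumption: "\0" never appears in the data. It also sorts below every
# printable character, so a short key sorts before a longer key that
# starts with the same characters.
my $SEP = "\0";

# Hypothetical helper: prefix the record with its third column and $SEP.
# Assumes the record has at least three pipe-delimited fields.
sub encode_record {
    my ($record) = @_;
    my $key = (split /\|/, $record)[2];
    return $key . $SEP . $record;
}

# Hypothetical helper: drop everything up to and including the first $SEP.
sub decode_record {
    my ($encoded) = @_;
    return substr($encoded, index($encoded, $SEP) + 1);
}

# Made-up sample records with third columns of different widths
my @records = ('1|x|b|r1', '2|x|ab|r2', '3|x|abc|r3');
my @encoded = map { encode_record($_) } @records;
my @sorted  = map { decode_record($_) } sort @encoded;
print "$_\n" for @sorted;
```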

One question: since the third column can vary in width, does it always contain only numeric digits? If so, do you want the data compared as numbers or as strings of characters? In other words, compared with a value such as '001', should '000009' be considered bigger (because it's a bigger number) or smaller (because its first three characters are '000', while the first three characters of the other string are '001')? If you want to compare the variable-width data numerically, then you will need to make it fixed-width by adding leading zeroes.
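
The difference can be seen in a small sketch. The helper pad_key is hypothetical; it pads on the string level rather than with a numeric format, on the assumption that keys may have more digits than a native integer can hold.

```perl
use strict;
use warnings;

# Hypothetical helper: left-pad a string of digits with zeroes to $width.
# Strings already at least $width long are returned unchanged.
sub pad_key {
    my ($digits, $width) = @_;
    my $missing = $width - length $digits;
    return $missing > 0 ? ('0' x $missing) . $digits : $digits;
}

my ($x, $y) = ('000009', '001');

# As raw strings, '000009' sorts before '001' (cmp returns -1).
print "unpadded: ", ($x cmp $y), "\n";

# Padded to the same width, '000009' sorts after '000001' (cmp returns 1),
# which matches the numeric order.
print "padded:   ", (pad_key($x, 6) cmp pad_key($y, 6)), "\n";
```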

Yes, the third column could be a number, characters, or a combination of both. But it will have a fixed width, with leading zeroes for numbers and leading spaces for characters.

Also, while executing the above Perl script, I'm getting the following error message. Is this a Perl or module installation issue? Please suggest how to resolve it.

Error message -
Can't locate Sort/External.pm in @INC (@INC contains: /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi

Yes, you need to install the Sort::External module. Install it in the usual way that you install Perl modules on your system. There are many ways to install modules. I use App::cpanminus to install Perl modules but of course I had to install App::cpanminus before using it the first time to install other Perl modules.
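
With App::cpanminus in place, the install is typically a single command, cpanm Sort::External. To confirm afterwards that Perl can find the module in @INC, a check along these lines could be used; the helper name module_available is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the named module can be loaded from @INC.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Sort::External -> Sort/External.pm
    return eval { require $file; 1 } ? 1 : 0;
}

print "Sort::External is ",
      ( module_available('Sort::External') ? "installed" : "missing" ), "\n";
```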

Since you say your third column will already be padded with leading zeroes when numeric, you should be able to use the last script I posted above as long as you know what the maximum width of that column will be. If you know that the desired column starts at position 17 and has a maximum width of 28, then using substr to take your sorting data should work just fine. For data shorter than 28 it rarely matters if you take extra characters on the right side, because the characters on the left side are the most significant for the string comparison done during the sorting. The one exception is a key that is a prefix of another key, where the '|' that follows the shorter key takes part in the comparison. Your only remaining issue is installing the Sort::External module.

Edited 4 Years Ago by d5e5: n/a
