[SOLVED] [Perl] Failing to sort a file with 300,000 lines by multiple columns

Kunsheng · 11-10-2009, 08:46 PM

Each line of the file I am sorting is in the following format:

<url> <month> <day>

For example:

http://www.google.com 10 3

I wrote the following to sort:

Code:

#!/usr/bin/perl

$in = shift;

chomp($in);

open(INFILE, "<$in") || die "Can't open file";
chomp(@fields = <INFILE>); # slurp in the file
close(INFILE);


@sorted =
map { $_->[0] }
sort { $a->[2] <=> $b->[2] || #sort by date
$a->[1] <=> $b->[1] # sort by month
}
map { [ $_ , (split /\s+/) ] } @fields;#tab separated fields



foreach (@sorted)
{
    print "$_\n";
}

The script worked fine on my small test files, but failed on my real input file, which is 18MB and contains more than 300,000 lines.

The output contains some out-of-order lines like this:

-----------------
url_one 10 1
url_two 10 1
url_three 10 3
url_four 10 1

----------------

Is that because my file is too big for Perl to handle?

Any ideas are much appreciated.

-Thanks,

-Kun

nadroj · 11-10-2009, 11:18 PM

Perl is designed to handle and process large amounts of data, so I imagine your 18MB file isn't "breaking" Perl. Actually, I'd bet money it isn't, because I've written Perl scripts to process large files (GBs in size).

Quote:

The script worked fine for my small testing files, but failed in my input file.

What does "fail" mean? Does it print anything when it fails? Add simple print statements to "debug" the code, to see how far it gets before it fails. Can you use a subset of the 18MB file to verify it works? That is, make a copy of the file, then cut it down to, say, 10,000, 50,000 or 100,000 lines, to find the threshold at which it stops working correctly. You said it works on "small" files, but, again, we don't know what "small" means. Have you already tried this technique?
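
One way to run this kind of check, sketched under the assumption that every line is "<url> <month> <day>" separated by whitespace: scan the script's output and report the first line whose (month, day) pair is smaller than the previous line's. The sub name first_out_of_order is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the index of the first line whose
# (month, day) key is smaller than the previous line's key, or -1
# if the lines are already in ascending order.
sub first_out_of_order {
    my @lines = @_;
    my @prev;
    for my $i (0 .. $#lines) {
        # Assumes "<url> <month> <day>" separated by whitespace.
        my (undef, $month, $day) = split ' ', $lines[$i];
        if (@prev && ($month <=> $prev[0] || $day <=> $prev[1]) < 0) {
            return $i;
        }
        @prev = ($month, $day);
    }
    return -1;
}

# The four sample lines from the first post; the fourth is out of place.
my @sample = ('url_one 10 1', 'url_two 10 1', 'url_three 10 3', 'url_four 10 1');
print first_out_of_order(@sample), "\n";   # prints 3
```

Running it on the output of each cut-down copy shows whether that copy came out sorted, and if not, where the first misplaced line is.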

ghostdog74 · 11-10-2009, 11:40 PM

@OP, if it's not a must to use Perl, you can try the normal GNU sort command.

Kunsheng · 11-11-2009, 12:43 PM

I just found out the program does not work on new small files either (one with 100 lines that contains real URLs). My previous test files used a simple string in place of the URL, and the program worked on those files of 10-20 lines.

So is there a problem in my code?

Kunsheng · 11-11-2009, 01:18 PM

I tried 'sort +1 +2 [my_file]' but it reports 'sort: open failed: +1: No such file or directory'.

'sort -k2 -k3 [my_file]' does run, but the result is not correct either.

Yet I do have three columns in the file (separated by ' ').
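
A likely reason the 'sort -k2 -k3' result looks wrong, offered as a guess since that output is not shown here: without a numeric flag, sort compares keys as text, so "10" sorts before "9". With GNU sort, something like 'sort -k2,2n -k3,3n [my_file]' would be the numeric equivalent. The same effect can be reproduced in Perl by sorting the month column with cmp (string comparison) and with <=> (numeric comparison). The month values below are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up month values, as they would appear in the second column.
my @months = (9, 10, 8, 11);

# String comparison: "10" and "11" sort before "8" and "9".
my @as_text    = sort { $a cmp $b } @months;

# Numeric comparison: the order a person would expect.
my @as_numbers = sort { $a <=> $b } @months;

print "text:    @as_text\n";     # text:    10 11 8 9
print "numeric: @as_numbers\n";  # numeric: 8 9 10 11
```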

ghostdog74 · 11-11-2009, 05:36 PM

Show a few more input samples, and show your desired output when sorted.

Telemachos · 11-11-2009, 07:20 PM

Rather than slurp the file, I would read it in line by line, build up the @fields array a line at a time and then sort it. 18MB is not the end of the world, but as filesize grows, it gets less and less smart to slurp.

Kunsheng · 11-12-2009, 10:59 AM

A piece of the file is shown below (it has already been sorted once by the program).

The input and output files have the same content but in a different order.

The last two fields are the month and the date, as numbers.

Basically I want to sort by month and then by date. My program seems to succeed in sorting by month, but not always by date.

Code:

http://www.amazon.com/Lawn-Garden-Tools-Hardware/b/ref=sa_menu_outequip11/191-6429805-3838363 9 5
http://www.amazon.com/gp/help/customer/display.html/ref=hy_f_3/191-6429805-3838363 9 5
http://www.amazon.com/Kindle-Accessories/b/ref=sa_menu_kacces3/191-6429805-3838363 9 5
http://www.amazon.com/b/ref=amb_link_7395972_64/191-6429805-3838363 9 6
http://www.amazon.com/b/ref=sc_bm_br_16310101_1_84/191-6429805-3838363 9 6
http://www.amazon.com/b/ref=sc_bm_br_3386071_17_mo_1/191-6429805-3838363 9 6
http://www.amazon.com/s/ref=amb_link_5620242_2/191-6429805-3838363 10 4
http://www.amazon.com/b/ref=amb_link_84762451_4/191-6429805-3838363 10 3

Telemachos · 11-13-2009, 07:38 AM

You want a Schwartzian transform, I think. Here's how I might do it:

An input file (out of order):

Code:

http://www.amazon.com/b/ref=amb_link_7395972_64/191-6429805-3838363 9 6
http://www.amazon.com/Lawn-Garden-Tools-Hardware/b/ref=sa_menu_outequip11/191-6429805-3838363 9 5
http://www.amazon.com/b/ref=sc_bm_br_16310101_1_84/191-6429805-3838363 9 6
http://www.amazon.com/gp/help/customer/display.html/ref=hy_f_3/191-6429805-3838363 9 5
http://www.amazon.com/b/ref=sc_bm_br_3386071_17_mo_1/191-6429805-3838363 9 6
http://www.amazon.com/s/ref=amb_link_5620242_2/191-6429805-3838363 10 4
http://www.amazon.com/Kindle-Accessories/b/ref=sa_menu_kacces3/191-6429805-3838363 9 5
http://www.amazon.com/b/ref=amb_link_84762451_4/191-6429805-3838363 10 3
http://www.amazon.com/b/ref=amb_link_84762451_4/191-6429805-3838363 8 3

Parser:

Code:

#!/usr/bin/env perl
use strict;
use warnings;

my @records;

while (<>) {
    chomp;
    push @records, $_;
}

my @sorted_records =    map  { $_->[0] }
                        sort { $a->[1][0] <=> $b->[1][0]
                                          ||
                               $a->[1][1] <=> $b->[1][1] }
                        map  { [$_, [(split / /, $_)[1,2]]] } @records;

print "$_\n" for @sorted_records;

Output:

Code:

hektor ~ $ perl parser file.txt
http://www.amazon.com/b/ref=amb_link_84762451_4/191-6429805-3838363 8 3
http://www.amazon.com/Lawn-Garden-Tools-Hardware/b/ref=sa_menu_outequip11/191-6429805-3838363 9 5
http://www.amazon.com/gp/help/customer/display.html/ref=hy_f_3/191-6429805-3838363 9 5
http://www.amazon.com/Kindle-Accessories/b/ref=sa_menu_kacces3/191-6429805-3838363 9 5
http://www.amazon.com/b/ref=amb_link_7395972_64/191-6429805-3838363 9 6
http://www.amazon.com/b/ref=sc_bm_br_16310101_1_84/191-6429805-3838363 9 6
http://www.amazon.com/b/ref=sc_bm_br_3386071_17_mo_1/191-6429805-3838363 9 6
http://www.amazon.com/b/ref=amb_link_84762451_4/191-6429805-3838363 10 3
http://www.amazon.com/s/ref=amb_link_5620242_2/191-6429805-3838363 10 4

Kunsheng · 11-13-2009, 02:24 PM

Thanks a lot, Telemachos! It works like a charm!

Could you possibly give me some explanation of it (or of what was wrong with my previous program)? The previous program was presented as a working example on many sites, although it is not. They also said it was using a 'Schwartzian transform'.

Telemachos · 11-13-2009, 06:41 PM

I have to be honest, I came to this post a bit late and didn't even look at the first Perl version you had posted. (I knew it wasn't working, so I wrote something that I thought did work.) So I didn't see that there was in fact a Schwartzian transform there.

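There were two basic problems with the script you started with. First, it was sorting by elements 2 and 1 of each record, even though element 1 was the URL (element 0 is the whole line, 2 the month and 3 the day). Second, going by its comments, it was meant to sort by date before month, the reverse of what you want. In practice the day (element 3) was never compared at all, which matches what you saw: months in order, days not. My guess is that the script you found was written for similar, but not identical, records.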
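
To make the index problem concrete: in the first script, [ $_ , (split /\s+/) ] builds a four-element array, so element 0 is the whole line, 1 the URL, 2 the month and 3 the day. Here is a minimal sketch of that same transform with only the sort keys moved to elements 2 and 3. The sub name sort_by_month_day and the sample lines are illustrative, not taken from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the first script's transform, with the
# comparison moved to the month (element 2) and the day (element 3).
sub sort_by_month_day {
    return map  { $_->[0] }                      # keep only the whole line
           sort { $a->[2] <=> $b->[2]            # month first
                      ||
                  $a->[3] <=> $b->[3] }          # then day
           map  { [ $_, split(/\s+/, $_) ] } @_; # [line, url, month, day]
}

# Made-up sample records in "<url> <month> <day>" form.
my @lines = ('url_a 10 4', 'url_b 9 6', 'url_c 10 3', 'url_d 9 5');
print "$_\n" for sort_by_month_day(@lines);
```

For the four sample lines this prints url_d, url_b, url_c, url_a, that is, month 9 before month 10, with the days ascending within each month.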

As for what my script does, here's the breakdown. I put the explanations into comments in the file. Hope this helps:

Code:

#!/usr/bin/env perl
use strict;
use warnings;

# create a @records array to hold the lines from the file
my @records;

# run through the file line by line, remove the newline from each line and then
# stuff the line into the @records array
while (<>) {
    chomp;
    push @records, $_;
}

# the complicated bit: you have to read this backwards - the first thing that
# happens is the map on the last line of this block; that map creates an anonymous array,
# made up of the whole line (the 0th element) and a sub anonymous array (the 1st element);
# the sub-array has the month (the 0th element) and the date (the 1st element);
# next, we do the sort: the sort compares the months first; if the months are equal, we
# fall past the or (||) and compare the dates; the array that comes through there is sorted,
# and the final map selects and passes forward only the original line (the 0th element), now
# in sorted order
#
# whew
my @sorted_records =    map  { $_->[0] }
                        sort { $a->[1][0] <=> $b->[1][0]
                                          ||
                               $a->[1][1] <=> $b->[1][1] }
                        map  { [$_, [(split / /, $_)[1,2]]] } @records;

# print the sorted records
print "$_\n" for @sorted_records;