LinuxQuestions.org
Old 11-10-2009, 08:46 PM   #1
Kunsheng
Member
 
Registered: Mar 2009
Posts: 82

Rep: Reputation: 16
[Perl] fail to sort a file with 300,000 lines by multiple columns


Each line of the file I am sorting is in the following format:

<url> <month> <day>

For example:


http://www.google.com 10 3

I wrote the following to sort:

Code:
#!/usr/bin/perl

$in = shift;

chomp($in);

open(INFILE, "<$in") || die "Can't open file";
chomp(@fields = <INFILE>); # slurp in the file
close(INFILE);


@sorted =
map { $_->[0] }
sort { $a->[2] <=> $b->[2] || #sort by date
$a->[1] <=> $b->[1] # sort by month
}
map { [ $_ , (split /\s+/) ] } @fields;#tab separated fields



foreach (@sorted)
{
    print "$_\n";
}
The script worked fine for my small testing files, but failed in my input file. The input file is 18MB and contains more than 300,000 lines.

The output contains some lines like this:

-----------------
url_one 10 1
url_two 10 1
url_three 10 3
url_four 10 1

----------------


Is that because my file is too big for Perl to handle?

Any ideas are much appreciated,

-Thanks,

-Kun
 
Old 11-10-2009, 11:18 PM   #2
nadroj
Senior Member
 
Registered: Jan 2005
Location: Canada
Distribution: ubuntu
Posts: 2,539

Rep: Reputation: 60
Perl is designed to handle and process large amounts of data, so I imagine your 18MB file isn't "breaking" Perl. Actually, I'd bet money it isn't, because I've written Perl scripts to process large files (GBs in size).

Quote:
The script worked fine for my small testing files, but failed in my input file.
What does "fail" mean? Does it print anything when it fails? Add simple print statements to "debug" the code, to see how far it gets before it fails. Can you use a subset of the 18MB file to verify it works? That is, make a copy of the file, then cut it down to, say, 10,000, 50,000, 100,000, etc. lines, to see at what point it stops working correctly (the threshold). You said it works on "small" files, but, again, we don't know what "small" means. Have you already tried this technique?
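
One way to make "fail" concrete is to check the output mechanically instead of by eye. The following is only a minimal sketch, assuming every line has the form `<url> <month> <day>` separated by whitespace; the helper name `first_out_of_order` is hypothetical and not part of the original script. It reports the first line that breaks month-then-day order, which also makes it easy to test progressively larger subsets of the file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the 1-based number of the
# first line that breaks month-then-day order, or 0 if the lines are in order.
# Assumes each line is "<url> <month> <day>", separated by whitespace.
sub first_out_of_order {
    my @lines = @_;
    my ($prev_month, $prev_day);
    for my $i (0 .. $#lines) {
        my (undef, $month, $day) = split ' ', $lines[$i];
        if (defined $prev_month) {
            # compare the previous line's keys with this line's keys
            my $cmp = $prev_month <=> $month || $prev_day <=> $day;
            return $i + 1 if $cmp > 0;
        }
        ($prev_month, $prev_day) = ($month, $day);
    }
    return 0;
}

# The sample output from the first post: line 4 goes back from day 3 to day 1.
my @sample = (
    'url_one 10 1',
    'url_two 10 1',
    'url_three 10 3',
    'url_four 10 1',
);
print first_out_of_order(@sample), "\n";    # prints 4
```

A result of 0 on a subset means that subset came out in order; any other value points at the first offending line.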
 
Old 11-10-2009, 11:40 PM   #3
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
@OP, if it's not a must to use Perl, you can try using the normal GNU sort command.
 
Old 11-11-2009, 12:43 PM   #4
Kunsheng
Member
 
Registered: Mar 2009
Posts: 82

Original Poster
Rep: Reputation: 16
I just found out that the program also fails on new small files (one with 100 lines containing real URLs). My previous test files used a simple string in place of the URL, and the program worked on those files of 10-20 lines.

So is there a problem in the code itself?


Quote:
Originally Posted by nadroj
Perl is designed to handle and process large amounts of data, so I imagine your 18MB file isn't "breaking" Perl. Actually, I'd bet money it isn't, because I've written Perl scripts to process large files (GBs in size).

What does "fail" mean? Does it print anything when it fails? Add simple print statements to "debug" the code, to see how far it gets before it fails. Can you use a subset of the 18MB file to verify it works? That is, make a copy of the file, then cut it down to, say, 10,000, 50,000, 100,000, etc. lines, to see at what point it stops working correctly (the threshold). You said it works on "small" files, but, again, we don't know what "small" means. Have you already tried this technique?
 
Old 11-11-2009, 01:18 PM   #5
Kunsheng
Member
 
Registered: Mar 2009
Posts: 82

Original Poster
Rep: Reputation: 16
I tried 'sort +1 +2 [my_file]' but it reports 'sort: open failed: +1: No such file or directory'.

'sort -k2 -k3 [my_file]' runs, but the result is not correct either.

Yet I do have three columns in the file (separated by ' ').

Quote:
Originally Posted by ghostdog74
@OP, if it's not a must to use Perl, you can try using the normal GNU sort command.
 
Old 11-11-2009, 05:36 PM   #6
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Show a few more input samples, and show your desired output when sorted.
 
Old 11-11-2009, 07:20 PM   #7
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 60
Rather than slurp the file, I would read it in line by line, build up the @fields array a line at a time and then sort it. 18MB is not the end of the world, but as filesize grows, it gets less and less smart to slurp.
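
A minimal sketch of that line-by-line approach, under the assumption that each line is `<url> <month> <day>` separated by whitespace. The helper name `read_records` and the `www.example.com` URL are made up for illustration. Note that every record still ends up in memory, because a sort needs all of them; reading a line at a time only avoids building the whole list in one go.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: reads "<url> <month> <day>" lines from a filehandle one
# at a time and returns a list of [line, month, day] records.
sub read_records {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;              # skip empty lines
        my (undef, $month, $day) = split ' ', $line;
        push @records, [$line, $month, $day];
    }
    return @records;
}

# Demo on an in-memory filehandle so the sketch runs without an input file.
my $data = "http://www.google.com 10 3\nhttp://www.example.com 9 6\n";
open my $fh, '<', \$data or die "Can't open in-memory file: $!";
my @records = read_records($fh);
close $fh;

print scalar(@records), " records read\n";
```

Because the month and day are already split out while reading, a later sort can compare `$a->[1]` and `$a->[2]` directly.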
 
Old 11-12-2009, 10:59 AM   #8
Kunsheng
Member
 
Registered: Mar 2009
Posts: 82

Original Poster
Rep: Reputation: 16
A piece of the file is shown below (it has already been sorted once by the program):

The input and output files have the same content but in a different order.

The last two fields are the month and the date, both as numbers.

Basically I want to sort the lines by month, then by date. My program seems to sort the months correctly, but not all of the dates.

Code:
http://www.amazon.com/Lawn-Garden-Tools-Hardware/b/ref=sa_menu_outequip11/191-6429805-3838363 9 5
http://www.amazon.com/gp/help/customer/display.html/ref=hy_f_3/191-6429805-3838363 9 5
http://www.amazon.com/Kindle-Accessories/b/ref=sa_menu_kacces3/191-6429805-3838363 9 5
http://www.amazon.com/b/ref=amb_link_7395972_64/191-6429805-3838363 9 6
http://www.amazon.com/b/ref=sc_bm_br_16310101_1_84/191-6429805-3838363 9 6
http://www.amazon.com/b/ref=sc_bm_br_3386071_17_mo_1/191-6429805-3838363 9 6
http://www.amazon.com/s/ref=amb_link_5620242_2/191-6429805-3838363 10 4
http://www.amazon.com/b/ref=amb_link_84762451_4/191-6429805-3838363 10 3

Last edited by Kunsheng; 11-12-2009 at 11:06 AM.
 
Old 11-13-2009, 07:38 AM   #9
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 60
You want a Schwartzian transform, I think. Here's how I might do it:

An input file (out of order):
Code:
http://www.amazon.com/b/ref=amb_link_7395972_64/191-6429805-3838363 9 6
http://www.amazon.com/Lawn-Garden-Tools-Hardware/b/ref=sa_menu_outequip11/191-6429805-3838363 9 5
http://www.amazon.com/b/ref=sc_bm_br_16310101_1_84/191-6429805-3838363 9 6
http://www.amazon.com/gp/help/customer/display.html/ref=hy_f_3/191-6429805-3838363 9 5
http://www.amazon.com/b/ref=sc_bm_br_3386071_17_mo_1/191-6429805-3838363 9 6
http://www.amazon.com/s/ref=amb_link_5620242_2/191-6429805-3838363 10 4
http://www.amazon.com/Kindle-Accessories/b/ref=sa_menu_kacces3/191-6429805-3838363 9 5
http://www.amazon.com/b/ref=amb_link_84762451_4/191-6429805-3838363 10 3
http://www.amazon.com/b/ref=amb_link_84762451_4/191-6429805-3838363 8 3
Parser:
Code:
#!/usr/bin/env perl
use strict;
use warnings;

my @records;

while (<>) {
    chomp;
    push @records, $_;
}

my @sorted_records =    map  { $_->[0] }
                        sort { $a->[1][0] <=> $b->[1][0]
                                          ||
                               $a->[1][1] <=> $b->[1][1] }
                        map  { [$_, [(split / /, $_)[1,2]]] } @records;

print "$_\n" for @sorted_records;
Output:
Code:
hektor ~ $ perl parser file.txt
http://www.amazon.com/b/ref=amb_link_84762451_4/191-6429805-3838363 8 3
http://www.amazon.com/Lawn-Garden-Tools-Hardware/b/ref=sa_menu_outequip11/191-6429805-3838363 9 5
http://www.amazon.com/gp/help/customer/display.html/ref=hy_f_3/191-6429805-3838363 9 5
http://www.amazon.com/Kindle-Accessories/b/ref=sa_menu_kacces3/191-6429805-3838363 9 5
http://www.amazon.com/b/ref=amb_link_7395972_64/191-6429805-3838363 9 6
http://www.amazon.com/b/ref=sc_bm_br_16310101_1_84/191-6429805-3838363 9 6
http://www.amazon.com/b/ref=sc_bm_br_3386071_17_mo_1/191-6429805-3838363 9 6
http://www.amazon.com/b/ref=amb_link_84762451_4/191-6429805-3838363 10 3
http://www.amazon.com/s/ref=amb_link_5620242_2/191-6429805-3838363 10 4
 
Old 11-13-2009, 02:24 PM   #10
Kunsheng
Member
 
Registered: Mar 2009
Posts: 82

Original Poster
Rep: Reputation: 16
Thanks a lot, Telemachos! It works like a charm!

Could you give me some explanation of it (or of what was wrong with my previous program)? The previous program was presented as a working example on many sites, although it does not work here. Those sites also said that it uses a 'Schwartzian transform'.

Last edited by Kunsheng; 11-13-2009 at 02:28 PM.
 
Old 11-13-2009, 06:41 PM   #11
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 60
I have to be honest, I came to this post a bit late and didn't even look at the first Perl version you had posted. (I knew it wasn't working, so I wrote something that I thought did work.) So I didn't see that there was in fact a Schwartzian transform there.

There were two basic problems with the script you started with: first, it was sorting by fields 2 and 1, even though field 1 was the URL; second, it was trying to sort by date before month. My guess is that the script you found was written for similar, but not identical, records.
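
To make the index problem concrete, here is a sketch of the original transform with only the indices changed. In the first script the inner map builds `[line, url, month, day]`, so the month is index 2 and the date is index 3; index 1 is the URL, which compares as 0 under `<=>`, so the second comparison never broke a tie. The wrapper name `sort_by_month_day` is hypothetical, and `split ' '` is used here in place of the original `split /\s+/`.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sketch: the original transform with corrected indices.
# After the inner map each record is [line, url, month, day],
# so month is index 2 and date is index 3.
sub sort_by_month_day {
    my @lines = @_;
    return map  { $_->[0] }
           sort { $a->[2] <=> $b->[2]      # month first
                               ||
                  $a->[3] <=> $b->[3] }    # then date
           map  { [ $_, split ' ', $_ ] } @lines;
}

print "$_\n" for sort_by_month_day(
    'url_one 10 1',
    'url_two 10 1',
    'url_three 10 3',
    'url_four 10 1',
    'url_five 9 6',
);
```

With `use warnings` enabled, the original comparison on index 1 would most likely have produced "isn't numeric" warnings for the URL field, which is one reason to keep warnings on while debugging.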

As for what my script does, here's the breakdown. I put the explanations into comments in the file. Hope this helps:
Code:
#!/usr/bin/env perl
use strict;
use warnings;

# create a @records array to hold the lines from the file
my @records;

# run through the file line by line, remove the newline from each line and then
# stuff the line into the @records array
while (<>) {
    chomp;
    push @records, $_;
}

# the complicated bit: you have to read this backwards - the first thing that
# happens is the map on the last line of this block; that map creates an anonymous array,
# made up of the whole line (the 0th element) and a sub anonymous array (the 1st element);
# the sub-array has the month (the 0th element) and the date (the 1st element);
# next, we do the sort: the sort compares the months first; if the months are equal, we
# fall past the or (||) and compare the dates; the array that comes through there is sorted,
# and the final map selects and passes forward only the original line (the 0th element), now
# in sorted order
#
# whew
my @sorted_records =    map  { $_->[0] }
                        sort { $a->[1][0] <=> $b->[1][0]
                                          ||
                               $a->[1][1] <=> $b->[1][1] }
                        map  { [$_, [(split / /, $_)[1,2]]] } @records;

# print the sorted records
print "$_\n" for @sorted_records;
 
  

