LinuxQuestions.org
Old 07-31-2012, 03:21 PM   #1
trifka
LQ Newbie
 
Registered: Jul 2012
Posts: 3

Rep: Reputation: Disabled
bash sort uses too much memory


I need to sort a large file (10GB) in bash. I use 'sort -S1G' to limit the buffer size to 1GB, but noticed the following strange behavior:

1. When I sort on a key column using general-numeric-sort (e.g. sort -k2,2g), sort seems to ignore the -S1G option and tries to read the whole file into memory (and eventually crashes).

2. Same thing happens if I use more than one key column (e.g. sort -k1,1n -k3,3n) and/or when I include the compress-program option.

It seems that the only time when the buffer-size option is respected is when I sort on a single column with no spec for the type (e.g. -k1,1) or a numeric sort (e.g. -k1,1n).
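For anyone who wants to try the invocations described above on something small, here is a minimal sketch. The three-column sample file, the 1M buffer and the mktemp scratch directory are stand-ins chosen for illustration, not the original 10GB data, and LC_ALL=C is an assumption so that -g parses "." as the decimal point. It only shows the option combinations; a file this small will not reproduce the memory behavior.

```shell
# Assumption: C locale, so general-numeric sort reads "2.5" and "1e2" as numbers.
export LC_ALL=C

# Small stand-in for the 10GB file: three whitespace-separated columns.
workdir=$(mktemp -d)
printf '%s\n' '3 1e2 7' '1 2.5 9' '2 -4 8' > "$workdir/sample.txt"

# Single numeric key with the buffer capped by -S (the case reported as working).
sort -S 1M -T "$workdir" -k1,1n "$workdir/sample.txt" > "$workdir/by_col1.txt"

# General-numeric key on column 2 (reported as exceeding the buffer on the big file).
sort -S 1M -T "$workdir" -k2,2g "$workdir/sample.txt" > "$workdir/by_col2.txt"

# Two numeric keys (also reported as exceeding the buffer on the big file).
sort -S 1M -T "$workdir" -k1,1n -k3,3n "$workdir/sample.txt" > "$workdir/by_col1_col3.txt"
```

Watching the resident memory of each run against the real file (for example in top) would show whether the -S limit is being honored for a given key specification.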

Am I doing something wrong? Is this expected/documented behavior of bash sort, and how can I work around it?

Thanks,
Vladimir

> uname -a
Linux anvil 2.6.32.49-0.3-default #1 SMP 2011-12-02 11:28:04 +0100 x86_64 x86_64 x86_64 GNU/Linux

> sort --version
sort (GNU coreutils) 6.12
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and Paul Eggert.
 
Old 07-31-2012, 03:49 PM   #2
alinas
Member
 
Registered: Apr 2002
Location: UK, Sywell, EGBK
Distribution: RHEL, SuSE, CentOS, Debian, Ubuntu
Posts: 60

Rep: Reputation: 20
Without analysing the details, my first observation is that your data set is very large. Have you come across named pipes? You could try the following mechanism to get the best possible throughput and avoid buffering:
- create named pipe called mypipe: mkfifo mypipe
- start (in the background) sort command on the named pipe: sort <required options> mypipe &
- write data to be sorted into the pipe: cat file > mypipe
A named pipe streams output from another process without having to write it to an intermediate file. It doesn't buffer, and it lets you cut down on physical I/O as well as the disk space a real file would take up. This is useful when the amount of data you need to process is larger than the available free disk space, and generally useful when processing large streams of data...
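The three steps above can be sketched as a single script. The three-line input file and the scratch directory are placeholders for illustration, and -n stands in for whatever sort options are actually required. Note that sort still has to consume all of its input before it can emit the first line, so the pipe changes how the data arrives rather than how much sort has to hold or spill.

```shell
# Scratch directory and a tiny input file standing in for the real data.
workdir=$(mktemp -d)
printf '%s\n' 30 4 200 > "$workdir/input.txt"

# Step 1: create the named pipe.
mkfifo "$workdir/mypipe"

# Step 2: start sort in the background, reading from the pipe.
sort -n "$workdir/mypipe" > "$workdir/sorted.txt" &
sort_pid=$!

# Step 3: write the data to be sorted into the pipe.
cat "$workdir/input.txt" > "$workdir/mypipe"

# Wait for the background sort to finish before using its output.
wait "$sort_pid"
```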
 
Old 07-31-2012, 04:00 PM   #3
whizje
Member
 
Registered: Sep 2008
Location: The Netherlands
Distribution: Slackware64 current
Posts: 594

Rep: Reputation: 141
sort writes its temporary files to the /tmp directory by default, which should have free space of about twice the size of the file you wish to sort.
 
Old 07-31-2012, 04:16 PM   #4
anomie
Senior Member
 
Registered: Nov 2004
Location: Texas
Distribution: RHEL, Scientific Linux, Debian, Fedora
Posts: 3,935
Blog Entries: 5

Rep: Reputation: Disabled
(Or specify a different/larger temporary file directory with -T.)
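As a rough illustration of the two ways to redirect sort's temporary files: the scratch directory below is created with mktemp purely as a placeholder, and in practice it would be a directory on a filesystem with enough free space.

```shell
# Hypothetical scratch directory; substitute a location with enough room.
scratch=$(mktemp -d)
printf '%s\n' pear apple fig > "$scratch/input.txt"

# Free space, in 1K blocks, on the filesystem that holds the scratch directory.
free_kb=$(df -Pk "$scratch" | awk 'NR==2 {print $4}')
echo "free space in $scratch: ${free_kb} KB"

# Option 1: point sort at the scratch directory explicitly with -T.
sort -T "$scratch" "$scratch/input.txt" > "$scratch/sorted_T.txt"

# Option 2: set TMPDIR for this one command instead.
TMPDIR="$scratch" sort "$scratch/input.txt" > "$scratch/sorted_env.txt"
```

Both forms should give identical output; they differ only in where sort places its temporary files when it needs them.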
 
Old 07-31-2012, 05:43 PM   #5
trifka
LQ Newbie
 
Registered: Jul 2012
Posts: 3

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by anomie
(Or specify a different/larger temporary file directory with -T.)
Thanks, anomie. The temp folder does not seem to be the problem: through TMPDIR I have specified a temp folder which has more than enough space. The problem seems to be with the behavior of the buffer-size option, which sets the amount of memory sort uses before it dumps to temp. For some settings (e.g. sorting on a single numeric key) sort works fine: the file gets sorted and the temp folder is used properly. For others (e.g. more than one key and/or sorting with general-numeric-sort) sort tries to read the whole file into memory and crashes, although it has plenty of temp space and the buffer size is set to only 1G. In the second case it seems that sort ignores the buffer-size option...
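One possible way to keep memory bounded regardless of how the buffer-size option behaves is to do the chunking by hand: split the file, sort each piece separately, then merge the sorted pieces with sort -m. This is only a sketch under stated assumptions, not a confirmed fix for the behavior above. The five-line file, the chunk size of 2 lines and the chunk. prefix are illustrative, and LC_ALL=C is assumed so that -g parses "." as the decimal point.

```shell
# Assumption: C locale for predictable general-numeric parsing.
export LC_ALL=C

# Tiny stand-in for the large file: a label and a general-numeric value.
workdir=$(mktemp -d)
printf '%s\n' 'a 3e0' 'b 1e1' 'c -2' 'd 0.5' 'e 7' > "$workdir/big.txt"

# Split into chunks of 2 lines (a real run would pick a size that fits in memory).
split -l 2 "$workdir/big.txt" "$workdir/chunk."

# Sort each chunk on its own, in place, so only one chunk is held at a time.
for chunk in "$workdir"/chunk.*; do
    sort -k2,2g "$chunk" -o "$chunk"
done

# Merge the already-sorted chunks; -m merges without re-sorting them.
sort -m -k2,2g "$workdir"/chunk.* > "$workdir/merged.txt"
```

The merge step must use the same key options as the per-chunk sorts, otherwise the result is not guaranteed to be ordered.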
 
Old 07-31-2012, 05:49 PM   #6
trifka
LQ Newbie
 
Registered: Jul 2012
Posts: 3

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by alinas
Without analysing the details, my first observation is that your data set is very large. Have you come across named pipes? You could try the following mechanism to get the best possible throughput and avoid buffering:
- create named pipe called mypipe: mkfifo mypipe
- start (in the background) sort command on the named pipe: sort <required options> mypipe &
- write data to be sorted into the pipe: cat file > mypipe
A named pipe streams output from another process without having to write it to an intermediate file. It doesn't buffer, and it lets you cut down on physical I/O as well as the disk space a real file would take up. This is useful when the amount of data you need to process is larger than the available free disk space, and generally useful when processing large streams of data...
Thanks, alinas. The behavior of sort with pipes (named or unnamed) seems to be the same: with some options (e.g. sorting on a single numeric key) it works fine, while for others (e.g. many keys) it ignores the buffer-size option and tries to read the whole content of the pipe into memory.
 
  


All times are GMT -5.
