LinuxQuestions.org


trifka 07-31-2012 03:21 PM

bash sort uses too much memory
 
I need to sort a large file (10GB) in bash. I use 'sort -S1G' to limit the buffer size to 1GB, but noticed the following strange behavior:

1. When I sort on a key column using general-numeric-sort (e.g. sort -k2,2g), sort seems to ignore the -S1G option and tries to read the whole file into memory (and eventually crashes).

2. Same thing happens if I use more than one key column (e.g. sort -k1,1n -k3,3n) and/or when I include the compress-program option.

It seems that the only time the buffer-size option is respected is when I sort on a single column with no type specified (e.g. -k1,1) or with a numeric sort (e.g. -k1,1n).

Am I doing something wrong? Is this expected/documented behavior of sort, and how can I work around it?

Thanks,
Vladimir

> uname -a
Linux anvil 2.6.32.49-0.3-default #1 SMP 2011-12-02 11:28:04 +0100 x86_64 x86_64 x86_64 GNU/Linux

> sort --version
sort (GNU coreutils) 6.12
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and Paul Eggert.

alinas 07-31-2012 03:49 PM

Without analysing the details, my first observation is that your data size is very large. Have you come across named pipes? You could try the following mechanism to get the best possible throughput and avoid buffering:
- create a named pipe called mypipe: mkfifo mypipe
- start the sort command on the named pipe in the background: sort <required options> mypipe &
- write the data to be sorted into the pipe: cat file > mypipe
A named pipe streams output from another process without having to write to an intermediary file. It doesn't buffer, and it lets you cut down on physical I/O as well as the disk space a real file would take up. It is useful when the amount of data that you need to process is larger than the available free disk space, and generally useful when processing large streams of data...
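
The three steps above can be put together roughly as follows. This is only a minimal sketch: the temporary directory, the tiny input.txt stand-in for the real 10GB file, and the -k1,1n key are all assumptions made for illustration.

```shell
#!/bin/sh
set -eu

# Scratch directory and file names are hypothetical.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Small stand-in for the real input file.
printf '3 c\n1 a\n2 b\n' > "$workdir/input.txt"

# Step 1: create the named pipe.
mkfifo "$workdir/mypipe"

# Step 2: start sort in the background, reading from the pipe.
sort -k1,1n "$workdir/mypipe" > "$workdir/sorted.txt" &
sort_pid=$!

# Step 3: write the data into the pipe; sort sees EOF when cat exits.
cat "$workdir/input.txt" > "$workdir/mypipe"

# Wait for the background sort to finish before reading its output.
wait "$sort_pid"
result=$(cat "$workdir/sorted.txt")
printf '%s\n' "$result"
```

Note that sort still has to see all of its input before it can write the first output line, so the pipe changes how the data arrives, not how much of it sort has to hold or spill to temp files.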

whizje 07-31-2012 04:00 PM

Sort uses the /tmp directory, which should have twice as much free space as the size of the file you wish to sort.

anomie 07-31-2012 04:16 PM

(Or specify a different/larger temporary file directory with -T.)
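
For reference, a small sketch of combining -T with the -S buffer limit from the original post. The scratch directory, the sample data and the 1M buffer size are assumptions chosen so the example runs anywhere; on the real file you would point -T at a directory on a large disk and use something like -S 1G.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch directory; in practice use a path on a disk with plenty of space.
tmpdir=$(mktemp -d)

# Tiny sample input standing in for the real file.
printf '10 x\n9 y\n100 z\n' > "$tmpdir/data.txt"

# -S caps the in-memory buffer, -T chooses where temporary files are written.
sorted=$(sort -S 1M -T "$tmpdir" -k1,1n "$tmpdir/data.txt")
printf '%s\n' "$sorted"

rm -rf "$tmpdir"
```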

trifka 07-31-2012 05:43 PM

Quote:

Originally Posted by anomie (Post 4742485)
(Or specify a different/larger temporary file directory with -T.)

Thanks, anomie. It does not seem that the problem is with the temp folder: I have specified through TMPDIR a temp folder which has more than enough space. The problem seems to be with the behavior of the buffer-size option, which sets the amount of memory sort uses before it dumps to temp. Namely, for some settings (e.g. sorting on a single numeric key) sort works fine (the file gets sorted and the temp folder is used properly), while for others (e.g. more than one key and/or sorting with general-numeric-sort) sort tries to read the whole file into memory and crashes (although it has plenty of temp space and the buffer-size is set to only 1G). In the second case it seems like sort ignores the buffer-size option...
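
One possible workaround, not proposed by anyone in this thread and not tested against the coreutils 6.12 behavior described here, is to bound memory by hand: split the file into chunks, sort each chunk separately, then combine the sorted chunks with sort -m, which only merges and does not re-sort. The sketch below assumes a tiny sample file, a chunk size of 2 lines and the -k2,2g key from the original post; all file names are hypothetical.

```shell
#!/bin/sh
set -eu

# Fixed locale so the decimal point is parsed the same way everywhere.
LC_ALL=C
export LC_ALL

work=$(mktemp -d)

# Tiny stand-in for the real 10GB file.
printf '%s\n' 'a 3e2' 'b 1.5' 'c 20' 'd 0.1' > "$work/big.txt"

# Split into chunks of 2 lines each (use a far larger -l value on real data).
split -l 2 "$work/big.txt" "$work/chunk."

# Sort every chunk on its own, so each sort only sees a small input.
for f in "$work"/chunk.*; do
    sort -k2,2g "$f" > "$f.sorted"
    rm "$f"
done

# Merge the already-sorted chunks using the same key.
merged=$(sort -m -k2,2g "$work"/chunk.*.sorted)
printf '%s\n' "$merged"

rm -rf "$work"
```

The chunk size then plays the role that -S was supposed to play: each individual sort can never hold more than one chunk in memory.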

trifka 07-31-2012 05:49 PM

Quote:

Originally Posted by alinas (Post 4742470)
Without analysing the details, my first observation is that your data size is very large. Have you come across named pipes? You could try the following mechanism to get the best possible throughput and avoid buffering:
- create a named pipe called mypipe: mkfifo mypipe
- start the sort command on the named pipe in the background: sort <required options> mypipe &
- write the data to be sorted into the pipe: cat file > mypipe
A named pipe streams output from another process without having to write to an intermediary file. It doesn't buffer, and it lets you cut down on physical I/O as well as the disk space a real file would take up. It is useful when the amount of data that you need to process is larger than the available free disk space, and generally useful when processing large streams of data...

Thanks, alinas. The behavior of sort with pipes (named or unnamed) seems to be the same: with some options (e.g. sorting on a single numeric key) it works fine, while for others (e.g. many keys) it ignores the buffer-size option and tries to read the whole content of the pipe into memory.

