bash sort uses too much memory
I need to sort a large file (10 GB) in bash. I use 'sort -S1G' to limit the buffer size to 1 GB, but I noticed the following strange behavior:
1. When I sort on a key column using general-numeric-sort (e.g. sort -k2,2g), sort seems to ignore the -S1G option and tries to read the whole file into memory (and eventually crashes).
2. The same thing happens if I use more than one key column (e.g. sort -k1,1n -k3,3n) and/or when I include the compress-program option.
It seems that the only time the buffer-size option is respected is when I sort on a single column with no type specified (e.g. -k1,1) or with a numeric sort (e.g. -k1,1n).
Am I doing something wrong? Is this expected/documented behavior of bash sort, and how can I work around it?
Thanks,
Vladimir
> uname -a
Linux anvil 2.6.32.49-0.3-default #1 SMP 2011-12-02 11:28:04 +0100 x86_64 x86_64 x86_64 GNU/Linux
> sort --version
sort (GNU coreutils) 6.12
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and Paul Eggert.

Without analysing the details, my first observation is that your data set is very large. Have you come across named pipes? You could try the following mechanism to get the best possible throughput and avoid writing an intermediate file:
- create a named pipe called mypipe: mkfifo mypipe
- start (in the background) the sort command on the named pipe: sort <required options> mypipe &
- write the data to be sorted into the pipe: cat file > mypipe
A named pipe streams output from another process without having to write it to an intermediate file. The data passes through the pipe without being stored on disk, which cuts down on physical I/O as well as the disk space a real file would take up. This is useful when the amount of data you need to process is larger than the available free disk space, and generally useful when processing large streams of data...

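The three steps above can be sketched as a small script. This is only an illustration under assumptions: the three-line input file, the working directory and the -k1,1n key are placeholders standing in for the real 10 GB file and its sort options.

```shell
#!/bin/sh
# Sketch of the named-pipe approach. All file names are placeholders.
set -e

workdir=$(mktemp -d)
# Small stand-in for the large input file.
printf '3 c\n1 a\n2 b\n' > "$workdir/input.txt"

# 1. Create the named pipe.
mkfifo "$workdir/mypipe"

# 2. Start sort in the background, reading from the pipe.
#    -k1,1n is an assumed key; use whatever options the real data needs.
sort -k1,1n "$workdir/mypipe" > "$workdir/sorted.txt" &
sort_pid=$!

# 3. Write the data into the pipe. sort sees end-of-file when cat exits.
cat "$workdir/input.txt" > "$workdir/mypipe"

# Wait for sort to finish before using the result, then clean up the pipe.
wait "$sort_pid"
rm "$workdir/mypipe"

cat "$workdir/sorted.txt"
```

Note that the pipe only removes the intermediate copy of the input. When the input is larger than its buffer, GNU sort still writes its own temporary files while sorting, so free space in the temporary directory still matters.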
Sort uses the /tmp directory for its temporary files, and that directory should have free space of about twice the size of the file you wish to sort.
(Or specify a different/larger temporary file directory with -T.)
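As a rough sketch of that suggestion, assuming GNU sort: the scratch directory, the tiny sample input and the 64M buffer below are made up for the demonstration. For the real file, the directory would be on a filesystem with enough free space and the buffer would be the 1G from the original question.

```shell
#!/bin/sh
# Sketch: point sort's temporary files at a chosen directory with -T.
# Directory names, sample data and buffer size are placeholders.
set -e

workdir=$(mktemp -d)
tmpdir="$workdir/sort-tmp"   # hypothetical scratch dir; pick a large filesystem in practice
mkdir -p "$tmpdir"

# Small stand-in for the large input; column 2 holds general-numeric values.
printf 'x 10 a\ny 2.5 b\nz 1e2 c\nw -3 d\n' > "$workdir/input.txt"

# Check how much free space the scratch directory has before sorting.
df -h "$tmpdir"

# -S caps the in-memory buffer, -T chooses where temporary files are written.
# LC_ALL=C fixes the decimal point as '.' so the result does not depend on locale.
LC_ALL=C sort -S 64M -T "$tmpdir" -k2,2g "$workdir/input.txt" > "$workdir/sorted.txt"

cat "$workdir/sorted.txt"
```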