Linux - General: This Linux forum is for general Linux questions and discussion. If it is Linux-related and doesn't seem to fit in any other forum, then this is the place.
I need to sort a large file (10 GB) in bash. I use 'sort -S1G' to limit the buffer size to 1 GB, but noticed the following strange behavior:
1. When I sort on a key column using general-numeric sort (e.g. sort -k2,2g), sort seems to ignore the -S1G option and tries to read the whole file into memory (and eventually crashes).
2. The same thing happens if I use more than one key column (e.g. sort -k1,1n -k3,3n) and/or when I include the --compress-program option.
It seems the buffer-size option is respected only when I sort on a single column with no type specifier (e.g. -k1,1) or with a plain numeric sort (e.g. -k1,1n).
Am I doing something wrong? Is this expected/documented behavior of GNU sort, and how can I work around it?
Thanks,
Vladimir
> uname -a
Linux anvil 2.6.32.49-0.3-default #1 SMP 2011-12-02 11:28:04 +0100 x86_64 x86_64 x86_64 GNU/Linux
> sort --version
sort (GNU coreutils) 6.12
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
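For reference, the invocations described above look roughly like this. The sample file and its contents are made-up illustration data standing in for the original 10 GB file, and -S1G is kept as in the question:

```shell
# Build a tiny 3-column sample file (col2 uses scientific notation,
# which is what general-numeric sort (g) is for).
printf '3 2.5e0 9\n1 1.0e1 7\n2 3.0e-1 8\n' > sample.txt

# Reported to honor -S: single key, plain numeric sort.
sort -S1G -k1,1n sample.txt

# Reported to ignore -S1G: general-numeric sort on a key column.
sort -S1G -k2,2g sample.txt

# Reported to ignore -S1G: multiple key columns.
sort -S1G -k1,1n -k3,3n sample.txt
```

On a file this small all three complete either way; the reported difference only shows up once the input is far larger than the buffer.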
Without analysing the details, my first observation is that your data size is very large. Have you come across named pipes? You could try the following mechanism to get the best possible throughput and avoid buffering:
- create a named pipe called mypipe: mkfifo mypipe
- start the sort command on the named pipe in the background: sort <required options> mypipe &
- write the data to be sorted into the pipe: cat file > mypipe
A named pipe streams output from one process to another without writing to an intermediate file. It doesn't buffer, and it cuts down on physical I/O as well as the disk space a real file would take up. This is useful when the amount of data you need to process is larger than the available free disk space, and generally useful when processing large streams of data...
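The three steps above can be sketched end to end as follows; the input file, the -n sort option, and the output name are placeholders for whatever your real data needs:

```shell
# Placeholder input data standing in for the real file.
printf '3\n1\n2\n' > file

# 1. Create the named pipe.
mkfifo mypipe

# 2. Start sort on the pipe in the background (it blocks until a writer opens the pipe).
sort -n mypipe > sorted.txt &

# 3. Feed the data into the pipe; sort consumes it as a stream.
cat file > mypipe

wait          # wait for the background sort to finish
rm mypipe     # the fifo is just a filesystem entry; clean it up
```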
(Or specify a different/larger temporary file directory with -T.)
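A minimal sketch of the -T suggestion; the directory and file names here are placeholders, not paths from the original post:

```shell
# Point sort's temporary spill files at a roomier filesystem via -T.
mkdir -p ./bigtmp
printf '2\n3\n1\n' > data.txt
sort -S1M -T ./bigtmp -n data.txt > data.sorted
```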
Thanks, anomie. The problem does not seem to be with the temp folder: via TMPDIR I have specified a temp folder with more than enough space. The problem seems to be with the behavior of the buffer-size option, which sets the amount of memory sort uses before it spills to temp files. For some settings (e.g. sorting on a single numeric key) sort works fine: the file gets sorted and the temp folder is used properly. For others (e.g. more than one key and/or general-numeric sort) it tries to read the whole file into memory and crashes, even though it has plenty of temp space and the buffer size is set to only 1G. In the second case sort appears to ignore the buffer-size option entirely...
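For completeness, the TMPDIR override mentioned above looks like this; the directory and data are illustrative placeholders (TMPDIR and -T have the same effect on where sort spills):

```shell
mkdir -p ./spill
printf '5\n4\n6\n' > nums.txt
# TMPDIR redirects sort's temporary spill files for this one invocation.
TMPDIR=./spill sort -S1M -n nums.txt > nums.sorted
```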
Thanks, alinas. The behavior of sort with pipes (named or unnamed) seems to be the same: with some options (e.g. sorting on a single numeric key) it works fine, while with others (e.g. multiple keys) it ignores the buffer-size option and tries to read the whole content of the pipe into memory.