How to sort by line size (number of characters in a line)

fast_rizwaan · 01-08-2010, 10:25 AM

Hi,

I want to sort a number of lines based on their size:

data:
-------
12345678
87654321
1234
4321
123
321
12
21
1
2

Should output as:
-----------------
1
2
12
21
123
321
1234
4321
12345678
87654321

But I'm getting this with sort
----------------
1
12
123
1234
12345678
2
21
321
4321
87654321
----------------

Can we sort the above "data" text, based on "number of characters" instead of "character order"? a small bash script would also help.

Thanks in advance.

rweaver · 01-08-2010, 10:45 AM

sort -n

Code:

core:~$ cat datafile
12345678
87654321
1234
4321
123
321
12
21
1
2
core:~$ sort datafile
1
12
123
1234
12345678
2
21
321
4321
87654321
core:~$ sort -n datafile
1
2
12
21
123
321
1234
4321
12345678
87654321
core:~$ man sort
SORT(1)                                                                                                      User Commands                                                                                                      SORT(1)

NAME
       sort - sort lines of text files

SYNOPSIS
       sort [OPTION]... [FILE]...

DESCRIPTION
       Write sorted concatenation of all FILE(s) to standard output.

       Mandatory arguments to long options are mandatory for short options too.  Ordering options:

       -b, --ignore-leading-blanks
              ignore leading blanks

       -d, --dictionary-order
              consider only blanks and alphanumeric characters

       -f, --ignore-case
              fold lower case to upper case characters

       -g, --general-numeric-sort
              compare according to general numerical value

       -i, --ignore-nonprinting
              consider only printable characters

       -M, --month-sort
              compare (unknown) < 'JAN' < ... < 'DEC'

       -n, --numeric-sort
              compare according to string numerical value

       -R, --random-sort
              sort by random hash of keys

       --random-source=FILE
              get random bytes from FILE (default /dev/urandom)

       -r, --reverse
              reverse the result of comparisons

       Other options:

       -c, --check, --check=diagnose-first
              check for sorted input; do not sort

       -C, --check=quiet, --check=silent
              like -c, but do not report first bad line

       --compress-program=PROG
              compress temporaries with PROG; decompress them with PROG -d

       -k, --key=POS1[,POS2]
              start a key at POS1, end it at POS2 (origin 1)

       -m, --merge
              merge already sorted files; do not sort

       -o, --output=FILE
              write result to FILE instead of standard output

       -s, --stable
              stabilize sort by disabling last-resort comparison

       -S, --buffer-size=SIZE
              use SIZE for main memory buffer

       -t, --field-separator=SEP
              use SEP instead of non-blank to blank transition

       -T, --temporary-directory=DIR
              use DIR for temporaries, not $TMPDIR or /tmp; multiple options specify multiple directories

       -u, --unique
              with -c, check for strict ordering; without -c, output only the first of an equal run

       -z, --zero-terminated
              end lines with 0 byte, not newline

       --help display this help and exit

       --version
              output version information and exit

       POS  is F[.C][OPTS], where F is the field number and C the character position in the field; both are origin 1.  If neither -t nor -b is in effect, characters in a field are counted from the beginning of the preceding whitespace.
       OPTS is one or more single-letter ordering options, which override global ordering options for that key.  If no key is given, use the entire line as the key.

       SIZE may be followed by the following multiplicative suffixes: % 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.

       With no FILE, or when FILE is -, read standard input.

       *** WARNING *** The locale specified by the environment affects sort order.  Set LC_ALL=C to get the traditional sort order that uses native byte values.

AUTHOR
       Written by Mike Haertel and Paul Eggert.

REPORTING BUGS
       Report bugs to <bug-coreutils@gnu.org>.

COPYRIGHT
       Copyright © 2008 Free Software Foundation, Inc.  License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
       This is free software: you are free to change and redistribute it.  There is NO WARRANTY, to the extent permitted by law.

SEE ALSO
       The full documentation for sort is maintained as a Texinfo manual.  If the info and sort programs are properly installed at your site, the command

              info sort

       should give you access to the complete manual.

GNU coreutils 6.9.92.4-f088d-dirty                                                                            January 2008                                                                                                      SORT(1)

The -n flag tells sort to compare numerically. Most Unix programs have a lot of interesting flags that can be used for a variety of functions; try 'man programname' to see them.
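One caveat worth illustrating: -n compares numeric values, not line lengths, so it only looks like a length sort when every line is a plain number without leading zeros. A minimal sketch with made-up input (the three sample values and the function name numeric_demo are assumptions, not from the thread):

```shell
# sort -n orders by numeric value: 0001 is the number 1, so it sorts first
# even though it is the longest line (sample values are made up).
numeric_demo() {
    printf '%s\n' 0001 22 3 | sort -n
}
numeric_demo   # prints 0001, then 3, then 22
```

A sort by length would have put 0001 last instead.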

fast_rizwaan · 01-08-2010, 11:28 AM

sort -n won't work with lines that contain "spaces"

example:

$ sort -n
1 2
1 2 3
12345
12 34
12 345
--------

will get us:
---------
1 2
1 2 3
12 34
12 345
12345

Instead I want:
---------------
1 2
1 2 3
12 34 <= same size: 5 characters
12345 <= same size: 5 characters
12 345 <= 6 characters, so it goes below the 5-character lines

-----------
I did read man sort, but couldn't find an option that includes "blanks" and sorts by size (wc -c).

Thinking of creating a script with wc -c, sed, etc.

fast_rizwaan · 01-08-2010, 11:39 AM

just got the script working:
----------------------------

data.txt:
---------
1 2
1 2 3
12345
12 34
12 345

sortme.sh
-----------

Quote:

#!/bin/sh
file="data.txt"

# let's read all lines one by one
for i in $(seq $(wc -l < "$file"))
do
line=$(head -n "$i" "$file" | tail -n 1) # get text from line number i
linesize=$(printf '%s\n' "$line" | wc -c) # count number of characters (newline included)

# let's prepend the size to the line, sort on it, then get the data out
# (printf, unlike echo -e, leaves backslashes in the line untouched)
printf '%s\t%s\n' "$linesize" "$line"
done

Now the data needs sorting:
---------------------

Quote:

chmod +rx ./sortme.sh; ./sortme.sh | sort -n | cut -f2-

output:
------------
1 2
1 2 3
12345
12 34
12 345
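The script above rereads the file with head and tail for every line. A shorter sketch of the same decorate, sort, undecorate idea reads the file once. The function name sort_by_length is hypothetical, and the -s (stable) flag from the man page above is an added assumption so that lines of equal length keep their input order. ${#line} gives the line length without the trailing newline.

```shell
# Prefix each line with its length, sort numerically on that prefix,
# then strip the prefix again. sort_by_length is a hypothetical name.
sort_by_length() {
    while IFS= read -r line; do
        printf '%d\t%s\n' "${#line}" "$line"
    done < "$1" | sort -s -n -k1,1 | cut -f2-
}
```

Usage would be something like: sort_by_length data.txt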

rweaver · 01-08-2010, 12:31 PM

There ya go, good solution. But if it has to meet criteria like that, you need to specify them up front or we have no idea... all types come here, from complete newbies to professionals.

A shorter solution would be:

Code:

core:~/test/test20$ cat datafile | awk '{print length,$0}' | sort -n | awk ' {$1="";print $0}' | cut -f2- -d' '
1
2
12
21
123
321
1234
4321
1 723
4 234
92 784
12345678
87654321
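One thing to watch in the pipeline above: assigning to $1 in the second awk makes awk rebuild the line, which collapses runs of spaces into single spaces. A possible variant, sketched here under the assumption that a tab-separated length prefix is acceptable, strips the prefix with cut alone and leaves the spacing untouched. The name awk_sort_by_length is hypothetical.

```shell
# Decorate each line with "length<TAB>", sort on the length, cut it off again.
# Reads the named files, or standard input when no file is given.
awk_sort_by_length() {
    awk '{ print length($0) "\t" $0 }' "$@" | sort -s -n -k1,1 | cut -f2-
}
```

For example, awk_sort_by_length datafile should give the same order as the pipeline above while keeping any repeated spaces inside the lines.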

colucix · 01-08-2010, 02:47 PM

As an alternative, what about a little Perl one-liner?

Code:

perl -e 'print sort { length $a <=> length $b } <>' data.txt

rweaver · 01-08-2010, 03:01 PM

Honestly, it could probably be shortened to a sed or awk one-liner also.

MBybee · 01-08-2010, 03:03 PM

I have a Perl script that I wrote for doing this quickly on incredibly huge files, but for normal-sized stuff the 'sort' utility is definitely awesome.

colucix · 01-08-2010, 05:53 PM

I admit I am intrigued by this issue and I wonder if we can do it by means of the sort options alone. Looking at the info page of sort (which is more exhaustive than the man page) I arrived at this solution:

Code:

sort $(seq -f "-k1.%0.0f" 100 -1 1) file

In practice it uses multiple -k options, built by command substitution. The resulting command line will be something like:

Code:

sort -k1.100 -k1.99 -k1.98 ... <omitted> ... -k1.2 -k1.1 file

That is, the keys always refer to the first field, but since they start from a high character position and have no end position, the other fields are covered up to the end of the line (in other words, the entire line is treated as the first field, despite the presence of delimiters). The trick is that sort compares starting from the last character position back to the first, and where the Nth character does not exist (shorter lines) the key is empty and the line sorts first. As a result, lines are ordered from the shortest to the longest.

In practice we have to choose a number N greater than or equal to the number of characters in the longest line, keeping in mind that the greater N is, the longer the execution time. In my example I chose 100, which was enough for the text files I had at hand for testing.

Anyway, I'm not completely sure it works as expected. All of my tests came out correct, but if someone would like to test it and report the result, it would be much appreciated. Just out of eager curiosity!
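For anyone who wants to try it, one way to check the result is a small filter that fails as soon as a line is shorter than the one before it. This is only a sketch: the function name is_sorted_by_length is hypothetical, and the LC_ALL=C in the commented example is an assumption prompted by the locale warning in the man page above.

```shell
# Succeeds (exit status 0) only if the lines on stdin never get shorter.
is_sorted_by_length() {
    awk 'length($0) < prev { exit 1 } { prev = length($0) }'
}

# Example check of the multi-key command on a file named "file":
#   LC_ALL=C sort $(seq -f "-k1.%0.0f" 100 -1 1) file | is_sorted_by_length && echo ok
```

It only verifies the length ordering, not how lines of equal length are arranged among themselves.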

Just a final note: the presence of tabs in the text can be confusing since they are considered as single characters, even if they appear as multiple spaces on the terminal screen. To avoid this "optical illusion" we can expand the file before sorting:

Code:

expand file | sort $(seq -f "-k1.%0.0f" 100 -1 1)
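The effect of a tab on the character count can be seen with a tiny example. The sample string and the function name tab_lengths are made up for illustration, and the default tab stops of 8 columns are assumed:

```shell
# Length of "a<TAB>b" as counted by awk, before and after expand.
tab_lengths() {
    raw_len=$(printf 'a\tb\n' | awk '{ print length($0) }')
    expanded_len=$(printf 'a\tb\n' | expand | awk '{ print length($0) }')
    echo "raw=$raw_len expanded=$expanded_len"
}
tab_lengths   # prints raw=3 expanded=9
```

The tab counts as one character in the raw line, but as seven spaces once expand has padded it to the next tab stop.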