LinuxQuestions.org
01-08-2010, 10:25 AM   #1
fast_rizwaan
Member
 
Registered: Oct 2002
Location: Hyderabad, India
Distribution: Slackware 10.1
Posts: 34

Rep: Reputation: 15
How to sort by line size (number of characters in a line)


Hi,

I want to sort a number of lines based on their size:

data:
-------
12345678
87654321
1234
4321
123
321
12
21
1
2

Should output as:
-----------------
1
2
12
21
123
321
1234
4321
12345678
87654321

But I'm getting this with sort
----------------
1
12
123
1234
12345678
2
21
321
4321
87654321
----------------

Can we sort the above "data" text based on "number of characters" instead of "character order"? A small bash script would also help.

Thanks in advance.
 
01-08-2010, 10:45 AM   #2
rweaver
Senior Member
 
Registered: Dec 2008
Location: Louisville, OH
Distribution: Debian, CentOS, Slackware, RHEL, Gentoo
Posts: 1,833

Rep: Reputation: 163
sort -n

Code:
core:~$ cat datafile
12345678
87654321
1234
4321
123
321
12
21
1
2
core:~$ sort datafile
1
12
123
1234
12345678
2
21
321
4321
87654321
core:~$ sort -n datafile
1
2
12
21
123
321
1234
4321
12345678
87654321
core:~$ man sort
SORT(1)                                                                                                      User Commands                                                                                                      SORT(1)

NAME
       sort - sort lines of text files

SYNOPSIS
       sort [OPTION]... [FILE]...

DESCRIPTION
       Write sorted concatenation of all FILE(s) to standard output.

       Mandatory arguments to long options are mandatory for short options too.  Ordering options:

       -b, --ignore-leading-blanks
              ignore leading blanks

       -d, --dictionary-order
              consider only blanks and alphanumeric characters

       -f, --ignore-case
              fold lower case to upper case characters

       -g, --general-numeric-sort
              compare according to general numerical value

       -i, --ignore-nonprinting
              consider only printable characters

       -M, --month-sort
              compare (unknown) < 'JAN' < ... < 'DEC'

       -n, --numeric-sort
              compare according to string numerical value

       -R, --random-sort
              sort by random hash of keys

       --random-source=FILE
              get random bytes from FILE (default /dev/urandom)

       -r, --reverse
              reverse the result of comparisons

       Other options:

       -c, --check, --check=diagnose-first
              check for sorted input; do not sort

       -C, --check=quiet, --check=silent
              like -c, but do not report first bad line

       --compress-program=PROG
              compress temporaries with PROG; decompress them with PROG -d

       -k, --key=POS1[,POS2]
              start a key at POS1, end it at POS2 (origin 1)

       -m, --merge
              merge already sorted files; do not sort

       -o, --output=FILE
              write result to FILE instead of standard output

       -s, --stable
              stabilize sort by disabling last-resort comparison

       -S, --buffer-size=SIZE
              use SIZE for main memory buffer

       -t, --field-separator=SEP
              use SEP instead of non-blank to blank transition

       -T, --temporary-directory=DIR
              use DIR for temporaries, not $TMPDIR or /tmp; multiple options specify multiple directories

       -u, --unique
              with -c, check for strict ordering; without -c, output only the first of an equal run

       -z, --zero-terminated
              end lines with 0 byte, not newline

       --help display this help and exit

       --version
              output version information and exit

       POS  is F[.C][OPTS], where F is the field number and C the character position in the field; both are origin 1.  If neither -t nor -b is in effect, characters in a field are counted from the beginning of the preceding whitespace.
       OPTS is one or more single-letter ordering options, which override global ordering options for that key.  If no key is given, use the entire line as the key.

       SIZE may be followed by the following multiplicative suffixes: % 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.

       With no FILE, or when FILE is -, read standard input.

       *** WARNING *** The locale specified by the environment affects sort order.  Set LC_ALL=C to get the traditional sort order that uses native byte values.

AUTHOR
       Written by Mike Haertel and Paul Eggert.

REPORTING BUGS
       Report bugs to <bug-coreutils@gnu.org>.

COPYRIGHT
       Copyright © 2008 Free Software Foundation, Inc.  License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
       This is free software: you are free to change and redistribute it.  There is NO WARRANTY, to the extent permitted by law.

SEE ALSO
       The full documentation for sort is maintained as a Texinfo manual.  If the info and sort programs are properly installed at your site, the command

              info sort

       should give you access to the complete manual.

GNU coreutils 6.9.92.4-f088d-dirty                                                                            January 2008                                                                                                      SORT(1)
The -n flag tells it to sort numerically. Most Unix programs have a lot of useful flags that can be used for a variety of functions; try 'man programname'.
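
One caveat worth spelling out: -n compares numeric values, not lengths. It gives the requested order for the sample data only because every line is a plain integer without leading zeros. A minimal sketch, using made-up sample values that are not from the thread, shows a case where the two orders differ:

```shell
# Sample values are hypothetical: "007" is the longest line but its value is 7.
numeric_order=$(printf '%s\n' 12 007 5 | sort -n)
printf '%s\n' "$numeric_order"
# Numeric order puts 007 before 12, although 007 is the longer line.
```

So for input that is not purely numeric, or that has leading zeros, something other than -n is needed.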

Last edited by rweaver; 01-08-2010 at 10:54 AM.
 
01-08-2010, 11:28 AM   #3
fast_rizwaan
Member
 
Registered: Oct 2002
Location: Hyderabad, India
Distribution: Slackware 10.1
Posts: 34

Original Poster
Rep: Reputation: 15
sort -n won't work with lines that contain spaces.

example:

$ sort -n
1 2
1 2 3
12345
12 34
12 345
--------

will get us:
---------
1 2
1 2 3
12 34
12 345
12345

Instead I want:
---------------
1 2
1 2 3
12 34 <= 5 characters
12345 <= 5 characters
12 345 <= 6 characters, so it goes below the 5-character lines

-----------
I did read man sort, but couldn't find an option that includes blanks and sorts by line size (as wc -c would count it).

I'm thinking of writing a script with wc -c, sed, etc.
 
01-08-2010, 11:39 AM   #4
fast_rizwaan
Member
 
Registered: Oct 2002
Location: Hyderabad, India
Distribution: Slackware 10.1
Posts: 34

Original Poster
Rep: Reputation: 15
just got the script working:
----------------------------

data.txt:
---------
1 2
1 2 3
12345
12 34
12 345

sortme.sh
-----------
Quote:
file="data.txt"

# Read the file one line at a time, by line number
for i in $(seq "$(wc -l < "$file")")
do
    line=$(head -n "$i" "$file" | tail -n 1)   # get the text of line number i
    linesize=$(echo "$line" | wc -c)           # count the characters (plus the newline)

    # Prefix each line with its size so it can be sorted, then stripped off again
    printf '%s\t%s\n' "$linesize" "$line"
done
Now the data needs sorting:
---------------------

Quote:
chmod +rx ./sortme.sh; ./sortme.sh | sort -n | cut -f2-
output:
------------
1 2
1 2 3
12345
12 34
12 345
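
The same decorate, sort, undecorate idea can also be done in a single pass over the input. The sketch below is one possible variant, not the script from this post: `sort_by_length` is a hypothetical function name, the length comes from the shell's `${#line}` expansion rather than `wc -c`, and `sort -s` (the stable option listed in the man page quoted earlier) is assumed to be available so that lines of equal length keep their input order.

```shell
# sort_by_length: hypothetical helper, reads lines on stdin.
sort_by_length() {
    while IFS= read -r line || [ -n "$line" ]; do
        # ${#line} is the line length without the trailing newline
        printf '%s\t%s\n' "${#line}" "$line"
    done |
        sort -s -n -k1,1 |   # compare only the length prefix; -s keeps ties in input order
        cut -f2-             # drop the prefix, keep any tabs inside the original line
}

# Same sample lines as data.txt above
printf '%s\n' '1 2' '1 2 3' '12345' '12 34' '12 345' | sort_by_length
```

Reading the input once avoids calling head and tail for every line, which matters on larger files.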

Last edited by fast_rizwaan; 01-08-2010 at 12:13 PM.
 
01-08-2010, 12:31 PM   #5
rweaver
Senior Member
 
Registered: Dec 2008
Location: Louisville, OH
Distribution: Debian, CentOS, Slackware, RHEL, Gentoo
Posts: 1,833

Rep: Reputation: 163
There you go, good solution. But if it has to meet criteria like that, you need to specify them up front or we have no idea; just about all types come here, from complete newbies to professionals.

A shorter solution would be:

Code:
core:~/test/test20$ awk '{print length,$0}' datafile | sort -n | cut -f2- -d' '
1
2
12
21
123
321
1234
4321
1 723
4 234
92 784
12345678
87654321

Last edited by rweaver; 01-08-2010 at 12:39 PM.
 
01-08-2010, 02:47 PM   #6
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,488

Rep: Reputation: 1956
As an alternative, what about a little Perl one-liner?
Code:
perl -e 'print sort { length $a <=> length $b } <>' data.txt
 
01-08-2010, 03:01 PM   #7
rweaver
Senior Member
 
Registered: Dec 2008
Location: Louisville, OH
Distribution: Debian, CentOS, Slackware, RHEL, Gentoo
Posts: 1,833

Rep: Reputation: 163
Honestly, it could probably be shortened to a sed or awk one-liner also.
 
01-08-2010, 03:03 PM   #8
MBybee
Member
 
Registered: Jan 2009
Location: wherever I can make a living
Distribution: PC-BSD / FreeBSD / Debian / Ubuntu / Win7 / OpenVMS
Posts: 438

Rep: Reputation: 57
I have a Perl script that I wrote for doing this quickly on incredibly huge files, but for normal-size stuff the 'sort' utility is definitely awesome.
 
01-08-2010, 05:53 PM   #9
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,488

Rep: Reputation: 1956
I admit I am intrigued by this issue, and I wonder if we can do it using the sort options alone. Looking at the info page of sort (which is more exhaustive than the man page) I arrived at this solution:
Code:
sort $(seq -f "-k1.%0.0f" 100 -1 1) file
In practice, it uses multiple -k options, built by command substitution. The resulting command line will be something like:
Code:
sort -k1.100 -k1.99 -k1.98 ... <omitted> ... -k1.2 -k1.1 file
That is, every key starts in the first field, but because no end position is given, each key runs from its start position to the end of the line, covering the other fields as well (in other words, the entire line is treated as the first field, despite the presence of delimiters). The trick is that sort compares starting from the last character position and works back to the first, and where the Nth character does not exist (shorter lines) the key is empty and sorts first. The result is that lines are ordered from the shortest to the longest.

In practice, we have to choose a number N greater than or equal to the number of characters in the longest line, keeping in mind that the greater N is, the longer the execution time. In my example I chose 100, which was enough for the text files I had at hand for testing.

Anyway, I'm not completely sure it works as expected. My tests all came out correct, but if someone would like to test it and report the result, it would be much appreciated. Just out of my eager curiosity!
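
For anyone who wants to try it, here is a small self-check, written as a sketch under a few assumptions: GNU `sort` and `seq` are installed, `LC_ALL=C` is set so that comparisons are bytewise, and N is computed from the longest line with `awk` instead of being fixed at 100. The function name `ksort_by_length` and the temporary test file are hypothetical.

```shell
# ksort_by_length FILE: sort FILE by line length using only sort -k options.
# Hypothetical helper; assumes GNU sort and seq.
ksort_by_length() {
    file=$1
    # N = length in bytes of the longest line (0 for an empty file)
    n=$(LC_ALL=C awk 'length($0) > max { max = length($0) } END { print max + 0 }' "$file")
    # seq needs at least one key to generate
    [ "$n" -ge 1 ] || n=1
    # Expands to -k1.N ... -k1.2 -k1.1; the unquoted $(...) is split on purpose
    LC_ALL=C sort $(seq -f '-k1.%0.0f' "$n" -1 1) "$file"
}

# Try it on a few sample lines from earlier in the thread
tmpfile=$(mktemp)
printf '%s\n' '12 345' '1 2 3' '12345' '1 2' > "$tmpfile"
ksort_by_length "$tmpfile"
rm -f "$tmpfile"
```

Lines of equal length are ordered by comparing their endings first, so ties will not necessarily stay in input order.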

Just a final note: the presence of tabs in the text can be confusing since they are considered as single characters, even if they appear as multiple spaces on the terminal screen. To avoid this "optical illusion" we can expand the file before sorting:
Code:
expand file | sort $(seq -f "-k1.%0.0f" 100 -1 1)
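
To see the effect of the tab, the sketch below measures the same line before and after `expand`, assuming the default tab stops of 8 columns. The sample line is made up for illustration.

```shell
# A line with one tab between two letters (hypothetical sample)
raw_len=$(printf 'a\tb\n' | awk '{ print length($0) }')
expanded_len=$(printf 'a\tb\n' | expand | awk '{ print length($0) }')
echo "raw: $raw_len, expanded: $expanded_len"
# Raw, the tab counts as 1 character; expand pads it with 7 spaces to the next tab stop.
```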
 
  


All times are GMT -5.