LinuxQuestions.org
Any code I post here should be considered experimental and unfinished. Don't use it in a production environment. It is your own responsibility to evaluate the code's fitness for any purpose. Programming isn't done in a vacuum, so be prepared to do your own research and teach yourself to do better.

Most importantly, all your polite critiques, elaborations, and corrections are heartily welcomed.

Selecting fields from a line using "cut"

Posted 12-22-2011 at 06:56 PM by Telengard
Updated 12-26-2011 at 10:26 PM by Telengard
Tags cut

Using cut to break down text

You can use cut to grab only the data you want from a line of text. Tell cut which fields (delimited groups of characters) you are interested in and it will print them. By default, cut treats the tab character as the delimiter between fields.

NOTE: In the example data files which follow, the strings <TAB> and <SPACE> stand in for whitespace characters so that they can be told apart. A tab is the character generated by the Tab key (0x09 on the ASCII chart). A space is the character generated by the space bar (0x20 on the ASCII chart).

Let us begin with a sample input file, tab-delimited-fields.txt, to be processed by cut. This file suits cut's default behavior: because the fields are delimited by tabs, they can contain ordinary space characters.

Code:
Ohio<TAB>Columbus<TAB>March 1, 1803
California<TAB>Sacramento<TAB>September 9, 1850
Texas<TAB>Austin<TAB>December 29, 1845
(NOTE: Remember that each occurrence of <TAB> is a single tab character.)
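
To follow along, the sample file needs real tab characters, which are easy to lose when copying from a web page. Here is a minimal sketch, assuming a POSIX shell and the file name tab-delimited-fields.txt used in the commands that follow. printf expands each \t into a single tab.

```shell
# Write the three sample records; printf turns each \t into one tab character.
printf 'Ohio\tColumbus\tMarch 1, 1803\n'             >  tab-delimited-fields.txt
printf 'California\tSacramento\tSeptember 9, 1850\n' >> tab-delimited-fields.txt
printf 'Texas\tAustin\tDecember 29, 1845\n'          >> tab-delimited-fields.txt

# Count the tab characters as a sanity check: 3 lines with 2 tabs each.
tab_count=$(( $(tr -cd '\t' < tab-delimited-fields.txt | wc -c) ))
echo "tabs found: $tab_count"
```

If the count is not 6, some tabs were probably converted to spaces along the way.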

Fields are counted by cut beginning with 1. Each instance of the field delimiter (tab by default) increments the field number. If we want a quick list of any single field, we can select it with cut --fields=num, where num is the desired field number.

Code:
$ cut --fields=2 tab-delimited-fields.txt
Columbus
Sacramento
Austin
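
One detail worth knowing about how cut counts fields: a line that contains no delimiter at all is printed whole, whichever field is requested, unless --only-delimited (short form -s) is given. A small sketch, using a made-up two-line input rather than the sample file:

```shell
# Hypothetical input: the first line has tabs, the second has none.
input=$(printf 'Ohio\tColumbus\tMarch 1, 1803\nno tabs on this line\n')

# Default: the undelimited line passes through unchanged.
default_out=$(printf '%s\n' "$input" | cut --fields=2)

# With --only-delimited, lines lacking the delimiter are dropped.
strict_out=$(printf '%s\n' "$input" | cut --only-delimited --fields=2)

printf '%s\n' "$default_out"
echo '---'
printf '%s\n' "$strict_out"
```

The first command prints both "Columbus" and the undelimited line; the second prints only "Columbus".
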
Specifying fields

We can output more than one field by giving a comma-separated list of the field numbers we want, such as num1,num2.

Code:
$ cut --fields=1,3 tab-delimited-fields.txt
Ohio    March 1, 1803
California      September 9, 1850
Texas   December 29, 1845

We can also specify a range of fields to print by putting a hyphen between the first desired field and the last.

Code:
$ cut --fields=1-2 tab-delimited-fields.txt
Ohio    Columbus
California      Sacramento
Texas   Austin
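
Lists and ranges can be combined, and cut always prints the selected fields in their input order no matter how the list is written. A short sketch, feeding one sample record through a pipe instead of reading the file:

```shell
record=$(printf 'Ohio\tColumbus\tMarch 1, 1803')

# Asking for 3,1 gives the same result as 1,3: cut does not reorder fields.
reversed=$(printf '%s\n' "$record" | cut --fields=3,1)
ordered=$(printf '%s\n' "$record" | cut --fields=1,3)

# A list element may itself be a range; 1,2-3 selects all three fields.
mixed=$(printf '%s\n' "$record" | cut --fields=1,2-3)

[ "$reversed" = "$ordered" ] && echo "3,1 and 1,3 print the same line"
printf '%s\n' "$mixed"
```
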
To get a more explicit view of how cut is breaking the lines into fields, we can specify a unique character as an output delimiter. In the following example, the option --output-delimiter='@' tells cut to output @ instead of the literal field delimiters (tab characters).

Code:
$ cut --fields=1,2,3 --output-delimiter='@' tab-delimited-fields.txt
Ohio@Columbus@March 1, 1803
California@Sacramento@September 9, 1850
Texas@Austin@December 29, 1845
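
In GNU cut the output delimiter is a string, not just a single character, which can make field boundaries stand out even more. A sketch on one sample record (the ' | ' separator is an arbitrary choice):

```shell
record=$(printf 'Texas\tAustin\tDecember 29, 1845')

# Replace each tab between the selected fields with the string " | ".
piped=$(printf '%s\n' "$record" | cut --fields=1-3 --output-delimiter=' | ')
printf '%s\n' "$piped"
```
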
Other field delimiters

Set a different input field delimiter with --delimiter=char, where char is any single character. Just as --output-delimiter changes the delimiter cut writes, --delimiter changes the one it looks for in the input.

Code:
$ cat dot-delimited-fields.txt
Ohio.Columbus.March 1, 1803
California.Sacramento.September 9, 1850
Texas.Austin.December 29, 1845

Code:
$ cut --delimiter='.' --fields=2 dot-delimited-fields.txt
Columbus
Sacramento
Austin
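
The same option handles any single-character delimiter. As an illustration, here is a sketch on colon-separated records in the style of /etc/passwd. The two records are invented for the example.

```shell
# Hypothetical passwd-style records: name:password:uid:gid:comment:home:shell
records=$(printf '%s\n' \
  'alice:x:1000:1000:Alice Example:/home/alice:/bin/bash' \
  'bob:x:1001:1001:Bob Example:/home/bob:/bin/sh')

# Field 1 is the login name and field 7 the login shell.
names_and_shells=$(printf '%s\n' "$records" | cut --delimiter=':' --fields=1,7)
printf '%s\n' "$names_and_shells"
```

Note that the output keeps the colon between the two fields, because the output delimiter defaults to the input delimiter.
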
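Space-delimited fields

Suppose the same data is delimited by single spaces instead, in a file named space-delimited-fields.txt.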

Code:
Ohio<SPACE>Columbus<SPACE>March<SPACE>1,<SPACE>1803
California<SPACE>Sacramento<SPACE>September<SPACE>9,<SPACE>1850
Texas<SPACE>Austin<SPACE>December<SPACE>29,<SPACE>1845
(NOTE: Remember that each occurrence of <SPACE> is a single space character.)

Now there is some ambiguity as to which space characters constitute field delimiters. Guess what happens when we print field 3.

Code:
$ cut --delimiter=' ' --fields=3 space-delimited-fields.txt
March
September
December

The spaces in the date field cause it to be split into three fields instead of one. Now we have five fields to deal with instead of only three.

Code:
$ cut --delimiter=' ' --output-delimiter='@' --fields=1-5 space-delimited-fields.txt
Ohio@Columbus@March@1,@1803
California@Sacramento@September@9,@1850
Texas@Austin@December@29,@1845

If the end of a range is omitted, cut prints every field from the starting field through the end of the line.

Code:
$ cut --delimiter=' ' --fields=3- space-delimited-fields.txt
March 1, 1803
September 9, 1850
December 29, 1845
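
A range can be left open at the start as well: -N selects everything from field 1 through field N. A brief sketch on one space-delimited record:

```shell
record='Ohio Columbus March 1, 1803'

# -2 means fields 1 through 2; 3- means field 3 through the last field.
head_part=$(printf '%s\n' "$record" | cut --delimiter=' ' --fields=-2)
tail_part=$(printf '%s\n' "$record" | cut --delimiter=' ' --fields=3-)

printf '%s\n' "$head_part" "$tail_part"
```
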
Variable-length field separators

Now we're running into the limits of cut's usefulness. It counts each instance of the input field delimiter as the start of a new field. Suppose we have a file where the fields are separated by a variable number of space characters. Padding like this is often used to align data into neat columns.

Code:
Ohio<SPACE><SPACE><SPACE><SPACE><SPACE><SPACE><SPACE>Columbus<SPACE><SPACE><SPACE>March<SPACE>1,<SPACE>1803
California<SPACE>Sacramento<SPACE>September<SPACE>9,<SPACE>1850
Texas<SPACE><SPACE><SPACE><SPACE><SPACE><SPACE>Austin<SPACE><SPACE><SPACE><SPACE><SPACE>December<SPACE>29,<SPACE>1845

When the file is printed in monospace, we get neatly aligned columns.

Code:
$ cat variable-space-delimited-fields.txt
Ohio       Columbus   March 1, 1803
California Sacramento September 9, 1850
Texas      Austin     December 29, 1845

What happens when we print field 2?

Code:
$ cut --delimiter=' ' --fields=2 variable-space-delimited-fields.txt

Sacramento

Field 2 is now empty on lines 1 and 3, so cut prints a blank line for each of them. Let's set the output delimiter to a unique character to easily see where field 2 is.

Code:
$ cut --delimiter=' ' --fields=1- --output-delimiter='@' variable-space-delimited-fields.txt
Ohio@@@@@@@Columbus@@@March@1,@1803
California@Sacramento@September@9,@1850
Texas@@@@@@Austin@@@@@December@29,@1845

Field 2 is empty because each instance of the delimiter counts as a new field. We could work around this by invoking cut within a Bash loop. Bash has facilities to test for empty strings. Let's try.

Code:
$ while read line; do i=2; while [[ -z "$(echo "$line" | cut --delimiter=' ' --fields=$i)" ]]; do ((i++)); done; echo "$line" | cut --delimiter=' ' --fields=$i; done < variable-space-delimited-fields.txt
Columbus
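Sacramento
Austin

This solution is ugly, clumsy, and inefficient: it starts a new cut process for every field it tests, on every line. Any solution this complicated isn't likely to be reliable unless input data is strictly controlled; an empty input line, for example, would keep the inner loop running forever. Printing more than one field per line would now be a daunting task.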
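
For readability, the same idea can be spelled out as a function. This is only a sketch of the loop above, not a recommendation. The function name and the upper bound on the search are my own additions, and it is written for a plain POSIX shell rather than Bash specifically.

```shell
# first_field_after_1: hypothetical helper. For each line on standard input,
# print the first non-empty space-delimited field that follows field 1.
# The body runs in a subshell, so its variables stay private.
first_field_after_1() (
  while IFS= read -r line; do
    # Number of fields = number of spaces + 1. Bounding the search this way
    # means a line with no non-empty field left cannot loop forever.
    spaces=$(printf '%s' "$line" | tr -cd ' ' | wc -c)
    max=$((spaces + 1))
    i=2
    while [ "$i" -le "$max" ]; do
      field=$(printf '%s\n' "$line" | cut --delimiter=' ' --fields="$i")
      if [ -n "$field" ]; then
        printf '%s\n' "$field"
        break
      fi
      i=$((i + 1))
    done
  done
)

# Usage: feed it the aligned sample data.
printf '%s\n' \
  'Ohio       Columbus   March 1, 1803' \
  'California Sacramento September 9, 1850' \
  'Texas      Austin     December 29, 1845' | first_field_after_1
```

It is easier to read, but it still runs cut once per tested field, so the efficiency complaint stands.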

There must be a better way. Tune in for the next article in this series to find out.


Addendum

Of course, there are other solutions to the variable-length separators problem. One is to massage the data before passing it to cut. The tr utility can do the job quite handily with its --squeeze-repeats option.

Code:
$ tr --squeeze-repeats ' ' < variable-space-delimited-fields.txt
Ohio Columbus March 1, 1803
California Sacramento September 9, 1850
Texas Austin December 29, 1845

This effectively reduces all runs of the delimiter to a single instance. Now the fields are easily accessible to cut.

Code:
$ tr --squeeze-repeats ' ' < variable-space-delimited-fields.txt | cut --delimiter=' ' --fields=2
Columbus
Sacramento
Austin

Thanks to the LQWiki article for reminding me of this technique. The point remains that cut can't get the job done without some help.
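
One caveat with the tr approach: squeezing turns a run of spaces into one space but does not remove it, so a line that starts with spaces still has an empty first field. Here is a sketch with an invented, indented record. Stripping the leading spaces with sed first is one possible workaround, not the only one.

```shell
# Hypothetical record with two leading spaces.
record='  Ohio       Columbus   March 1, 1803'

# Squeezing alone leaves one leading space, so field 1 is empty and
# field 2 is "Ohio" rather than "Columbus".
squeezed_only=$(printf '%s\n' "$record" |
  tr --squeeze-repeats ' ' |
  cut --delimiter=' ' --fields=2)

# Removing leading spaces before squeezing restores the expected numbering.
trimmed_first=$(printf '%s\n' "$record" |
  sed 's/^ *//' |
  tr --squeeze-repeats ' ' |
  cut --delimiter=' ' --fields=2)

echo "without trimming: $squeezed_only"
echo "with trimming:    $trimmed_first"
```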