Any code I post here should be considered experimental and unfinished. Don't use it in a production environment. It is your own responsibility to evaluate the code's fitness for any purpose. Programming isn't done in a vacuum, so be prepared to do your own research and teach yourself to do better.
Most importantly, all your polite critiques, elaborations, and corrections are heartily welcomed.
Selecting fields from a line using "cut"
using cut to break down text
You can use cut to grab only the data you want from a line of text. Tell cut which fields (groups of characters) you are interested in and it will print them. By default, cut uses tab characters to split each line of the input file into fields.
NOTE: In the example data files that follow, the strings <TAB> and <SPACE> are used to make whitespace characters visible. A tab character is the character generated when you press the Tab key, and has the value 0x09 on the ASCII chart. A space character is the character generated when you press the Space bar, and has the value 0x20 on the ASCII chart.
Let us begin with a sample input file, tab-delimited-fields.txt, to be processed by cut. This file makes good use of cut's default case. Using tab field delimiters allows us to include normal space characters within fields.
Code:
Ohio<TAB>Columbus<TAB>March 1, 1803
California<TAB>Sacramento<TAB>September 9, 1850
Texas<TAB>Austin<TAB>December 29, 1845

(NOTE: Remember that each occurrence of <TAB> is a single tab character.)
Fields are counted by cut beginning with 1. Each instance of the field delimiter (tab by default) increments the field number. If we want a quick list of any single field, we can select it with cut --fields=num, where num is the desired field number.
Code:
$ cut --fields=2 tab-delimited-fields.txt
Columbus
Sacramento
Austin

Specifying fields
We can output more than one field by specifying a comma-separated list of the fields we want, as in num1,num2....
Code:
$ cut --fields=1,3 tab-delimited-fields.txt
Ohio    March 1, 1803
California      September 9, 1850
Texas   December 29, 1845

We can also specify a range of fields to print by putting a hyphen between the first desired field and the last.
Code:
$ cut --fields=1-2 tab-delimited-fields.txt
Ohio    Columbus
California      Sacramento
Texas   Austin

To get a more explicit view of how cut is breaking the lines into fields, we can specify a unique character as an output delimiter. In the following example, the option --output-delimiter='@' tells cut to output @ instead of the literal field delimiters (tab characters).
Code:
$ cut --fields=1,2,3 --output-delimiter='@' tab-delimited-fields.txt
Ohio@Columbus@March 1, 1803
California@Sacramento@September 9, 1850
Texas@Austin@December 29, 1845

Other field delimiters
Set a new field delimiter with --delimiter=char, where char is any single character. Just as --output-delimiter changes the output field delimiter, this option changes the input field delimiter. Input fields can be separated by any single character, ideally one that never occurs inside the field data.
Code:
$ cat dot-delimited-fields.txt
Ohio.Columbus.March 1, 1803
California.Sacramento.September 9, 1850
Texas.Austin.December 29, 1845

Code:
$ cut --delimiter='.' --fields=2 dot-delimited-fields.txt
Columbus
Sacramento
Austin

Space delimited fields
Code:
Ohio<SPACE>Columbus<SPACE>March<SPACE>1,<SPACE>1803
California<SPACE>Sacramento<SPACE>September<SPACE>9,<SPACE>1850
Texas<SPACE>Austin<SPACE>December<SPACE>29,<SPACE>1845

(NOTE: Remember that each occurrence of <SPACE> is a single space character.)
Now there is some ambiguity as to which space characters constitute field delimiters. Guess what happens when we print field 3.
Code:
$ cut --delimiter=' ' --fields=3 space-delimited-fields.txt
March
September
December

The spaces in the date field cause it to be split into three fields instead of one. Now we have five fields to deal with instead of only three.
Code:
$ cut --delimiter=' ' --output-delimiter='@' --fields=1-5 space-delimited-fields.txt
Ohio@Columbus@March@1,@1803
California@Sacramento@September@9,@1850
Texas@Austin@December@29,@1845

If the last field is omitted from a range, cut assumes we want all subsequent fields.
Code:
$ cut --delimiter=' ' --fields=3- space-delimited-fields.txt
March 1, 1803
September 9, 1850
December 29, 1845

Variable length field separators
Now we're running into the limits of cut's usefulness. It counts each instance of the input field delimiter as the start of a new field. Suppose we have a file where the fields are separated by variable numbers of space characters. This can be useful to align data into neat columns.
Code:
Ohio<SPACE><SPACE><SPACE><SPACE><SPACE><SPACE><SPACE>Columbus<SPACE><SPACE><SPACE>March<SPACE>1,<SPACE>1803
California<SPACE>Sacramento<SPACE>September<SPACE>9,<SPACE>1850
Texas<SPACE><SPACE><SPACE><SPACE><SPACE><SPACE>Austin<SPACE><SPACE><SPACE><SPACE><SPACE>December<SPACE>29,<SPACE>1845

When the file is printed in monospace, we get neatly aligned columns.
Code:
$ cat variable-space-delimited-fields.txt
Ohio       Columbus   March 1, 1803
California Sacramento September 9, 1850
Texas      Austin     December 29, 1845

What happens when we print field 2?
Code:
$ cut --delimiter=' ' --fields=2 variable-space-delimited-fields.txt

Sacramento


Field 2 is now empty on lines 1 and 3. Let's set the output delimiter to a unique character to easily see where field 2 is.
Code:
$ cut --delimiter=' ' --fields=1- --output-delimiter='@' variable-space-delimited-fields.txt
Ohio@@@@@@@Columbus@@@March@1,@1803
California@Sacremento@September@9,@1850
Texas@@@@@@Austin@@@@@December@29,@1845

Field 2 is empty because each instance of the delimiter counts as a new field, so a run of spaces produces a run of empty fields. We could work around this by invoking cut within a Bash loop. Bash has facilities to test for empty strings. Let's try.
Code:
$ while read -r line; do
    i=2
    while [[ -z "$(echo "$line" | cut --delimiter=' ' --fields="$i")" ]]; do
      ((i++))
    done
    echo "$line" | cut --delimiter=' ' --fields="$i"
  done < variable-space-delimited-fields.txt
Columbus
Sacramento
Austin

This solution is ugly, clumsy, and inefficient: it runs cut at least twice for every line, and the inner loop never ends if a line has no non-empty field left to find. Any solution this complicated isn't likely to be reliable unless input data is strictly controlled. Printing more than one field per line would now be a daunting task.
There must be a better way. Tune in for the next article in this series to find out.

Addendum
Of course there are other solutions to the variable length separators problem. Just massage the data before passing it to cut. The tr utility can do the job quite handily with its --squeeze-repeats option.
Code:
$ tr --squeeze-repeats ' ' < variable-space-delimited-fields.txt
Ohio Columbus March 1, 1803
California Sacramento September 9, 1850
Texas Austin December 29, 1845

This effectively reduces all runs of the delimiter to a single instance. Now the fields are easily accessible to cut.
Code:
$ tr --squeeze-repeats ' ' < variable-space-delimited-fields.txt | cut --delimiter=' ' --fields=2
Columbus
Sacramento
Austin

Thanks to the LQWiki article for reminding me of this technique. The point remains that cut can't get the job done without some help.
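If the squeeze-then-cut pipeline is needed more than once, it could be wrapped in a small shell function. The following is only a sketch: the function name squeeze_cut and the sample variable are made up for illustration and do not come from the article. It uses the short options -s, -d, and -f, which are the short spellings of tr's --squeeze-repeats and cut's --delimiter and --fields.

```shell
#!/usr/bin/env bash

# Hypothetical helper (name invented for this sketch): read lines on stdin,
# collapse each run of spaces into a single space, then let cut select
# fields. The first argument is any field list that cut accepts.
squeeze_cut() {
    # tr -s ' '  is the short form of  tr --squeeze-repeats ' '
    # cut -d ' ' -f LIST  is the short form of  cut --delimiter=' ' --fields=LIST
    tr -s ' ' | cut -d ' ' -f "$1"
}

# Sample data with the same layout as the variable-space example file.
sample='Ohio       Columbus   March 1, 1803
California Sacramento September 9, 1850
Texas      Austin     December 29, 1845'

# Print field 2 only.
printf '%s\n' "$sample" | squeeze_cut 2
# Columbus
# Sacramento
# Austin

# Print field 1 and everything from field 3 onward.
printf '%s\n' "$sample" | squeeze_cut 1,3-
# Ohio March 1, 1803
# California September 9, 1850
# Texas December 29, 1845
```

One caveat with this approach: tr squeezes a run of spaces down to one space, not to zero, so a line that begins with spaces still has an empty field 1 after squeezing.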