Any code I post here should be considered experimental and unfinished. Don't use it in a production environment. It is your own responsibility to evaluate the code's fitness for any purpose. Programming isn't done in a vacuum, so be prepared to do your own research and teach yourself to do better.
Most importantly, all your polite critiques, elaborations, and corrections are heartily welcomed.
Selecting fields from a line using "sed"
Yeah, sed ...
No matter how well I think I know sed, it still manages to bite me more often than serve me. Everything I do know about sed, manuals aside, I learned from other programs like ed, vi, and even awk. What I'm getting to is this: my sed-fu is weak, so don't take anything I say about it at face value. Do your own research and learn better for yourself (which you ought to do regardless).
About portability
I'll be using the --posix option with GNU sed, which disables the GNU extensions, to keep the commands presented in this article portable. In theory, these examples should work on traditional Unix systems just as well as they do on my modern GNU/Linux system. GNU sed offers many features which greatly enhance its usefulness, but those enhancements are not guaranteed to be available in any other implementation of sed.
Using sed to break down text
NOTE: In the examples which follow, the strings <TAB> and <SPACE> shall be used to visually distinguish whitespace characters. A tab character is the character generated when you press the Tab key on your computer's keyboard, and is assigned the value 0x09 on the ASCII chart. A space character is the character generated when you press the Space bar key on your computer's keyboard, and is assigned the value 0x20 on the ASCII chart.

Here's our tab delimited input file.
Code:
    Ohio<TAB>Columbus<TAB>March 1, 1803
    California<TAB>Sacramento<TAB>September 9, 1850
    Texas<TAB>Austin<TAB>December 29, 1845
(NOTE: Remember that each occurrence of <TAB> is a single tab character.)
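If you are ever unsure whether a file really contains tabs where you expect them, sed's own l command can show you. This is only a quick sketch: the variable name visible is made up for the example, and --posix is left off so that it should also run on a non-GNU sed.

```shell
#!/bin/sh
# The l command writes the line in a visually unambiguous form:
# a tab is shown as \t and the end of the line is marked with $.
# -n suppresses sed's automatic output so the line is shown once.
visible=$(printf 'Ohio\tColumbus\tMarch 1, 1803\n' | sed -n l)
printf '%s\n' "$visible"
```

For this one line of input the result should read Ohio\tColumbus\tMarch 1, 1803$ with the tabs spelled out.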
We can select fields with a BRE (Basic Regular Expression). This example prints the second field.
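Code:
    $ sed --posix 's/^\(..*\)'$'\t''\(..*\)'$'\t''.*$/\2/' tab-delimited-fields.txt
    Columbus
    Sacramento
    Austin
The $'\t' bit is a Bash-ism (some other shells, such as ksh93 and zsh, understand it too) to insert a literal tab character into the line. As far as I know there is no portable way to symbolically represent a tab character in sed regular expressions. To ensure portability we must use a literal tab character. If your shell isn't Bash, then you must learn its way to insert a literal tab character on the command line. Otherwise you may have to resort to something ugly like this (which I tested in dash):
Code:
    $ sed --posix "s/^\(..*\)`/usr/bin/printf \\\t`\(..*\)`/usr/bin/printf \\\t`.*\$/\2/" tab-delimited-fields.txt
    Columbus
    Sacramento
    Austin
Like I said: ugly, but portable. Either way, when sed reads the script from the command line it sees:
    s/^\(..*\)<TAB>\(..*\)<TAB>.*$/\2/
(NOTE: Remember that each occurrence of <TAB> is a single tab character.)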
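One way to tame the quoting, sketched here on the assumption of a POSIX-like shell (sh, dash, bash): capture the tab character in a variable once and let the shell expand it inside a double-quoted script. The names tab, input, and second are my own inventions, mktemp is assumed to be available, and --posix is left off so that the sketch should also run on a non-GNU sed.

```shell
#!/bin/sh
# Hold one literal tab character in a variable.
tab=$(printf '\t')

# Recreate the sample file in a temporary location.
input=$(mktemp)
printf 'Ohio\tColumbus\tMarch 1, 1803\n' > "$input"
printf 'California\tSacramento\tSeptember 9, 1850\n' >> "$input"
printf 'Texas\tAustin\tDecember 29, 1845\n' >> "$input"

# Double quotes let the shell expand ${tab}. The \( \) and \2 pass
# through to sed unchanged, and \$ keeps the end-of-line anchor
# away from the shell.
second=$(sed "s/^\(..*\)${tab}\(..*\)${tab}.*\$/\2/" "$input")
printf '%s\n' "$second"

rm -f "$input"
```

With the sample data this should print Columbus, Sacramento, and Austin, one per line, the same as the two commands above.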
To understand this script we must know that sed automatically outputs each line of the input file. Any transformations specified by the script are performed on sed's internal copy of the line before the line is output.
The transformation we specified is s (substitution). The general form of s is:
[address]s/regex/replacement/flags
where:
- address selects which lines the following command acts upon. With no address specified, the following command acts on all lines.
- s is the substitution command.
- / is a character to separate the substitution command from its arguments.
- regex is a regular expression which the input line will be matched against. By default sed uses BREs.
- replacement is the replacement text. The replacement text will overwrite the first match for regex in sed's internal copy of the input line. Our replacement was \2, which is a backreference to the second group expression.
- flags are single character modifiers which change the operation of the s command. With no flags we get the default behavior of s.
Our script instructed sed to replace the entire line with the second back reference. The entire line was matched because our regex is anchored at both ends. (NOTE: This may be a good opportunity for you to review BREs.) The end result is that only the field we're interested in was printed.
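The address and flags slots were left empty in our script, so here is a small sketch of what they can do. It assumes a POSIX-like shell and mktemp; the variable names (tab, input, only_line2, dashes) are made up for the example, and --posix is omitted so that it should also run on a non-GNU sed.

```shell
#!/bin/sh
tab=$(printf '\t')
input=$(mktemp)
printf 'Ohio\tColumbus\tMarch 1, 1803\n' > "$input"
printf 'California\tSacramento\tSeptember 9, 1850\n' >> "$input"
printf 'Texas\tAustin\tDecember 29, 1845\n' >> "$input"

# Address 2 restricts the command to the second input line.
# -n turns off the automatic output, and the p flag prints the
# line only when the substitution succeeded.
only_line2=$(sed -n "2s/^\(..*\)${tab}\(..*\)${tab}.*\$/\2/p" "$input")
printf '%s\n' "$only_line2"

# The g flag replaces every match on a line, not just the first.
dashes=$(sed "s/${tab}/-/g" "$input")
printf '%s\n' "$dashes"

rm -f "$input"
```

The first command should print only Sacramento. The second should print all three lines with both tabs on each line turned into dashes, for example Ohio-Columbus-March 1, 1803.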
Selecting fields
Suppose we want to print fields one and three.
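Code:
    $ sed --posix 's/^\(..*\)'$'\t''\(..*\)'$'\t''\(..*\)$/\1'$'\t''\3/' tab-delimited-fields.txt
    Ohio    March 1, 1803
    California      September 9, 1850
    Texas   December 29, 1845
In this case we used three group expressions in regex, with back references to the first and third separated by a tab character in replacement.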
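Back references may appear in replacement in any order, so the same regex can also rearrange fields. A minimal sketch, assuming a POSIX-like shell and mktemp, with made-up variable names (tab, input, swapped) and without --posix so that it should also run on a non-GNU sed:

```shell
#!/bin/sh
tab=$(printf '\t')
input=$(mktemp)
printf 'Ohio\tColumbus\tMarch 1, 1803\n' > "$input"
printf 'California\tSacramento\tSeptember 9, 1850\n' >> "$input"
printf 'Texas\tAustin\tDecember 29, 1845\n' >> "$input"

# Same three group expressions as before; only the replacement
# changes, putting field three ahead of field one.
swapped=$(sed "s/^\(..*\)${tab}\(..*\)${tab}\(..*\)\$/\3${tab}\1/" "$input")
printf '%s\n' "$swapped"

rm -f "$input"
```

Each output line should now start with the date, followed by a tab and the state name.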
Other field delimiters
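Using a different character as a field delimiter will make our script easier to write if it means we can avoid encoding literal tab characters into regex. Since . is a regular expression meta-character we will still have to escape it.
Code:
    $ cat dot-delimited-fields.txt
    Ohio.Columbus.March 1, 1803
    California.Sacramento.September 9, 1850
    Texas.Austin.December 29, 1845
Code:
    $ sed --posix 's/^\(..*\)\.\(..*\)\.\(..*\)$/\1.\3/' dot-delimited-fields.txt
    Ohio.March 1, 1803
    California.September 9, 1850
    Texas.December 29, 1845
If a different output field delimiter is desired, then simply insert it instead of . in replacement. Almost any string may be used (characters such as &, \, and the / delimiter itself must be escaped with a backslash), so neatly aligned columns are possible.
Code:
    $ sed --posix 's/^\(..*\)\.\(..*\)\.\(..*\)$/\1'$'    \t''\3/' dot-delimited-fields.txt
    Ohio            March 1, 1803
    California      September 9, 1850
    Texas           December 29, 1845
The $'<SPACE><SPACE><SPACE><SPACE>\t' bit is a Bash-ism which inserts four space characters followed by a single tab character. If your shell is not Bash, then you'll have to insert those characters by some other method to duplicate the results.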
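One caveat when choosing an output delimiter: in replacement, an unescaped & stands for the whole matched text, so a literal & has to be written as \&. The sketch below shows both behaviours. It assumes a POSIX-like shell and mktemp, the variable names (input, escaped, unescaped) are my own, and --posix is omitted so that it should also run on a non-GNU sed.

```shell
#!/bin/sh
input=$(mktemp)
printf 'Ohio.Columbus.March 1, 1803\n' > "$input"
printf 'California.Sacramento.September 9, 1850\n' >> "$input"
printf 'Texas.Austin.December 29, 1845\n' >> "$input"

# \& is a literal ampersand, giving " & " as the output delimiter.
escaped=$(sed 's/^\(..*\)\.\(..*\)\.\(..*\)$/\1 \& \3/' "$input")
printf '%s\n' "$escaped"

# A bare & is replaced by the entire match (here the whole line).
# Only line 1 is processed to keep the output short.
unescaped=$(sed -n '1s/^\(..*\)\.\(..*\)\.\(..*\)$/\1 & \3/p' "$input")
printf '%s\n' "$unescaped"

rm -f "$input"
```

The first command should give lines such as Ohio & March 1, 1803. The second should give Ohio Ohio.Columbus.March 1, 1803 March 1, 1803, which is almost never what you want.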
Space delimited fields
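Code:
    Ohio<SPACE>Columbus<SPACE>March<SPACE>1,<SPACE>1803
    California<SPACE>Sacramento<SPACE>September<SPACE>9,<SPACE>1850
    Texas<SPACE>Austin<SPACE>December<SPACE>29,<SPACE>1845
Now there is some ambiguity as to which space characters constitute field delimiters. Guess what happens when we print field 3.
Code:
    $ sed --posix 's/^\(..*\) \(..*\) \(..*\)$/\3/' space-delimited-fields.txt
    1803
    1850
    1845
In regex we specified that the third group expression is immediately preceded by a single space character (outside the group expression) and anchored to the line end. Because ..* matches as much as it can, the first group expression swallowed everything up to the last two spaces, and the remaining parts of the date field were absorbed into the first and second group expressions. This is easily demonstrated by specifying a unique output field delimiter in replacement.
Code:
    $ sed --posix 's/^\(..*\) \(..*\) \(..*\)$/\1@\2@\3/' space-delimited-fields.txt
    Ohio Columbus March@1,@1803
    California Sacramento September@9,@1850
    Texas Austin December@29,@1845
Because of the ambiguity as to which space characters constitute field delimiters, we must specify which characters belong in which group expression. Here [^<SPACE>][^<SPACE>]* matches a run of one or more non-space characters, and the third group expression spells out the three space-separated parts of the date.
Code:
    $ sed --posix 's/^\([^ ][^ ]*\) \([^ ][^ ]*\) \([^ ][^ ]* [^ ][^ ]* [^ ][^ ]*\)$/\1@\2@\3/' space-delimited-fields.txt
    Ohio@Columbus@March 1, 1803
    California@Sacramento@September 9, 1850
    Texas@Austin@December 29, 1845
Now let's print field 3.
Code:
    $ sed --posix 's/^\([^ ][^ ]*\) \([^ ][^ ]*\) \([^ ][^ ]* [^ ][^ ]* [^ ][^ ]*\)$/\3/' space-delimited-fields.txt
    March 1, 1803
    September 9, 1850
    December 29, 1845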
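If all we want is the third field, one possible shortcut (my own variation, not necessarily better) is to pin down only the first two fields with [^ ][^ ]* and let a single group expression take the rest of the line. This is a sketch that assumes a POSIX-like shell and mktemp; the variable names input and third are made up, and --posix is omitted so that it should also run on a non-GNU sed.

```shell
#!/bin/sh
input=$(mktemp)
printf 'Ohio Columbus March 1, 1803\n' > "$input"
printf 'California Sacramento September 9, 1850\n' >> "$input"
printf 'Texas Austin December 29, 1845\n' >> "$input"

# The first two fields cannot contain a space, so they are matched
# without ambiguity. Whatever follows the second space is captured
# as the third field, however many words it holds.
third=$(sed 's/^[^ ][^ ]* [^ ][^ ]* \(.*\)$/\1/' "$input")
printf '%s\n' "$third"

rm -f "$input"
```

The trade-off is that this version no longer checks that the date has exactly three parts, so a malformed line would pass through with whatever follows its second space.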
Variable length field separators
Code:
    Ohio<SPACE><SPACE><SPACE><SPACE><SPACE><SPACE><SPACE>Columbus<SPACE><SPACE><SPACE>March<SPACE>1,<SPACE>1803
    California<SPACE>Sacramento<SPACE>September<SPACE>9,<SPACE>1850
    Texas<SPACE><SPACE><SPACE><SPACE><SPACE><SPACE>Austin<SPACE><SPACE><SPACE><SPACE><SPACE>December<SPACE>29,<SPACE>1845
(NOTE: Remember that each occurrence of <SPACE> is a single space character.)
When the file is printed in monospace, we get neatly aligned columns.
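Code:
    $ cat variable-space-delimited-fields.txt
    Ohio       Columbus   March 1, 1803
    California Sacramento September 9, 1850
    Texas      Austin     December 29, 1845
Code:
    $ sed --posix 's/^\([^ ][^ ]*\) \([^ ][^ ]*\) \([^ ][^ ]* [^ ][^ ]* [^ ][^ ]*\)$/\1@\2@\3/' variable-space-delimited-fields.txt
    Ohio       Columbus   March 1, 1803
    California@Sacramento@September 9, 1850
    Texas      Austin     December 29, 1845
We specified that the group expressions are separated by single space characters. The match failed for lines one and three. When a match fails for an s command, the input line is passed unchanged to output. The easy and obvious fix is to specify that the group expressions are separated by one or more space characters, which in a BRE is written as a space followed by <SPACE>*.
Code:
    $ sed --posix 's/^\([^ ][^ ]*\)  *\([^ ][^ ]*\)  *\([^ ][^ ]* [^ ][^ ]* [^ ][^ ]*\)$/\1@\2@\3/' variable-space-delimited-fields.txt
    Ohio@Columbus@March 1, 1803
    California@Sacramento@September 9, 1850
    Texas@Austin@December 29, 1845
Now let's print fields one and three, and separate the output fields with four space characters followed by a single tab character.
Code:
    $ sed --posix 's/^\([^ ][^ ]*\)  *\([^ ][^ ]*\)  *\([^ ][^ ]* [^ ][^ ]* [^ ][^ ]*\)$/\1'$'    \t''\3/' variable-space-delimited-fields.txt
    Ohio            March 1, 1803
    California      September 9, 1850
    Texas           December 29, 1845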
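POSIX BREs have no + operator, but they do have interval expressions, and \{1,\} says "one or more" without repeating the preceding item. Here is a sketch of the @-delimited variant rewritten that way. It assumes a POSIX-like shell and mktemp, the variable names input and joined are my own, and --posix is omitted so that it should also run on a non-GNU sed.

```shell
#!/bin/sh
input=$(mktemp)
printf 'Ohio       Columbus   March 1, 1803\n' > "$input"
printf 'California Sacramento September 9, 1850\n' >> "$input"
printf 'Texas      Austin     December 29, 1845\n' >> "$input"

# [^ ]\{1,\} is one or more non-space characters, and a space
# followed by \{1,\} is one or more space characters.
joined=$(sed 's/^\([^ ]\{1,\}\) \{1,\}\([^ ]\{1,\}\) \{1,\}\([^ ]\{1,\} [^ ]\{1,\} [^ ]\{1,\}\)$/\1@\2@\3/' "$input")
printf '%s\n' "$joined"

rm -f "$input"
```

This should print Ohio@Columbus@March 1, 1803 and so on for all three lines. Whether it is any easier on the eyes than doubling each item is a matter of taste.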
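What we've shown is that sed is a very powerful text transformer, even in portability mode. The GNU sed extensions further enhance the power and capabilities of sed, and could have made our work here much easier, at the cost of portability. At the very least, much effort and eye strain could have been saved by GNU's ability to write a tab character symbolically as \t in regex.
Could a true sed master do better? Almost certainly. Submit your suggestions below.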

