LinuxQuestions.org
Notices

Any code I post here should be considered experimental and unfinished. Don't use it in a production environment. It is your own responsibility to evaluate the code's fitness for any purpose. Programming isn't done in a vacuum, so be prepared to do your own research and teach yourself to do better.

Most importantly, all your polite critiques, elaborations, and corrections are heartily welcomed.

Selecting fields from a line using "sed"

Posted 12-25-2011 at 12:51 AM by Telengard
Updated 12-26-2011 at 10:24 PM by Telengard
Tags sed

Yeah, sed ...

No matter how well I think I know sed, it still manages to bite me more often than serve me. Everything I do know about sed, manuals aside, I learned from other programs like ed, vi, and even awk. What I'm getting to is this: my sed-fu is weak, so don't take anything I say about it at face value. Do your own research and learn better for yourself (which you ought to do regardless).

About portability

I'll be using the --posix option with GNU sed, which disables the GNU extensions, to help ensure portability of the commands presented in this article. In theory, these examples should work on traditional Unix systems just as well as they do on my modern GNU/Linux system. GNU sed offers many features which greatly enhance its usefulness, but those enhancements are not guaranteed to be available on any other implementation of sed.

Using sed to break down text

NOTE: In the examples which follow, the strings <TAB> and <SPACE> shall be used to visually distinguish whitespace characters. A tab character is the character generated when you press the Tab key on your computer's keyboard, and is assigned the value 0x09 on the ASCII chart. A space character is the character generated when you press the Space bar key on your computer's keyboard, and is assigned the value 0x20 on the ASCII chart.

Here's our tab-delimited input file, tab-delimited-fields.txt.

Code:
Ohio<TAB>Columbus<TAB>March 1, 1803
California<TAB>Sacremento<TAB>September 9, 1850
Texas<TAB>Austin<TAB>December 29, 1845
(NOTE: Remember that each occurrence of <TAB> is a single tab character.)

We can select fields with a BRE (Basic Regular Expression). This example prints the second field.

Code:
$ sed --posix 's/^\(..*\)'$'\t''\(..*\)'$'\t''.*$/\2/' tab-delimited-fields.txt
Columbus
Sacremento
Austin
The $'\t' bit is a Bash-ism to insert a literal tab character into the line. As far as I know there is no portable way to symbolically represent a tab character in sed regular expressions. To ensure portability we must use a literal tab character. If your shell isn't Bash, then you must learn its way to insert a literal tab character on the command line. Otherwise you may have to resort to something ugly like this (which I tested in dash):

Code:
$ sed --posix "s/^\(..*\)`/usr/bin/printf \\\t`\(..*\)`/usr/bin/printf \\\t`.*\$/\2/" tab-delimited-fields.txt
Columbus
Sacremento
Austin
Like I said: ugly, but portable. Either way, when sed reads the script from the command line it sees:

s/^\(..*\)<TAB>\(..*\)<TAB>.*$/\2/

(NOTE: Remember that each occurrence of <TAB> is a single tab character.)
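
If the backtick version is hard on the eyes, one possible alternative is to store the tab in a shell variable first. This is only a sketch, assuming a POSIX shell whose printf understands \t. The variable names tab and result and the inline sample data are my own choices for illustration.

```shell
# Store one literal tab character in a variable (the name "tab" is arbitrary).
tab=$(printf '\t')

# The input is generated inline with printf so the tabs are real.
# Double quotes let the shell expand ${tab}; \$ keeps the end-of-line anchor literal.
# With GNU sed, --posix can be added as in the rest of the article.
result=$(printf 'Ohio\tColumbus\tMarch 1, 1803\nTexas\tAustin\tDecember 29, 1845\n' |
  sed "s/^\(..*\)${tab}\(..*\)${tab}.*\$/\2/")

printf '%s\n' "$result"
```

This should print Columbus and Austin, one per line. The script sed receives is the same as the one shown above; only the way the tab gets onto the command line differs.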

To understand this script we must know that sed automatically outputs each line of the input file. Any transformations specified by the script are performed on sed's internal copy of the line before the line is output.

The transformation we specified is s (substitution). The general form of s is:

[address]s/regex/replacement/flags

where:
  • address selects which lines the following command acts upon. With no address specified, the following command acts on all lines.
  • s is the substitution command.
  • / is a character to separate the substitution command from its arguments.
  • regex is a regular expression which the input line will be matched against. By default sed uses BREs.
  • replacement is the replacement text. The replacement text will overwrite the first match for regex in sed's internal copy of the input line. Our replacement was \2, which is a backreference to the second group expression.
  • flags are single character modifiers which change the operation of the s command. With no flags we get the default behavior of s.

Our script instructed sed to replace the entire line with the second back reference. The entire line was matched because our regex is anchored at both ends. (NOTE: This may be a good opportunity for you to review BREs.) The end result is that only the field we're interested in was printed.
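
Our script left the address and flags parts empty, so here is a small sketch of what they do. The input lines are made up for illustration; 2 is a line-number address, and g is the standard flag that replaces every match on the line instead of only the first.

```shell
# No address, no flags: every line is processed, and only the first "a" on each is replaced.
first_only=$(printf 'banana\nbanana\n' | sed 's/a/A/')

# Address 2 plus the g flag: only line 2 is processed, and every "a" on it is replaced.
line2_all=$(printf 'banana\nbanana\n' | sed '2s/a/A/g')

printf '%s\n' "$first_only" "$line2_all"
```

The first command should give bAnana on both lines; the second should leave line 1 alone and turn line 2 into bAnAnA.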

Selecting fields

Suppose we want to print fields one and three.

Code:
$ sed --posix 's/^\(..*\)'$'\t''\(..*\)'$'\t''\(..*\)$/\1'$'\t''\3/' tab-delimited-fields.txt
Ohio    March 1, 1803
California      September 9, 1850
Texas   December 29, 1845
In this case we used three group expressions in regex, with back references to the first and third separated by a tab character in replacement.
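
One caveat worth sketching: ..* is greedy, so the scripts so far rely on every line containing exactly two tabs. The four-field line below is hypothetical, and using the bracket expression [^<TAB>] (any character except a tab) for the groups is my suggestion, not something the scripts above do.

```shell
tab=$(printf '\t')

# Hypothetical line with four tab-delimited fields: a, b, c, d.
# Greedy ..* lets the first group swallow "a<TAB>b", so \2 ends up being the third field.
greedy=$(printf 'a\tb\tc\td\n' |
  sed "s/^\(..*\)${tab}\(..*\)${tab}.*\$/\2/")

# [^<TAB>][^<TAB>]* cannot cross a tab, so \2 is always the second field.
strict=$(printf 'a\tb\tc\td\n' |
  sed "s/^\([^${tab}][^${tab}]*\)${tab}\([^${tab}][^${tab}]*\)${tab}.*\$/\2/")

printf 'greedy: %s\nstrict: %s\n' "$greedy" "$strict"
```

With this input the greedy version should report c while the stricter version reports b.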

Other field delimiters

Code:
$ cat dot-delimited-fields.txt
Ohio.Columbus.March 1, 1803
California.Sacremento.September 9, 1850
Texas.Austin.December 29, 1845
Using a different character as a field delimiter will make our script easier to write if it means we can avoid encoding literal tab characters into regex. Since . is a regular expression meta-character we will still have to escape it.
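
To see why the escape matters, here is a rough sketch using one line of the sample data. The @ output delimiter and the variable names are only there to make the field boundaries visible.

```shell
line='Ohio.Columbus.March 1, 1803'

# Escaped dots match only literal "." characters, so the fields split where we expect.
escaped=$(printf '%s\n' "$line" |
  sed 's/^\(..*\)\.\(..*\)\.\(..*\)$/\1@\2@\3/')

# Unescaped dots match any character, so the greedy first group runs
# almost to the end of the line and the "fields" land in the wrong places.
unescaped=$(printf '%s\n' "$line" |
  sed 's/^\(..*\).\(..*\).\(..*\)$/\1@\2@\3/')

printf '%s\n%s\n' "$escaped" "$unescaped"
```

The escaped version should print Ohio@Columbus@March 1, 1803; the unescaped one still matches the line, but splits it near the end of the date instead of at the dots.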

Code:
$ sed --posix 's/^\(..*\)\.\(..*\)\.\(..*\)$/\1.\3/' dot-delimited-fields.txt
Ohio.March 1, 1803
California.September 9, 1850
Texas.December 29, 1845
If a different output field delimiter is desired, then simply insert it instead of . in replacement. Any string may be used, so neatly aligned columns are possible.

Code:
$ sed --posix 's/^\(..*\)\.\(..*\)\.\(..*\)$/\1'$'    \t''\3/' dot-delimited-fields.txt
Ohio            March 1, 1803
California      September 9, 1850
Texas           December 29, 1845
The $'<SPACE><SPACE><SPACE><SPACE>\t' bit is a Bash-ism which inserts four space characters followed by a single tab character. If your shell is not Bash, then you'll have to insert those characters by some other method to duplicate the results.
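
For shells without $'...', one option is to build the separator with printf and keep it in a variable. This is a sketch, assuming a POSIX printf; the names sep and aligned are arbitrary.

```shell
# Four spaces followed by one tab, built without any Bash-specific quoting.
# Command substitution strips trailing newlines only, so the tab survives.
sep=$(printf '    \t')

# The sed script is double-quoted so ${sep} expands; \$ keeps the end anchor literal.
aligned=$(printf 'Ohio.Columbus.March 1, 1803\nTexas.Austin.December 29, 1845\n' |
  sed "s/^\(..*\)\.\(..*\)\.\(..*\)\$/\1${sep}\3/")

printf '%s\n' "$aligned"
```

On a terminal with tab stops every eight columns, the dates should line up as in the Bash example above.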

Space delimited fields

Code:
Ohio<SPACE>Columbus<SPACE>March<SPACE>1,<SPACE>1803
California<SPACE>Sacremento<SPACE>September<SPACE>9,<SPACE>1850
Texas<SPACE>Austin<SPACE>December<SPACE>29,<SPACE>1845
Now there is some ambiguity as to which space characters constitute field delimiters. Guess what happens when we print field 3.

Code:
$ sed --posix 's/^\(..*\) \(..*\) \(..*\)$/\3/' space-delimited-fields.txt
1803
1850
1845
In regex we specified that the third group expression is immediately preceded by a single space character (outside the group expression) and anchored to the line end. The remaining parts of the date field were absorbed into the first and second group expressions. This is easily demonstrated by specifying a unique output field delimiter in replacement.

Code:
$ sed --posix 's/^\(..*\) \(..*\) \(..*\)$/\1@\2@\3/' space-delimited-fields.txt
Ohio Columbus March@1,@1803
California Sacremento September@9,@1850
Texas Austin December@29,@1845
Because of the ambiguity as to which space characters constitute field delimiters, we must specify which characters belong in which group expression.

Code:
$ sed --posix 's/^\([^ ][^ ]*\) \([^ ][^ ]*\) \([^ ][^ ]* [^ ][^ ]* [^ ][^ ]*\)$/\1@\2@\3/' space-delimited-fields.txt
Ohio@Columbus@March 1, 1803
California@Sacremento@September 9, 1850
Texas@Austin@December 29, 1845
Now let's print field 3.

Code:
$ sed --posix 's/^\([^ ][^ ]*\) \([^ ][^ ]*\) \([^ ][^ ]* [^ ][^ ]* [^ ][^ ]*\)$/\3/' space-delimited-fields.txt
March 1, 1803
September 9, 1850
December 29, 1845
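
Since only the last field contains embedded spaces in this data, a shorter variant is possible: pin down the first two fields with [^ ] and let the third group take the rest of the line. This is a sketch, not necessarily better in general, and it assumes the first two fields never contain spaces.

```shell
# Fields one and two may not contain spaces; field three is whatever follows the second space.
field3=$(printf 'Ohio Columbus March 1, 1803\nTexas Austin December 29, 1845\n' |
  sed 's/^\([^ ][^ ]*\) \([^ ][^ ]*\) \(..*\)$/\3/')

printf '%s\n' "$field3"
```

This should print the two dates in full, without spelling out how many words the date contains.
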
Variable length field separators

Code:
Ohio<SPACE><SPACE><SPACE><SPACE><SPACE><SPACE><SPACE>Columbus<SPACE><SPACE><SPACE>March<SPACE>1,<SPACE>1803
California<SPACE>Sacremento<SPACE>September<SPACE>9,<SPACE>1850
Texas<SPACE><SPACE><SPACE><SPACE><SPACE><SPACE>Austin<SPACE><SPACE><SPACE><SPACE><SPACE>December<SPACE>29,<SPACE>1845
(NOTE: Remember that each occurrence of <SPACE> is a single space character.)

When the file is printed in monospace, we get neatly aligned columns.

Code:
$ cat variable-space-delimited-fields.txt
Ohio       Columbus   March 1, 1803
California Sacremento September 9, 1850
Texas      Austin     December 29, 1845

Here is what happens when we run the script from the previous section against this file.

Code:
$ sed --posix 's/^\([^ ][^ ]*\) \([^ ][^ ]*\) \([^ ][^ ]* [^ ][^ ]* [^ ][^ ]*\)$/\1@\2@\3/' variable-space-delimited-fields.txt
Ohio       Columbus   March 1, 1803
California@Sacremento@September 9, 1850
Texas      Austin     December 29, 1845
We specified that the group expressions are separated by single space characters. The match failed for lines one and three. When a match fails for an s command, the input line is passed unchanged to output. The easy and obvious fix is to specify that the group expressions are separated by one or more space characters.

Code:
$ sed --posix 's/^\([^ ][^ ]*\)  *\([^ ][^ ]*\)  *\([^ ][^ ]* [^ ][^ ]* [^ ][^ ]*\)$/\1@\2@\3/' variable-space-delimited-fields.txt
Ohio@Columbus@March 1, 1803
California@Sacremento@September 9, 1850
Texas@Austin@December 29, 1845
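
Because unmatched lines pass through unchanged, a script that is too strict can fail without any warning. One way to make that visible is to print only the lines where the substitution actually happened, using the standard -n option together with the p flag of s. The two-line input below is a made-up mix of the earlier sample data.

```shell
# Line 1 uses runs of spaces between fields; line 2 uses single spaces.
input='Ohio       Columbus   March 1, 1803
Texas Austin December 29, 1845'

# -n suppresses the automatic output; the p flag prints a line only if s succeeded.
# The single-space script therefore prints line 2 and drops line 1.
matched=$(printf '%s\n' "$input" |
  sed -n 's/^\([^ ][^ ]*\) \([^ ][^ ]*\) \([^ ][^ ]* [^ ][^ ]* [^ ][^ ]*\)$/\1@\2@\3/p')

printf '%s\n' "$matched"
```

Only Texas@Austin@December 29, 1845 should appear, which makes the missing Ohio line easy to spot.
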
Now let's print fields one and three, and separate the output fields with four space characters followed by a single tab character.

Code:
$ sed --posix 's/^\([^ ][^ ]*\)  *\([^ ][^ ]*\)  *\([^ ][^ ]* [^ ][^ ]* [^ ][^ ]*\)$/\1'$'    \t''\3/' variable-space-delimited-fields.txt
Ohio            March 1, 1803
California      September 9, 1850
Texas           December 29, 1845
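
If the delimiters might be a mix of spaces and tabs, the POSIX character class [[:blank:]] (space or tab) can stand in for the literal space, and it avoids typing literal tabs altogether. This is a sketch with a made-up input line; it assumes, as before, that the first two fields contain no whitespace.

```shell
# Hypothetical line: fields separated by an arbitrary mix of spaces and tabs.
# The third group must start with a non-blank, so the delimiter run cannot leak into it.
mixed=$(printf 'Ohio \t Columbus\t\tMarch 1, 1803\n' |
  sed 's/^\([^[:blank:]][^[:blank:]]*\)[[:blank:]][[:blank:]]*\([^[:blank:]][^[:blank:]]*\)[[:blank:]][[:blank:]]*\([^[:blank:]].*\)$/\1@\2@\3/')

printf '%s\n' "$mixed"
```

This should print Ohio@Columbus@March 1, 1803.
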
What we've shown is that sed is a very powerful text transformer, even in portability mode. The GNU sed extensions further enhance the power and capabilities of sed, and could have made our work here much easier, at the cost of portability. At the very least, much effort and eye strain could have been saved by GNU sed's \t escape, which represents a tab character symbolically in regex.

Could a true sed master do better? Almost certainly. Submit your suggestions below.
