[SOLVED] sed regex and removing 'whitespace'

uncle-c · 04-22-2012, 01:30 PM

Was just reading the classic 'Sed One Liners' and I came up with this problem.

Code:

 $ cat file
one
  two
    three
 $

Could someone explain why

Code:

 $ sed 's/^[ \t]*//' file

removes any leading tabs and spaces, whereas

Code:

 $ sed 's/^[ \t]+//' file

does not? What is the subtle difference between the two that causes only the former to remove leading whitespace from each line?

Snark1994 · 04-22-2012, 01:42 PM

Because in a basic regular expression an unescaped '+' matches a literal plus sign. With GNU sed you need to escape it:

Code:

sed 's/^[ \t]\+//' file
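
A quick way to see the difference for yourself. This is only a sketch and assumes GNU sed, since both '\t' inside brackets and '\+' are GNU extensions; the sample strings are made up for illustration.

```shell
# A line that really does start with a space followed by a plus sign
plus_line=' +x'
# A line indented with two spaces
indented='  two'

# Unescaped '+' in a basic regex is a literal plus: only ' +' is removed
literal_hit=$(printf '%s\n' "$plus_line" | sed 's/^[ \t]+//')
# The indented line contains no '+', so nothing matches and it is unchanged
literal_miss=$(printf '%s\n' "$indented" | sed 's/^[ \t]+//')
# Escaped '\+' means "one or more", so the indentation is stripped
escaped=$(printf '%s\n' "$indented" | sed 's/^[ \t]\+//')

printf '[%s] [%s] [%s]\n' "$literal_hit" "$literal_miss" "$escaped"
# prints: [x] [  two] [two]
```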

David the H. · 04-22-2012, 10:58 PM

To be more specific, it's the difference between basic and extended regular expressions. grep and sed use basic regex by default, where operators such as '+', '?' and '|' have no special meaning.

But GNU grep and sed extend basic regex so that you can "activate" the special meanings of these characters by backslashing them. Perhaps a better way to do it, however, is to switch to extended regex with "grep -E" and "sed -r" (GNU sed also accepts "sed -E"). Then the behavior becomes reversed; the special meanings are enabled by default, and backslash escaping them makes them literal.

Code:

sed -r 's/^[ \t]+//' file

The grep man page goes into good detail about basic vs. extended regex.
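
The "reversed" behavior can be checked with a small experiment. This is a sketch that assumes GNU sed (for '\t' inside brackets); it uses -E, which GNU sed accepts as a synonym for -r, and the sample strings are invented.

```shell
indented='  two'
plus_line=' +x'

# With -E, a bare '+' is the "one or more" operator
ere_bare=$(printf '%s\n' "$indented" | sed -E 's/^[ \t]+//')
# With -E, a backslashed '\+' is a literal plus sign
ere_escaped=$(printf '%s\n' "$plus_line" | sed -E 's/^[ \t]\+//')
# With -E, the bare '+' consumes only the leading space; the literal '+' stays
ere_keeps_plus=$(printf '%s\n' "$plus_line" | sed -E 's/^[ \t]+//')

printf '[%s] [%s] [%s]\n' "$ere_bare" "$ere_escaped" "$ere_keeps_plus"
# prints: [two] [x] [+x]
```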

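Incidentally, if all you want to do is remove all instances of (a) certain character(s), you'll get better performance with tr. Note that tr takes a plain list of characters rather than a bracket expression, and that it deletes them everywhere in the input, not just at the start of each line.

Code:

tr -d ' \t' <file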
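
One thing worth checking with tr is what the set actually contains: square brackets in the set are ordinary characters, so writing '[ \t]' deletes any '[' and ']' in the data as well. The sketch below uses a made-up sample line and compares that with a plain set and with an anchored sed.

```shell
# Sample line: two leading spaces, brackets, a tab, and an inner space
sample=$(printf '  a [b]\tc d')

# Brackets inside the set are ordinary characters, so they get deleted too
with_brackets=$(printf '%s\n' "$sample" | tr -d '[ \t]')
# A plain set deletes only spaces and tabs, but everywhere in the line
plain_set=$(printf '%s\n' "$sample" | tr -d ' \t')
# sed anchored with ^ touches only the leading whitespace
leading_only=$(printf '%s\n' "$sample" | sed 's/^[[:blank:]]*//')

printf '[%s]\n' "$with_brackets" "$plain_set" "$leading_only"
# first two lines printed: [abcd] and [a[b]cd]
```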

uncle-c · 04-23-2012, 03:27 AM

Cheers guys. I had been using tr but knew that there was a method using sed. It was only when I read the Sed One Liners page that the '+' problem got me thinking. Could you somehow use a whitespace character class, [:space:], instead of '[ \t]' to achieve the same result?

colucix · 04-23-2012, 03:52 AM

Quote:

Originally Posted by uncle-c

Could you somehow use a whitespace character class, [:space:], instead of '[ \t]' to achieve the same result?

Yes. Character classes work in both basic and extended regexps; the -r here is only needed for the unescaped '+':

Code:

sed -r 's/^[[:space:]]+//' file
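
For what it's worth, the bracketed class itself does not need -r. The sketch below (sample input invented) strips leading whitespace three ways: with '*' and with the POSIX interval '\{1,\}' in basic regex, and with '+' in extended regex via -E.

```shell
# Three lines: no indent, two spaces, a tab plus a space
input=$(printf 'one\n  two\n\t three')

# Basic regex, zero or more whitespace characters
bre_star=$(printf '%s\n' "$input" | sed 's/^[[:space:]]*//')
# Basic regex, one or more via the POSIX interval syntax
bre_interval=$(printf '%s\n' "$input" | sed 's/^[[:space:]]\{1,\}//')
# Extended regex, one or more via '+'
ere_plus=$(printf '%s\n' "$input" | sed -E 's/^[[:space:]]+//')

printf '%s\n' "$bre_star"
# prints the three words, one per line, with no indentation
```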

David the H. · 04-23-2012, 08:25 AM

Note that the [:space:] character class covers several other characters as well; the full list being tab, newline, vertical tab, form feed, carriage return, and space. There's also [:blank:], which contains only the regular space and tab characters, and so is equivalent to the '[ \t]' bracket expression used earlier in the thread.
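
The difference between the two classes shows up on input with a stray carriage return, such as a line from a DOS-formatted file. A minimal sketch with a made-up line, comparing string lengths so the invisible characters can be counted:

```shell
# A line that starts with a carriage return followed by a tab
line=$(printf '\r\tthree')

# [:blank:] is only space and tab, so the leading CR blocks the match
blank=$(printf '%s\n' "$line" | sed 's/^[[:blank:]]*//')
# [:space:] also covers carriage return, so both characters are removed
space=$(printf '%s\n' "$line" | sed 's/^[[:space:]]*//')

# Lengths of the original, the [:blank:] result, and the [:space:] result
printf '%s %s %s\n' "${#line}" "${#blank}" "${#space}"
# prints: 7 7 5
```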

The grep info page is one place you'll find definitions for what the various classes cover.