Sed and printing only part of a line

jrdioko · 07-26-2005, 08:21 PM

There've been a few times when I've wanted to do something like this, but all the pages I've read about sed still leave me confused about how it's supposed to be done. I have some output that I pipe through grep first to get the lines I want. I then want to only print the part of that line between two phrases I already know. For example, something prints the following:

This is the output of foo
Here is some useless information
Here is some important: DESIREDOUTPUT, information that you want

I can "grep important" to get the last line, but I want to use sed to tell it to print everything between "important: " and ", information". I'd like to be able to do this even if I don't know exactly what is before "important: " and after ", information" and I don't know how many words or what characters DESIREDOUTPUT contains.

MensaWater · 07-27-2005, 08:32 AM

You're making it hard on yourself. If the file always has ":" after important and "," after DESIREDOUTPUT you can use awk instead:

grep important filename |awk -F: '{print $2}' | awk -F, '{print $1}'

The first awk says to set the delimiter to colon (instead of the default, which is white space). This would split the entry into two fields. Everything up to "important" is the first and everything after the colon is the second.

The second awk says to split the remainder using the comma as delimiter at which point your DESIREDOUTPUT becomes the first field.

Alternatively if the lines always have the same number of words you can just do
grep important filename |awk '{print $5}' which would print the 5th word (as delimited by white space). If you need to get rid of the comma in DESIREDOUTPUT, you'd have to do the pipe shown above.
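For anyone who wants to try the awk approach above, here is a minimal sketch run against the sample text from the first post. The helper name sample_output is made up for illustration. Note that the two-stage split leaves the space that followed the colon at the front of the result; the single-awk variant is an alternative (not from the post) that uses ": " or ", " as the separator, so the space is consumed.

```shell
# sample_output is a hypothetical helper that prints the sample text from the first post.
sample_output() {
    printf '%s\n' \
        'This is the output of foo' \
        'Here is some useless information' \
        'Here is some important: DESIREDOUTPUT, information that you want'
}

# Two-stage split as described above: keep the text after the first colon,
# then keep the text before the first comma. The result starts with a space.
sample_output | grep important | awk -F: '{print $2}' | awk -F, '{print $1}'

# One awk process: select lines containing "important" and split on
# either ": " or ", " so the second field is exactly the wanted text.
sample_output | awk -F': |, ' '/important/ {print $2}'
```

Both variants still assume the wanted text contains no colon or comma of its own.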

theYinYeti · 07-27-2005, 08:41 AM

Quote:

Originally posted by jlightner You're making it hard on yourself. If the file always has ":" after important and "," after DESIREDOUTPUT you can use awk instead:...

I disagree. First, what he gave was obviously an example, so maybe it is not ':' and ',' that will actually precede and follow. Next, it is less secure: who knows, DESIREDOUTPUT may itself contain those characters. Finally, I actually find the sed solution to be simpler.
Here it is:

Code:

... | grep 'important' | sed 's/^.*important: \(.*\), information.*$/\1/'

Or even simpler (no grep):

Code:

... | sed -n 's/^.*important: \(.*\), information.*$/\1/p'

Yves.
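A quick way to check the sed command above is to feed it the sample text, plus a line where the wanted text itself contains a colon and commas. The wrapper name between is hypothetical, used only to avoid repeating the expression.

```shell
# between is a hypothetical wrapper around the sed command from the post.
# -n suppresses normal output; the trailing p prints only lines where the
# substitution succeeded, which is why no separate grep is needed.
between() {
    sed -n 's/^.*important: \(.*\), information.*$/\1/p'
}

printf '%s\n' \
    'Here is some useless information' \
    'Here is some important: DESIREDOUTPUT, information that you want' | between
# prints: DESIREDOUTPUT

# Colons and commas inside the wanted text are kept.
printf '%s\n' 'x important: a, b: c, information y' | between
# prints: a, b: c
```

Because .* is greedy, a line that contains ", information" more than once is captured up to the last occurrence.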

jrdioko · 07-27-2005, 08:31 PM

Thank you both. What is the difference between awk and sed anyway? I looked at both and it seems both work differently, are better for different things, but essentially accomplish the same goals. Is it important to know both well for dealing with things like that or can one handle most?

-- EDIT --
Also, does * match one character and .* match any number of characters?

theYinYeti · 07-28-2005, 02:24 AM

In short, sed is a line editor: each line is read one by one (in the "pattern space"), and you do what you want using mostly regular expressions. sed has no variables, very few functions, and only one "buffer" for storing data (the "hold space").

awk is more of a programming language: it has some C-like functions and variables. It also has the built-in ability to split a line into fields, or output a line made of fields, using either separators of your choice, or fixed-length widths. awk, like sed, reads lines one by one.

Yves.
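To make the difference described above concrete, here is a small sketch on made-up input (the data helper and its contents are assumptions, not from the thread). sed rewrites each line with a regular expression; awk splits each line into fields and can keep a running value in a variable.

```shell
# data is a hypothetical helper: one name and one number per line.
data() { printf '%s\n' 'apples 3' 'pears 5' 'plums 4'; }

# sed: edit each line in the pattern space, here swapping the two words.
data | sed 's/^\([a-z]*\) \([0-9]*\)$/\2 \1/'
# prints: 3 apples / 5 pears / 4 plums (one per line)

# awk: fields are split automatically ($1, $2) and variables persist
# across lines, so a total is a one-liner.
data | awk '{ total += $2 } END { print total }'
# prints: 12
```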

MensaWater · 07-28-2005, 07:47 AM

Generally ? is a metacharacter matching one character and * is a metacharacter matching any number of characters. Do man on egrep and awk and go to the section for "Regular Expressions" for more detail. grep typically doesn't do well with the metacharacters in most flavors of Unix/Linux but egrep has more support for regular expressions.

Example of metacharacter usage for the two you listed - Say you have files named:

charlie
charles
charlene
harlan

ls charl?e would find only charlie

ls charl*e would find charlie and charlene (of course so would ls *e).

ls ?har* would find the first three entries but not the last.

ls *arl* would find all four. (If these were the only four in the directory so would ls *)
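The examples above can be reproduced in a scratch directory. This is only a sketch: it assumes mktemp -d is available, and glob_in_dir is a hypothetical helper that lets the shell expand a pattern inside that directory. The shell, not ls, does the expansion, so printf shows the same names ls would receive.

```shell
# Create the four example files in a scratch directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/charlie" "$demo_dir/charles" "$demo_dir/charlene" "$demo_dir/harlan"

# glob_in_dir (hypothetical): $1 is deliberately left unquoted so the
# shell expands it as a filename pattern inside the scratch directory.
glob_in_dir() { ( cd "$demo_dir" && printf '%s\n' $1 ); }

glob_in_dir 'charl?e'   # charlie
glob_in_dir 'charl*e'   # charlene, charlie
glob_in_dir '?har*'     # charlene, charles, charlie
glob_in_dir '*arl*'     # all four names
```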

theYinYeti · 07-28-2005, 08:44 AM

jlightner did a good summary of shell patterns.

Regular expressions are another thing, though. And unfortunately, there are different variants.

In short, what is common to all:

REPLACED ITEMS:
^ stands for the beginning of the line
$ stands for the end of the line
. stands for any character
[any] stands for letter 'a' or 'n' or 'y'
[^any] stands for any letter except 'a', 'n', or 'y'.
( and ) are used to group things, that can thereafter be referred to elsewhere
\1 to \9 are references to the grouping () number 1 through 9 (\0 is the whole pattern)
[:group:] indicates a group of possible characters among which any can be chosen; 'group' can be 'space', 'digit', 'blank', 'print', 'alnum', 'alpha'...; this notation is only possible inside [...] or [^...]

QUANTIFIERS:
? after something means that this thing is there 0 or 1 time.
* after something means that this thing is there any number of times, including 0
{n} after something means that this thing is there n times
{m,} after something means that this thing is there m times or more
{m,n} after something means that this thing is there between m and n times

If no quantifier is used then the "thing" is there exactly 1 time.

Now the differences

Some applications (let's say group A) need the grouping ( and ) to be escaped like that: \( and \); same for the { and } quantifier delimiters.
For those, (, ), {, and } are simply standard characters.

Other applications (let's say group B) don't need those escapes.
So for those, normal characters (, ), {, and } have to be escaped with a \.

Additionally, perl-compatible regular expressions accept some useful shorthand notations. See the PHP manual for details: I find it well explained.

sed and grep (by default) are in group A. awk, egrep, Javascript and most text editors are in group B.

Yves.
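The escaped and unescaped styles described above can be compared side by side with grep, which uses the escaped style by default and the unescaped style with -E. This is a minimal sketch on a made-up test string; each grep -c prints the number of matching lines.

```shell
line='abab (x)'

# Escaped style (default grep and sed): \( \) group, \{ \} quantify.
echo "$line" | grep -c '\(ab\)\{2\}'     # 1: matches "abab"
# In that style, bare ( and ) are ordinary characters.
echo "$line" | grep -c '(x)'             # 1: matches the literal "(x)"

# Unescaped style (grep -E): ( ) group, { } quantify.
echo "$line" | grep -E -c '(ab){2}'      # 1: matches "abab"
# Here literal parentheses need a backslash instead.
echo "$line" | grep -E -c '\(x\)'        # 1: matches the literal "(x)"

# Back-references in a sed replacement: \1 and \2 are the two groups.
echo 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/'
# prints: world hello
```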

MensaWater · 07-28-2005, 08:50 AM

My post was intended to answer his questions regarding metacharacters. I mentioned regular expressions as they allow for more granular selections than the simple metacharacters do.

I was surprised to find that neither my Debian nor my RedHat have regexp man pages. Most Unix variants do.

archtoad6 · 07-28-2005, 11:54 AM

You could also buy a book, or 2.

One of the best buys I ever made was the 3rd edition (4th is the current) of Linux in a Nutshell from O'Reilly for US$4.98. It has good short chapters on both sed & awk. It also has the command reference (unlike the 2nd ed.) in one big chapter.

O'Reilly also publishes sed & awk & sed and awk Pocket Reference. (Ok, so that's 3 books.) I own the 1st 2 & can recommend them highly. I [sus|ex]pect the 3rd is equally good.

jrdioko · 07-28-2005, 12:43 PM

Thanks again. That looks like just the book to go out and buy, but I'm going to have to hold off until later. Thanks for the general regexp explanation, though. I've tried to learn some basics online but I get overwhelmed with the details.

Deedee393 · 05-17-2012, 09:47 AM

Quote:

Originally Posted by MensaWater

You're making it hard on yourself. If the file always has ":" after important and "," after DESIREDOUTPUT you can use awk instead:

grep important filename |awk -F: '{print $2}' | awk -F, '{print $1}'

How would you go about saving the output produced?

MensaWater · 05-17-2012, 10:05 AM

Quote:

Originally Posted by Deedee393

Quote:

Originally Posted by MensaWater

You're making it hard on yourself. If the file always has ":" after important and "," after DESIREDOUTPUT you can use awk instead:

grep important filename |awk -F: '{print $2}' | awk -F, '{print $1}'

How would you go about saving the output produced?

You really shouldn't append to ancient threads (this was from 2005). The only people likely to see it are those who originally subscribed and that assumes they are still around. It is better to open a new thread and if desired post a link to the old thread.

Having said that:

The way to save output from most commands is with redirection.
grep important filename |awk -F: '{print $2}' | awk -F, '{print $1}' >outputfile

You can name outputfile anything you want.

You should investigate "file descriptors" and "redirection" but the most important information is that file descriptor 1 is standard output (a/k/a stdout) and file descriptor 2 is standard error (a/k/a stderr). The ">outputfile" is actually shorthand for "1>outputfile" to redirect stdout into the file. It is common in scripts to redirect stderr to stdout to ensure both output types go to the same place:

grep important filename |awk -F: '{print $2}' | awk -F, '{print $1}' >outputfile 2>&1
The 2>&1 tells it to send stderr to same location as stdout.
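A small sketch of how these redirections behave, using a made-up command (emit) that writes one line to each stream and a temporary file created with mktemp. Both names are assumptions for illustration.

```shell
out=$(mktemp)

# emit is a hypothetical command: one line to stdout, one to stderr.
emit() { echo 'to stdout'; echo 'to stderr' >&2; }

# Only stdout lands in the file; stderr still goes to the terminal.
emit >"$out"

# Both streams land in the file.
emit >"$out" 2>&1

# Order matters: here stderr is pointed at the original stdout first,
# so only 'to stdout' ends up in the file.
emit 2>&1 >"$out"
```

In a pipeline, a redirection written at the end applies only to the last command, so 2>&1 there captures error messages from the final awk but not from grep.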

You can also use the information as a variable within a program by assigning the output of the command to a variable. (If it is more than one line or more than one word you'd have to investigate arrays to get most use out of it).

VAR=$(grep important filename |awk -F: '{print $2}' | awk -F, '{print $1}')
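Here is a minimal sketch of the assignment above when more than one line matches. The input file and its contents are made up for illustration, and a read loop is used in place of arrays so it also works in plain sh.

```shell
# Hypothetical input file with two matching lines.
file=$(mktemp)
printf '%s\n' 'x important: one, y' 'noise' 'x important: two words, y' >"$file"

VAR=$(grep important "$file" | awk -F: '{print $2}' | awk -F, '{print $1}')

# Quote the variable when printing, otherwise the line breaks are lost.
printf '%s\n' "$VAR"

# Handle one result per line; IFS= and -r keep each line exactly as is.
printf '%s\n' "$VAR" | while IFS= read -r item; do
    printf 'got:%s\n' "$item"
done
```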

I'd suggest you do a web search for "shell scripting tutorial". There are many available and they will give you a good start on the basics.

chrism01 · 05-17-2012, 07:15 PM

Seeing as this has been re-opened, the book on regex (imho) is here http://regex.info/
Also the orig qn sounds like using word boundary matches may help.