sed substitution with p flag

7stud · 03-02-2007, 05:57 PM

Hi,

Here is my file:

Quote:

Robert Robert Robert text
Robert text
Robert

I'm using this substitution:

Code:

sed -e 's/R* /R. /p' myFile

According to a tutorial:

http://uw714doc.sco.com/en/SHL_autom...functions.html

the p flag:

Quote:

Prints the line if a successful replacement was done. The p flag causes the line to be written to the output only if a substitution was actually made by the s function.

Here is the output I get from my sed command:

Quote:

RobertR. Robert Robert text
RobertR. Robert Robert text
RobertR. text
RobertR. text
Robert

I expected this:

Quote:

Robert Robert Robert text
Robert text

Can someone explain the output to me?

7stud · 03-03-2007, 02:45 AM

I figured out the problem. Contrary to what my man pages say, sed does NOT use basic regular expressions, i.e. where * represents any character, any number of times. Instead, * is a modifier of the preceding character, meaning the preceding character should be present 0 or more times. So the regex I used is matching 'R' zero times plus a space, which matches the space after 'Robert'.

As for the p flag, apparently it causes the "pattern space" to be output immediately if a substitution took place, and then the pattern space gets output again in the normal course of things before the next line is processed.

jschiwal · 03-03-2007, 04:15 AM

If you want to use sed to filter out lines that don't match a pattern, use the '-n' option and then use the p flag to output the lines. That is why you had two lines output instead of one.

Guess what, regular expressions have their own manpage! man 7 regex.
When the sed manpage refers to basic regular expressions, it means that sed uses the same regular expressions as grep. This is versus the extended regular expressions that awk and egrep use. For example, using sed, this pattern is literal "(abc|def)" including the "()|" characters. Using extended regular expressions, the pattern matches either "abc" or "def". You can use some of the extended features by escaping them, as in "^a\{8\}". Using a wildcard like * to represent any characters like the shell does is called globbing. In sed, if you want to match the decimal point, you need to escape it in a regular expression so that it doesn't represent a character wildcard.
/\.mpg/

Suppose that you have downloaded a lot of podcasts on your laptop and you are running out of room. You use k3b to burn a CD full of podcasts and save the project as podcasts1.k3b.
Now you want to use the saved project to delete the backed up files. Using unzip to extract maindata.xml from podcasts.k3b, you notice a bunch of lines like:
<file name="sn0014.mp3" >
<url>/home/auser/podcasts/sn0014.mp3</url>

You want to extract the names of the files and pipe that to an "rm" command removing just the files backed up.
So you want to ignore all the configuration information and just get lines like:
/home/auser/podcasts/sn0014.mp3

This oneliner will extract the file information:
sed -n -e '/<url>/s/^<url>\(.*\)<\/url>/\1/p' maindata.xml

Some of the titles might contain white space or special characters, which will cause a problem, so if the output was null separated like the -print0 option of the "find" command that would be great:

sed -n -e '/<url>/s/^<url>\(.*\)<\/url>/\1/p' maindata.xml | tr '\n' '\000' | xargs -L 500 -0 rm

Only the lines starting with "<url>" match. So they are the only ones processed.
After using this one-liner for a while you may come to some instances where the rm command can't find the file and the output contains something like ">". Some characters are used by xml and so they need to be escaped. The ">" string is how they are escaped. The '<' and '>' symbols are also escaped. I'll leave it to you to add terms to the sed command to convert them back.
Sed works best when there are clearly defined patterns. The <url> and </url> strings are great anchors. The characters between them are saved "\(...\)", used as the replacement "\n", and printed; "p".

Regular expressions can be very ugly however. So ugly that they are easier to write than to read. What is important is noticing the patterns that exist in the text source, and use those patterns to decide which lines to process and as anchors to "contain" the regular expression's wild cards.

Here is another example. Lets look at the hardware information on my wireless device:

Code:

 /sbin/lspci -v | sed -n '/Wireless/,/^$/p'
02:02.0 Network controller: Broadcom Corporation BCM4306 802.11b/g Wireless LAN Controller (rev 03)
        Subsystem: Hewlett-Packard Company NX9500 Built-in Wireless
        Flags: bus master, fast devsel, latency 64, IRQ 217
        Memory at e0104000 (32-bit, non-prefetchable) [size=8K]

I didn't have to wade though pages of output.

I thought these examples would provide more real life examples and show how handy sed is as a filter for on the fly one-liners.

I wish the sed manual was better written with more typical examples such as multiline substitutions. The manual for awk "GAWK: Effective Awk Programming" is an excellent book. If you want to learn awk, I would highly recommend downloading the source and generating the PS or PDF manual from the source .texi files.

Often all it takes to generate print worthy documentation is "./configure && make pdf". Then look in the doc/ subdirectory for the pdf manual.