Problems defining SED pattern over multiple lines

dj_bridges · 10-29-2007, 04:35 AM

I am trying to run a sed expression that removes the data between two patterns. Unfortunately I can't get sed to recognise the two patterns unless they are on the same line. I believe there is a way to concatenate lines, but I am struggling with it. The data file is a BibTeX file of references, and I want to remove all occurrences of the note field. I therefore want to match:
First Pattern = ^\tnote
Second pattern = \},.$
and anything in between = .*

This works perfectly when there are no line breaks between the first and second pattern. The question is: how do I get sed to recognise the two patterns across multiple lines?

Any help / tips would be greatly appreciated.....

linuxgeek_ch · 10-29-2007, 04:48 AM

You can use the N command to append the next line to the current one. For more info, see http://www.unix.org.ua/orelly/unix/sedawk/ch06_01.htm

ghostdog74 · 10-29-2007, 04:49 AM

have a read here

dj_bridges · 10-29-2007, 08:04 AM

Thanks for the links guys. I had been trying to use the N command, but have been failing miserably. Here are some of my attempts (replacing the matched lines with TEST for clarity):

sed -e /^\tnote/N s/^\tnote*/n*,.$/TEST/ Old > New
sed -e 's/^\tnote*/n*'\},.'$/TEST/' Old > New

Am going to try and replace using simpler terms e.g. words, over multiple lines, but would appreciate any other pointers....

ghostdog74 · 10-29-2007, 08:21 AM

best to show your sample file and your expected output.

pixellany · 10-29-2007, 09:29 AM

SED does all of its pattern matching in the "pattern space". The default is to read in one line to the pattern space, perform a test, then read in the next line.

To look for a pattern crossing more than one line, you first have to use "N" to append one or more lines, then perform the tests. Here's a crude example (not tested):

cat filename | sed '{/^The/ {N; s/\n//; s/a.*b/_/g}}'

Translation:
Read filename into sed
For each line beginning with "The":
...Append another line
...Remove the newline that N inserted between the two lines
...Find all occurrences of "a.*b" and replace them with "_"

"a.*b" = "the letter a, followed by any number of characters, then the letter b"

Here is the best tutorial on SED that I have seen: http://www.grymoire.com/Unix/Sed.html
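
For readers who want to try the idea, here is a minimal tested variant of the sketch above. The three-line input is made up for illustration, `$!N` is used instead of a bare `N` so the last line is not affected by how a given sed handles `N` at end of input, and the newline is replaced with a space rather than removed. Treat it as a sketch under those assumptions, not as the poster's exact command.

```shell
#!/bin/sh
# Hypothetical input, invented for this demonstration.
input='The cat sat
on a big mat
The end'

# /^The/   select lines that start with "The"
# $!N      unless this is the last line, append the next input line
#          (the two lines are now separated by an embedded newline)
# s/\n/ /  replace that embedded newline with a space
# s/a.*b/_/ match from the first "a" to the last "b", now across both lines
result=$(printf '%s\n' "$input" | sed '/^The/{
$!N
s/\n/ /
s/a.*b/_/
}')

printf '%s\n' "$result"
```

The first two input lines are joined before the substitution runs, so `a.*b` can span what used to be a line break. The last line has no following line and no match, so it passes through unchanged.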

makyo · 10-29-2007, 10:18 AM

Hi.

An alternative with awk:

Code:

#!/usr/bin/env sh

# @(#) s2       Demonstrate deletion of bounded text, even across lines.

set -o nounset
echo

debug=":"
debug="echo"

## Use local command version for the commands in this demonstration.

echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version bash awk

echo

FILE=${1-data2}

# First Pattern = ^\tnote
# Second pattern = \},.$

echo " Input file:"
cat -A $FILE

echo
echo " Results from awk:"
awk '
/^\tnote/,/\},.$/       { print "deleted: " $0; next }
1       { print "kept   : " $0 }
' $FILE

exit 0

Producing:

Code:

% ./s2

(Versions displayed with local utility "version")
GNU bash 2.05b.0
GNU Awk 3.1.4

 Input file:
beginning of text sample$
note - text that should not be deleted.$
^Inote - text  that SHOULD be deleted all on one line },.$
Another line$
^Inote - text that SHOULD be$
deleted and crosses lines (stuff) },.$
More lines - one$
two$
end of text sample$

 Results from awk:
kept   : beginning of text sample
kept   : note - text that should not be deleted.
deleted:        note - text  that SHOULD be deleted all on one line },.
kept   : Another line
deleted:        note - text that SHOULD be
deleted: deleted and crosses lines (stuff) },.
kept   : More lines - one
kept   : two
kept   : end of text sample

cheers, makyo
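
The script above labels each line as kept or deleted. As a sketch of how the same awk range could be used as an actual filter, the example below drops the matched lines and prints the rest. The `make_sample` helper and its shortened data are assumptions modelled on the test file in this post, not the original `data2` file.

```shell
#!/bin/sh
# Hypothetical helper: prints sample data modelled on the thread's test file.
make_sample() {
  printf 'beginning of text sample\n'
  printf 'note - text that should not be deleted.\n'
  printf '\tnote - deleted, all on one line },.\n'
  printf 'Another line\n'
  printf '\tnote - text that should be\n'
  printf 'deleted and crosses lines (stuff) },.\n'
  printf 'end of text sample\n'
}

# Range pattern: from a line starting with TAB + "note" through a line
# ending in "}," plus one character. "next" skips those lines; the second
# rule prints everything else.
result=$(make_sample | awk '/^\tnote/,/\},.$/ { next } { print }')

printf '%s\n' "$result"
```

Because awk tests the end pattern on the same line that opened the range, a note field that starts and ends on one line is removed without swallowing the line after it.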

makyo · 10-29-2007, 01:00 PM

Hi.

Here are two approaches with sed:

Code:

#!/bin/sh -

# @(#) s1       Demonstrate text deletion over a pattern-pattern # range.

echo
echo " (Versions displayed by local command \"version\")"
version sh sed cat

FILE=${1-data2}

echo
echo " Input file:"
cat -A $FILE

# First Pattern = ^\tnote
# Second pattern = \},.$

echo
echo " Results from sed (simple approach):"
sed '/^\tnote/,/},\.$/d' $FILE

echo
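echo " Hold buffer approach:"
# Note: despite the label, this uses N to join lines in the pattern space;
# the hold buffer (h, H, g, G, x) is not involved.
sed '{/^\tnote/ {N; s/\n//; /\tnote.*},\./d}}' $FILE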
# based on:
# sed '{/^The/ {N; s/\n//; s/a.*b/_/g}}' $FILE

exit 0

Producing:

Code:

% ./s1

 (Versions displayed by local command "version")
GNU bash, version 2.05b.0(1)-release (i386-pc-linux-gnu)
GNU sed version 4.1.2
cat (coreutils) 5.2.1

 Input file:
beginning of text sample$
note - text that should not be deleted.$
^Inote - text  that SHOULD be deleted all on one line },.$
Another line$
^Inote - text that SHOULD be$
deleted and crosses lines (stuff) },.$
More lines - one$
two$
end of text sample$

 Results from sed (simple approach):
beginning of text sample
note - text that should not be deleted.
More lines - one
two
end of text sample

 Hold buffer approach:
beginning of text sample
note - text that should not be deleted.
More lines - one
two
end of text sample

In both cases, a section of text was deleted that was probably not desired. I think this is the result of greedy matching. Perhaps someone will drop by with a way around this (or a correction), but I'd go with the awk, or something in perl ... cheers, makyo
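
One likely explanation for the simple approach, other than greedy matching, is how sed handles address ranges: when the end address is a regular expression, sed starts looking for it on the line after the one that opened the range. A note field that begins and ends on the same line therefore keeps the range open until the next line ending in `},.`, which takes "Another line" with it. The sketch below shows that behaviour and one possible single-pass alternative that loops with `N` until the terminator is in the pattern space. The `TAB` variable and `make_sample` helper are assumptions introduced for the example, and the data is a shortened stand-in for the test file.

```shell
#!/bin/sh
# Literal tab, so the sed scripts do not depend on \t support in regexes.
TAB=$(printf '\t')

# Hypothetical helper: prints sample data modelled on the thread's test file.
make_sample() {
  printf 'beginning of text sample\n'
  printf 'note - text that should not be deleted.\n'
  printf '\tnote - deleted, all on one line },.\n'
  printf 'Another line\n'
  printf '\tnote - text that should be\n'
  printf 'deleted and crosses lines (stuff) },.\n'
  printf 'end of text sample\n'
}

# Plain range delete: the end pattern is first tested on the line AFTER the
# start line, so the one-line note opens a range that runs on to line 6.
range_result=$(make_sample | sed "/^${TAB}note/,/},.\$/d")

# Loop variant, applied to each line starting with TAB + "note":
#   :a         label marking the top of the loop
#   /},.$/!{   if the pattern space does not yet end in "}," + one character
#   N; ba      append the next line and jump back to the label
#   d          terminator found: delete the whole accumulated field
result=$(make_sample | sed "/^${TAB}note/{
:a
/},.\$/!{
N
ba
}
d
}")

printf 'range delete:\n%s\n\nloop delete:\n%s\n' "$range_result" "$result"
```

The loop variant keeps "Another line" because the terminator is tested on the opening line before anything is appended. If a note field is never terminated, `N` eventually runs out of input, and what happens then differs between sed implementations, so real data is worth checking for that case.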

dj_bridges · 10-30-2007, 06:36 AM

Makyo and everyone else,

Thank you so much for taking the time to help me. I managed to get the script to work using the following commands:

sed -e '/^\tnote/,/\},.$/d' \
-e '{/^\tnote/ {N; s/\n//; /\tnote.*},\./d}}' old > new

That was driving me nuts, so thanks for saving my sanity......

makyo · 10-30-2007, 11:14 AM

Hi.

The results of running the script in post #9 also deleted the text between the 2 note sequences in my test file.

Are you sure that it is working the way you want? ... cheers, makyo

dj_bridges · 10-31-2007, 05:07 AM

Well spotted makyo - think I need to pay closer attention. OK so I have been fiddling around and the problem as you say is greedy matching when the two patterns are on one line. One (clumsy) workaround is to run sed twice as follows:

sed -e '/^\tnote.*\},./D' file > file1

Then again:

sed -e '{/^\tnote/ {N; s/\n//; /\tnote.*},\./d}}' \
-e '/^\tnote/,/\},.$/d' file1 > file2

I thought that had solved the problem, but this has just shifted the problem on to where the patterns are spread over two lines e.g.

Input:

First Line
\tnote = blah blah blah
blah blah},
Next Line
Last Line

Output

First Line
Last Line

Perhaps I need to start again with a different tool e.g. Awk or Perl, but the thought of starting again from scratch while learning something else is not very appealing....

dj_bridges · 10-31-2007, 05:18 AM

Ok I think it is finally solved. This solution will probably offend all the pure programmers, but I just need something that works. Anyway this script seems to work:

sed -e '/^\tnote.*\},./D' FILE1 > FILE2

sed -e '{/^\tnote/ {N; s/\n//; /\tnote.*},\./d}}' \
-e '/^\tnote.*\},./D' \
-e '/^\tnote/,/\},.$/d' FILE2 > FILE3

I have tested it with a large file and it works throughout.

Fingers crossed that I haven't missed anything...

makyo · 10-31-2007, 08:06 AM

Hi.

Yes, that seems to work. No doubt you would have thought of this minor improvement to combine both sed commands into a pipeline to avoid the extra intermediate file:

Code:

#!/bin/sh -

# @(#) user3    Demonstrate delete across lines with piping.

FILE=${1-data2}

echo
echo " Input file:"
my-nl $FILE

echo
echo " Results from sed (piped):"
sed -e '/^\tnote.*\},./D' $FILE |
sed -e '{/^\tnote/ {N; s/\n//; /\tnote.*},\./d}}' \
-e '/^\tnote.*\},./D' \
-e '/^\tnote/,/\},.$/d'

exit 0

Producing:

Code:

% ./user3

 Input file:

==> data2 <==

  1 beginning of text sample
  2 note - text that should not be deleted.
  3     note - text  that SHOULD be deleted all on one line },.
  4 Another line
  5     note - text that SHOULD be
  6 deleted and crosses lines (stuff) },.
  7 More lines - one
  8 two
  9 end of text sample

 Results from sed (piped):
beginning of text sample
note - text that should not be deleted.
Another line
More lines - one
two
end of text sample

I generally advise people to make it right, then -- if necessary -- make it run faster. The same goes for elegance, beauty, etc ... cheers, makyo
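
In the spirit of "make it right" first, one way to gain confidence in the piped sed commands is to compare their output with the awk range filter on the same input. The sketch below does that on a shortened stand-in for the test file. `TAB` and `make_sample` are assumptions introduced for the example, a literal tab replaces `\t`, and `}` is left unescaped in the sed patterns. Otherwise the sed commands follow the pipeline in this post.

```shell
#!/bin/sh
# Literal tab, so the sed scripts do not depend on \t support in regexes.
TAB=$(printf '\t')

# Hypothetical helper: prints sample data modelled on the thread's test file.
make_sample() {
  printf 'beginning of text sample\n'
  printf 'note - text that should not be deleted.\n'
  printf '\tnote - deleted, all on one line },.\n'
  printf 'Another line\n'
  printf '\tnote - text that should be\n'
  printf 'deleted and crosses lines (stuff) },.\n'
  printf 'end of text sample\n'
}

# Stage 1 removes note fields that start and end on one line.
# Stage 2 joins a note line with its successor and deletes the pair,
# then applies the remaining two commands from the thread.
sed_out=$(make_sample |
  sed -e "/^${TAB}note.*},./D" |
  sed -e "/^${TAB}note/{
N
s/\n//
/${TAB}note.*},\./d
}" -e "/^${TAB}note.*},./D" -e "/^${TAB}note/,/},.\$/d")

# Reference result from the awk range filter.
awk_out=$(make_sample | awk '/^\tnote/,/\},.$/ { next } { print }')

if [ "$sed_out" = "$awk_out" ]; then
  status="outputs match"
else
  status="outputs differ"
fi
printf '%s\n' "$status"
```

Agreement on a small sample is only a sanity check. Inputs with note fields spanning three or more lines, or a note field on the last line of the file, would need their own test cases.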