Deleting text between two different patterns

activeco · 09-12-2005, 03:23 PM

I want to put the command in a script, but although it seems a very simple task, I couldn't find a way to do it.
So, if I have some text in a file, on one line or across several, say: "asadgas<jk mjk bb><gjgksdlsl",
and I want to delete everything between "<jk" and ">" (in this case " mjk bb", usually of varying length), what would be the best way to do it from bash?
I prefer sed, as the files to process are pretty large and I would like to remove only the first matching instance and exit immediately, but of course any working solution is welcome.
I could easily do it in rexx or php, but I would like to stay in bash.
Thanks in advance for all replies.

bigrigdriver · 09-12-2005, 05:36 PM

According to the Advanced Bash-Scripting Guide, chapter 12, section 4, bash has limited text editing capability of its own. To expand that capability, you need to invoke sed, awk, or some other scripting language from the bash script.
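For text that is already in a shell variable, parameter expansion alone can cut out the first block. This is only a minimal sketch, assuming the whole text fits comfortably in a variable (probably not the case for very large files) and that an emptied "<jk>" tag should stay behind. The name strip_first_block is made up for illustration and does not come from the guide.

```shell
#!/bin/bash
# strip_first_block STRING   (hypothetical helper name)
# Empties the first "<jk ... >" block in STRING, leaving "<jk>".
strip_first_block() {
    local s=$1 head rest tail
    case $s in
        *"<jk"*">"*)
            head=${s%%"<jk"*}     # text before the first "<jk"
            rest=${s#*"<jk"}      # text after the first "<jk"
            tail=${rest#*">"}     # text after the first ">" that follows it
            printf '%s\n' "${head}<jk>${tail}"
            ;;
        *)
            printf '%s\n' "$s"    # no complete block: leave the text unchanged
            ;;
    esac
}

strip_first_block 'asadgas<jk mjk bb><gjgksdlsl'
# prints: asadgas<jk><gjgksdlsl
```

Shell patterns let * match newlines, so the same function also works when the block spans several lines inside the variable.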

activeco · 09-12-2005, 05:59 PM

Thanks bigrigdriver.
That is actually what I meant; how to do it with e.g. sed, awk or anything else built-in?

Tinkster · 09-12-2005, 06:54 PM

Quote:

Originally posted by activeco
Thanks bigrigdriver.
That is actually what I meant; how to do it with e.g. sed, awk or anything else built-in?

Something like this? :)

Code:

#!/bin/awk -f
# strip between BeginTag and EndTag
# usage: awk -v BeginTag="xxx" -v EndTag="yyy" -f strip.awk input > output


BEGIN{
        if (!BeginTag) {
           print "usage: awk -v BeginTag=\"xxx\" -v EndTag=\"yyy\" -f strip.awk input"
           exit;
        }
}

{
        if (Split) {    # still inside a block that began on an earlier line
                if ($0 ~ EndTag) {
                        $0=substr($0,index($0,EndTag)+length(EndTag))
                        Split=0
                }
                else $0=""
        }

        if ($0 ~ BeginTag){
                Line=substr($0,1,index($0,BeginTag)-1)
                if ($0 ~ EndTag) Line=Line substr($0,index($0,EndTag)+length(EndTag))"\n"
                else Split=1
                if (Line=="" || Line=="\n") Line="!@!@empty"
        }

        if (Line) {
                if (Line != "!@!@empty") printf "%s", Line   # "%s" keeps any % in the data literal
                Line=""
        }
        else print $0

}

Code:

$ echo "asadgas<jk mjk bb><gjgksdlsl"|awk -v BeginTag="<jk" -v EndTag=">" -f strip.awk
asadgas<gjgksdlsl

That what you want?

Cheers,
Tink

activeco · 09-13-2005, 10:32 AM

Yes Tinkster, I'll use it, although I expected a one-liner.

Thank you very much for your time.

Tinkster · 09-13-2005, 01:41 PM

Sorry, not all problems can be solved with a one-liner ;}

This one is highly re-usable, though!

Cheers,
Tink

jschiwal · 09-13-2005, 04:51 PM

It is having the pattern across multiple lines that makes things more complicated.

If the pattern, or more than one pattern, is contained on a single line, this one-liner will do it:

Code:

sed 's/<jk[^>]*>/<jk >/g' originalfile >newfile

When the match crosses lines, sed has to append lines to the pattern space until the end pattern is reached:

Code:

# remblock.sed
# remove <jk > block
s/<jk[^>]*>/<jk >/g     # handles pattern(s) on a single line
:t
/<jk/,/>/{              # handle multilines between '<jk' and '>'
        />/!{           # not at the end marker '>'
                $!{     # this isn't the last line of the file
                        N
                        bt
                }       # add the next line to the pattern space and branch back to ":t"
        }
        s/<jk[^>]*>/<jk >/g
}

This script isn't too long. It may need tweaking in the case where the first end pattern is on a line, with the next start pattern on the same line. It does handle the cases where the pattern is on the same line, where more than one pattern is on the same line, where the pattern stretches across multiple lines.

You would call this program like:
sed -f remblock.sed originalfile >outputfile

If it is thoroughly tested and trusted, you could use inplace editing:
sed -i -f remblock.sed originalfile
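One possible way to cover the case mentioned above, where an end pattern and the next start pattern share a line, is to test for a '<jk' that is still unclosed at the end of the pattern space instead of testing for any '>'. This is a sketch only, assuming GNU sed (where [^>] also matches an embedded newline); the function name remblock_unclosed is made up for illustration.

```shell
#!/bin/bash
# remblock_unclosed [FILE...]   (hypothetical helper name)
# Replaces every "<jk ... >" block with "<jk >", including blocks that
# span lines and blocks that start on the line where another one ends.
remblock_unclosed() {
    sed '
:t
/<jk[^>]*$/{
    $!{
        N
        bt
    }
}
s/<jk[^>]*>/<jk >/g
' "$@"
}

# The first block closes on line 1, the second one opens there and closes on line 2.
printf 'a<jk x> b<jk y\nz>c\n' | remblock_unclosed
# prints: a<jk > b<jk >c
```

The address /<jk[^>]*$/ matches only while some '<jk' has no '>' after it, so N keeps appending lines until that block is closed or the file ends. A '<jk' that is never closed would pull the rest of the file into the pattern space, which matters for large files.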

activeco · 09-13-2005, 06:05 PM

Quote:

Originally posted by jschiwal

Code:

sed 's/<jk[^>]*>/<jk >/g' originalfile >newfile

I like the thinking in this solution for the single-line case. I had already played with sed's substitute command, but didn't think of the simple and obvious approach of supplying the final text as the replacement string. Somewhere in the back of my mind I always treated it as an unknown string, when in fact it is not.
The only thing I probably don't need here is the /g flag, as I need only the first pattern to be matched.

Well, thanks again guys.
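For the "first match only" requirement, one option is to substitute once and then copy the remaining lines without testing them again. This is a sketch, assuming the first block sits on a single line; strip_first is a made-up name. Quitting sed outright after the match would cut off the rest of the output, so the loop below keeps copying lines instead.

```shell
#!/bin/bash
# strip_first [FILE...]   (hypothetical helper name)
# Replaces only the first "<jk ... >" block in the input with "<jk >".
strip_first() {
    sed '
/<jk[^>]*>/{
    s/<jk[^>]*>/<jk >/
:rest
    n
    b rest
}
' "$@"
}

printf '%s\n' 'a<jk one>b' 'c<jk two>d' | strip_first
# prints:
# a<jk >b
# c<jk two>d
```

Without the g flag, s changes only the first block on the matching line. After that, n prints the current line and fetches the next one, and the branch back to :rest repeats this until the input ends, so no later line is matched against the pattern.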