[SOLVED] A sed bug? Or just me?

rbees · 01-20-2019, 09:18 AM

I am trying to remove most, but not all, punctuation from a series of text files. The files are in Hebrew, which I assume means they are Unicode, and they are right-justified.

There are curly braces around several specific Hebrew letters that indicate a specific amount of "white space" or a new line in the text. I don't want those removed, but I do want the rest of the modern-day punctuation removed.

This code should work but does not:

Code:

w3m -dump -T text/html http://www.mechon-mamre.org/i/t/t0215.htm | sed '1,12d' | head -n -11 | sed -e "s/-/ /g; s/{/A/g; s/}/B/; s/[[:punct:]]//g; s/A/{/g; s/B/}/g" >  ~/tmp/k0215.txt

As can be seen, the final { on the first line is in place, but the subsequent lines have lost it. Testing has revealed that it is the "s/}/B/;" that is failing. It is interesting to note that the order of the curly braces is reversed in the file, as it is actually the left curly brace that is missing.

Or am I missing something?

pan64 · 01-20-2019, 09:52 AM

I do not really understand, but it looks like a g is missing here:

Code:

sed -e "s/-/ /g; s/{/A/g; s/}/B/g; s/[[:punct:]]//g; s/A/{/g; s/B/}/g"
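
To illustrate why the flag matters, here is a minimal sketch run on a made-up sample line (the sample text and the variable names are hypothetical, not taken from the page). Without g, s/}/B/ replaces only the first } on each line, so any later } is still there when [[:punct:]] runs and gets stripped:

```shell
# Hypothetical sample line with two brace pairs (not from the real page).
line='a{x} b{y}, c.'

# Without the g flag only the first } on the line is swapped for B;
# the second } is later deleted by the [[:punct:]] substitution.
without_g=$(printf '%s\n' "$line" | sed -e 's/-/ /g; s/{/A/g; s/}/B/; s/[[:punct:]]//g; s/A/{/g; s/B/}/g')

# With the g flag every } is swapped for B and restored afterwards.
with_g=$(printf '%s\n' "$line" | sed -e 's/-/ /g; s/{/A/g; s/}/B/g; s/[[:punct:]]//g; s/A/{/g; s/B/}/g')

printf '%s\n' "$without_g" "$with_g"
# prints:
# a{x} b{y c
# a{x} b{y} c
```

Note that the A and B placeholders are only safe as long as the text itself contains no Latin A or B, which seems to hold for the Hebrew files here.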

rbees · 01-20-2019, 11:05 AM

Thanks pan64

Somehow I missed that. All better now.