I am trying to remove most, but not all, punctuation from a series of text files. The files are in Hebrew, which I assume means they are Unicode, and they are right-justified.
There are curly braces around several specific Hebrew letters that indicate a specific amount of "white space" or a new line in the text. I don't want those removed, but I do want the rest of the modern-day punctuation removed.
This code should work but does not:
Code:
w3m -dump -T text/html http://www.mechon-mamre.org/i/t/t0215.htm | sed '1,12d' | head -n -11 | sed -e "s/-/ /g; s/{/A/g; s/}/B/; s/[[:punct:]]//g; s/A/{/g; s/B/}/g" > ~/tmp/k0215.txt
As can be seen, the final { on the first line is in place, but the subsequent lines have lost it. Testing has revealed that it is the "s/}/B/" step that is failing. It is interesting to note that the order of the curly braces is reversed in the file, so it is actually the left curly brace that is missing.
Or am I missing something?
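
To narrow it down without the download step, here is a minimal reproduction, assuming a POSIX shell and sed. The function names strip_original and strip_global and the ASCII sample line are made up for illustration; they are not part of the real Hebrew data. The first function uses exactly the sed expression from the pipeline above. The second differs only in adding the g flag to the "s/}/B/" step, which is the one place where the original expression replaces only the first match on a line. Whether that fully explains the output I am seeing is not confirmed.

```shell
# Hypothetical helpers for comparison; the sample text is made-up ASCII,
# chosen to avoid the letters A and B that serve as placeholders.

# Exact sed expression from the pipeline above: "s/}/B/" has no g flag,
# so only the first } on each line becomes B before punctuation is stripped.
strip_original() {
  sed -e "s/-/ /g; s/{/A/g; s/}/B/; s/[[:punct:]]//g; s/A/{/g; s/B/}/g"
}

# Same expression, with g added to the } substitution only.
strip_global() {
  sed -e "s/-/ /g; s/{/A/g; s/}/B/g; s/[[:punct:]]//g; s/A/{/g; s/B/}/g"
}

# A line containing two brace pairs plus ordinary punctuation.
sample='x {p} y, z {s}.'

printf '%s\n' "$sample" | strip_original   # prints: x {p} y z {s
printf '%s\n' "$sample" | strip_global     # prints: x {p} y z {s}
```

With the original expression, the second } on the line is still a literal } when the [[:punct:]] step runs, so it is deleted along with the comma and the full stop.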