LinuxQuestions.org - [SOLVED] sed character replacement

sed character replacement

I need to build a dictionary from books, containing only lowercase letters and no special characters. So I wrote a small script that replaces A-Z with a-z and replaces each of . - ! ? " ; , ' < > \n (newline) * [ ] with a space. It works for . ! ? " ; , < > * [ but fails for ] - \n '

The - is ignored when included in the first command, and ] breaks the regexp whether or not I escape it. I was able to work around this by processing them separately, but I would like as few command calls as possible, because the files to process are complete books and thus rather large.

\n is simply not replaced and ' causes a syntax error.

Here comes the script:

Code:

#!/bin/sh

if [ $# -ne 1 ]

then

  echo "Usage: cleanbook.sh <filename>

  - filename: path to a file"

  exit 1

fi



FILENAME="$1"

CLEANFILENAME="$FILENAME.cleaned"



cp $FILENAME $CLEANFILENAME



cat $CLEANFILENAME

sed -i 's/[A-Z]/\L&/g' $CLEANFILENAME

cat $CLEANFILENAME

sed -i 's/[\.|!|?|"|;|,|<|>|*|\[]/ /g' $CLEANFILENAME

cat $CLEANFILENAME

sed -i 's/\]/ /g' $CLEANFILENAME

cat $CLEANFILENAME

sed -i 's/-/ /g' $CLEANFILENAME

cat $CLEANFILENAME

sed -i 's/\n/ /g' $CLEANFILENAME

cat $CLEANFILENAME

sed -i 's/\'/ /g' $CLEANFILENAME

cat $CLEANFILENAME



exit 0

Here is the output:

Code:

ABCDEFGHIJKLMNOPQRSTUVWXYZ.-!?";,'<>

*[]

abcdefghijklmnopqrstuvwxyz.-!?";,'<>

*[]

abcdefghijklmnopqrstuvwxyz -    '  

  ]

abcdefghijklmnopqrstuvwxyz -    '  

  

abcdefghijklmnopqrstuvwxyz      '  

  

abcdefghijklmnopqrstuvwxyz      '  

  

./cleanbook.sh: line 25: unexpected EOF while looking for matching `''

./cleanbook.sh: line 29: syntax error: unexpected end of file

Any suggestions are appreciated.

Code:

sed -i "s/'/ /g" $CLEANFILENAME

works for me.

Try:

Code:

sed -i "s/\'/ /g" $CLEANFILENAME

EDIT:
troop beat me to it :)
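
Why the double quotes help: inside a single-quoted shell string a backslash is not an escape character, so in 's/\'/ /g' the second ' already ends the string and the third one opens a new, unterminated string. That is where the "unexpected EOF while looking for matching `''" message comes from. Inside double quotes the apostrophe is an ordinary character. Also note that the GNU sed manual lists \' as an end-of-buffer anchor, so the unescaped form from the first reply is probably the safer of the two. A minimal sketch, using a made-up sample string and illustrative variable names:

```shell
#!/bin/sh
# Sample input, invented for this illustration.
sample="don't stop"

# Option 1: double-quote the sed script so the apostrophe is an ordinary character.
out1=$(printf '%s\n' "$sample" | sed "s/'/ /g")

# Option 2: stay in single quotes and splice in an apostrophe with '\''
# (close the quote, add a backslash-escaped apostrophe, reopen the quote).
out2=$(printf '%s\n' "$sample" | sed 's/'\''/ /g')

printf '%s\n' "$out1" "$out2"
```

Both variants hand sed the same script, s/'/ /g, so which one to use is mostly a matter of readability.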

Cool - the double quotes helped a lot (I don't understand why). Unfortunately \n is still completely ignored and ] still needs to be processed separately.

The script now is:

Code:

#!/bin/sh

if [ $# -ne 1 ]

then

  echo "Usage: cleanbook.sh <filename>

  - filename: path to a file"

  exit 1

fi



FILENAME="$1"

CLEANFILENAME="$FILENAME.cleaned"



cp $FILENAME $CLEANFILENAME



cat $CLEANFILENAME

sed -i 's/[A-Z]/\L&/g' $CLEANFILENAME

cat $CLEANFILENAME

sed -i "s/[-|\.|!|?|\"|;|,|<|>|*|\[|']/ /g" $CLEANFILENAME

cat $CLEANFILENAME

sed -i 's/\]/ /g' $CLEANFILENAME

cat $CLEANFILENAME

sed -i "s/\n/ /g" $CLEANFILENAME

cat $CLEANFILENAME



exit 0

The output is (note that I appended "end" to the test file):

Code:

ABCDEFGHIJKLMNOPQRSTUVWXYZ.-!?";,'<>

*[]end

abcdefghijklmnopqrstuvwxyz.-!?";,'<>

*[]end

abcdefghijklmnopqrstuvwxyz          

  ]end

abcdefghijklmnopqrstuvwxyz          

  end

abcdefghijklmnopqrstuvwxyz          

  end
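
A note on why ] and - misbehave inside [...]: in a POSIX bracket expression the | is not an "or" but just another character in the set, and a backslash does not escape anything there. A ] is only taken literally when it is the first character of the set, and a - only when it is first or last. So the separate call for ] should not be needed. A minimal sketch, assuming a test line modelled on the sample data above (the variable names are only for illustration):

```shell
#!/bin/sh
# Test line modelled on the sample data in the thread.
input="abc.-!?\";,'<>*[]end"

# ']' comes first so it is taken literally, '-' comes last so it is not a range,
# and the remaining characters need neither '|' separators nor backslashes.
# The \" is only there for the shell; sed receives a plain double quote.
cleaned=$(printf '%s\n' "$input" | sed "s/[].!?\";,'<>*[-]/ /g")

printf '%s\n' "$cleaned"
```

If the case conversion should run in the same process, sed accepts several -e expressions in one invocation, which avoids rereading a large file.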

The following script is not what I originally wanted, but it is fast enough for my purpose:

Code:

#!/bin/sh

if [ $# -ne 1 ]

then

  echo "Usage: cleanbook.sh <filename>

  - filename: path to a file"

  exit 1

fi



FILENAME="$1"

CLEANFILENAME="$FILENAME.cleaned"



cp "$FILENAME" "$CLEANFILENAME"



sed -i 's/[A-Z]/\L&/g' "$CLEANFILENAME"

sed -i "s/[-|:|\.|!|?|\"|;|,|<|>|*|\[|'|(|)]/ /g" "$CLEANFILENAME"

sed -i 's/\]/ /g' "$CLEANFILENAME"

echo $(tr '\n' ' ' < "$CLEANFILENAME") > "$CLEANFILENAME"

cat "$CLEANFILENAME"



exit 0
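
As for why s/\n/ /g never matched: sed reads its input one line at a time and removes the trailing newline before the script runs, so the pattern space normally contains no newline to match. tr works on the raw character stream instead, which is why it succeeds here. A minimal sketch of the difference, assuming a hypothetical two-line sample written to a temporary file:

```shell
#!/bin/sh
# Hypothetical two-line sample written to a temporary file for the demonstration.
tmp=$(mktemp)
printf 'first line\nsecond line\n' > "$tmp"

# sed strips the newline before matching, so the line count stays the same.
sed_lines=$(sed 's/\n/ /g' "$tmp" | wc -l)

# tr sees the raw stream, so every newline becomes a space.
joined=$(tr '\n' ' ' < "$tmp")

printf 'lines after sed: %s\n' "$sed_lines"
printf 'after tr: %s\n' "$joined"
rm -f "$tmp"
```

Writing the tr output to a separate file and renaming it afterwards is generally safer than reading from and redirecting into the same file in one command.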

For a problem like this I would use the inverse operator '^', which negates a character set. For example:

Code:

This converts upper case to lower case, converts white space (including \n) to a space, and then any other character to a space, except for a-z and _.
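
The code from this reply did not survive in the thread, so the following is only a sketch of the idea as described, not the poster's actual command. It assumes a negated set [^a-zA-Z_] in sed followed by tr for case folding and newlines; clean_text is a hypothetical helper name.

```shell
#!/bin/sh
# clean_text is a hypothetical helper; it reads stdin and writes stdout.
clean_text() {
    # sed: the leading ^ negates the set, so every character that is not
    #      a letter or an underscore is replaced by a space.
    # tr:  folds A-Z to a-z and turns each newline into a space.
    sed 's/[^a-zA-Z_]/ /g' | tr 'A-Z\n' 'a-z '
}

result=$(printf 'Hello, World!\n[Second-Line]\n' | clean_text)
printf '%s\n' "$result"
```

The advantage of the negated set is that the characters to keep are listed once, so there is no need to enumerate every punctuation mark that might appear in a book.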

Cool - I like that (and it works).