LinuxQuestions.org

XXLRay 04-27-2011 03:44 AM

sed character replacement
 
I need to build a dictionary from books, containing only lowercase letters and no special characters. So I wrote a small script that replaces A-Z with a-z and replaces all of . - ! ? " ; , ' < > \n (newline) * [ ] with spaces. It works fine for . ! ? " ; , < > * [ but causes problems for ] - \n '

- is ignored when included in the first command, and ] breaks the regexp whether I escape it or not. I was able to work around this by processing them separately, but I would rather have as few command calls as possible, since the files to process are complete books and thus rather large.

\n is simply not replaced and ' causes a syntax error.

Here comes the script:
Code:

#!/bin/sh
if [ $# -ne 1 ]
then
  echo "Usage: cleanbook.sh <filename>
  - filename: path to a file"
  exit 1
fi

FILENAME="$1"
CLEANFILENAME="$FILENAME.cleaned"

cp $FILENAME $CLEANFILENAME

cat $CLEANFILENAME
sed -i 's/[A-Z]/\L&/g' $CLEANFILENAME
cat $CLEANFILENAME
sed -i 's/[\.|!|?|"|;|,|<|>|*|\[]/ /g' $CLEANFILENAME
cat $CLEANFILENAME
sed -i 's/\]/ /g' $CLEANFILENAME
cat $CLEANFILENAME
sed -i 's/-/ /g' $CLEANFILENAME
cat $CLEANFILENAME
sed -i 's/\n/ /g' $CLEANFILENAME
cat $CLEANFILENAME
sed -i 's/\'/ /g' $CLEANFILENAME
cat $CLEANFILENAME

exit 0

Here is the output:
Code:

ABCDEFGHIJKLMNOPQRSTUVWXYZ.-!?";,'<>
*[]
abcdefghijklmnopqrstuvwxyz.-!?";,'<>
*[]
abcdefghijklmnopqrstuvwxyz -    ' 
  ]
abcdefghijklmnopqrstuvwxyz -    ' 
 
abcdefghijklmnopqrstuvwxyz      ' 
 
abcdefghijklmnopqrstuvwxyz      ' 
 
./cleanbook.sh: line 25: unexpected EOF while looking for matching `''
./cleanbook.sh: line 29: syntax error: unexpected end of file

Any suggestions are appreciated.
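
A note on the bracket expression in the script above: inside [...] the | is not an alternation operator but a literal character, so that expression also replaces any | in the text, and in POSIX bracket expressions the backslash does not escape anything. To be taken literally, ] has to be the first character after the opening [ and - has to be first or last (anywhere else it denotes a range). A minimal sketch of a single expression built that way, using a made-up sample string:

```shell
# Made-up sample line containing every character that should become a space.
sample="a.b-c!d?e\"f;g,h'i<j>k*l[m]n"

# ']' comes first so it is literal, '-' comes last so it is not a range
# operator, and there are no '|' separators or backslashes inside [...].
# The sed script is in double quotes so that it can contain a single quote.
cleaned=$(printf '%s\n' "$sample" | sed "s/[].!?\";,'<>*[-]/ /g")

printf '%s\n' "$cleaned"
```

This only covers the punctuation; the newline is a separate matter because sed works line by line.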

troop 04-27-2011 04:01 AM

Code:

sed -i "s/'/ /g" $CLEANFILENAME
works for me.

H_TeXMeX_H 04-27-2011 04:01 AM

Try:

Code:

sed -i "s/'/ /g" $CLEANFILENAME
EDIT:
troop beat me to it :)
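
Why the double quotes make a difference: inside single quotes the shell takes every character literally, including the backslash, so \' does not escape the quote. The ' right after the backslash ends the string, and the last ' on the line opens a new string that is never closed, which is what the "unexpected EOF while looking for matching" error reports. Two ways to pass a literal single quote to sed, as a small sketch with a made-up test string:

```shell
text="don't stop"

# Variant 1: put the whole sed script in double quotes.
with_double=$(printf '%s\n' "$text" | sed "s/'/ /g")

# Variant 2: stay in single quotes; close them, add an escaped quote, reopen.
with_single=$(printf '%s\n' "$text" | sed 's/'\''/ /g')

printf '%s\n' "$with_double" "$with_single"
```

No backslash is needed before the quote in the double-quoted form. In GNU sed the sequence \' inside a regex is an anchor for the end of the pattern space, so it is safer to leave the backslash out.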

XXLRay 04-27-2011 04:19 AM

Cool - the double quotes helped a lot (I don't understand why). Unfortunately \n is still completely ignored and ] still needs to be processed separately.

The script now is:
Code:

#!/bin/sh
if [ $# -ne 1 ]
then
  echo "Usage: cleanbook.sh <filename>
  - filename: path to a file"
  exit 1
fi

FILENAME="$1"
CLEANFILENAME="$FILENAME.cleaned"

cp $FILENAME $CLEANFILENAME

cat $CLEANFILENAME
sed -i 's/[A-Z]/\L&/g' $CLEANFILENAME
cat $CLEANFILENAME
sed -i "s/[-|\.|!|?|\"|;|,|<|>|*|\[|']/ /g" $CLEANFILENAME
cat $CLEANFILENAME
sed -i 's/\]/ /g' $CLEANFILENAME
cat $CLEANFILENAME
sed -i "s/\n/ /g" $CLEANFILENAME
cat $CLEANFILENAME

exit 0

The output is (note that I appended "end" to the test file):
Code:

ABCDEFGHIJKLMNOPQRSTUVWXYZ.-!?";,'<>
*[]end
abcdefghijklmnopqrstuvwxyz.-!?";,'<>
*[]end
abcdefghijklmnopqrstuvwxyz         
  ]end
abcdefghijklmnopqrstuvwxyz         
  end
abcdefghijklmnopqrstuvwxyz         
  end
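
The \n substitution has no effect because sed reads its input one line at a time and removes the trailing newline before running the script, so the pattern space never contains a newline to match. Two common workarounds, sketched here on a made-up three-line input: tr, which is not line-oriented, and a sed loop that first appends all following lines to the pattern space with N.

```shell
# Variant 1: tr translates characters and does not care about lines.
# The final newline also becomes a space.
joined_tr=$(printf 'first\nsecond\nthird\n' | tr '\n' ' ')

# Variant 2: sed loop. ':a' is a label, 'N' appends the next input line
# (with its newline) to the pattern space, '$!ba' jumps back to the label
# unless the last line has been read, then the substitution sees the newlines.
joined_sed=$(printf 'first\nsecond\nthird\n' |
  sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g')

printf '[%s]\n[%s]\n' "$joined_tr" "$joined_sed"
```

The sed variant holds the whole file in memory, so for complete books tr is probably the lighter choice.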


XXLRay 04-27-2011 05:10 AM

The following script is not what I originally wanted, but it is fast enough for my purpose:
Code:

#!/bin/sh
if [ $# -ne 1 ]
then
  echo "Usage: cleanbook.sh <filename>
  - filename: path to a file"
  exit 1
fi

FILENAME="$1"
CLEANFILENAME="$FILENAME.cleaned"

cp "$FILENAME" "$CLEANFILENAME"

sed -i 's/[A-Z]/\L&/g' "$CLEANFILENAME"
sed -i "s/[-|:|\.|!|?|\"|;|,|<|>|*|\[|'|(|)]/ /g" "$CLEANFILENAME"
sed -i 's/\]/ /g' "$CLEANFILENAME"
echo $(tr '\n' ' ' <"$CLEANFILENAME") > "$CLEANFILENAME"
cat "$CLEANFILENAME"

exit 0
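
For comparison, the same cleanup can be sketched as one pipeline that makes a single pass over the data and writes to standard output, so the input file is never read and overwritten by the same command. The function name clean_book is made up for this illustration; the character list is the one from the script above, with ] placed first and - placed last in the bracket expression so that one sed call covers everything.

```shell
# clean_book is a hypothetical helper: it reads the file named in $1 and
# writes the cleaned text to standard output.
clean_book() {
  # 1. lowercase, 2. punctuation to spaces, 3. newlines to spaces
  tr '[:upper:]' '[:lower:]' < "$1" |
    sed "s/[]:.!?\";,'<>*()[-]/ /g" |
    tr '\n' ' '
}

# Possible usage:
#   clean_book book.txt > book.txt.cleaned
```

Unlike the echo version, this keeps runs of spaces as they are and does not end the output with a newline.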


H_TeXMeX_H 04-27-2011 06:03 AM

What I would do for such a problem is to use the negation operator '^' inside the bracket expression, which matches every character that is not listed. For example:

Code:

cat file | tr '[:upper:]' '[:lower:]' | tr '[:space:]' ' ' | sed 's|[^_a-z]| |g'
This converts upper case to lower case, converts white space (including \n) to a space, and then any other character to a space, except for a-z and _.
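
The same complement idea can probably be expressed with tr alone, since tr -c translates every character that is not in the given set. A minimal sketch on a made-up two-line input (the character classes are quoted so the shell cannot expand them as filename patterns):

```shell
# Lowercase first, then turn everything except a-z and _ into a space.
# The newlines are outside the set, so they become spaces as well.
words=$(printf 'Hello, World!\nsnake_case [ok]\n' |
  tr '[:upper:]' '[:lower:]' |
  tr -c 'a-z_' ' ')

printf '%s\n' "$words"
```

This assumes plain ASCII text; how accented letters are treated depends on the locale and the tr implementation.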

XXLRay 04-27-2011 07:35 AM

Cool - I like that (and it works).


All times are GMT -5.