LinuxQuestions.org
Old 04-27-2011, 04:44 AM   #1
XXLRay
Member
 
Registered: May 2010
Posts: 126

Rep: Reputation: 16
sed character replacement


I need to build a dictionary from books, containing only lowercase letters and no special characters. So I wrote a small script that replaces A-Z with a-z and replaces each of . - ! ? " ; , ' < > \n (newline) * [ ] with a space. It works fine for . ! ? " ; , < > * [ but causes problems for ] - \n '

- is ignored when I include it in the bracket expression of the replacement command, and ] breaks the regexp whether or not I escape it. I was able to work around this by processing them separately, but I would prefer as few command calls as possible, because the files to process are complete books and thus rather large.

\n is simply not replaced, and ' causes a syntax error.

Here is the script:
Code:
#!/bin/sh
if [ $# -ne 1 ]
then
  echo "Usage: cleanbook.sh <filename>
  - filename: path to a file"
  exit 1
fi

FILENAME="$1"
CLEANFILENAME="$FILENAME.cleaned"

cp $FILENAME $CLEANFILENAME

cat $CLEANFILENAME
sed -i 's/[A-Z]/\L&/g' $CLEANFILENAME
cat $CLEANFILENAME
sed -i 's/[\.|!|?|"|;|,|<|>|*|\[]/ /g' $CLEANFILENAME
cat $CLEANFILENAME
sed -i 's/\]/ /g' $CLEANFILENAME
cat $CLEANFILENAME
sed -i 's/-/ /g' $CLEANFILENAME
cat $CLEANFILENAME
sed -i 's/\n/ /g' $CLEANFILENAME
cat $CLEANFILENAME
sed -i 's/\'/ /g' $CLEANFILENAME
cat $CLEANFILENAME

exit 0
Here is the output:
Code:
ABCDEFGHIJKLMNOPQRSTUVWXYZ.-!?";,'<>
*[]
abcdefghijklmnopqrstuvwxyz.-!?";,'<>
*[]
abcdefghijklmnopqrstuvwxyz -     '  
  ]
abcdefghijklmnopqrstuvwxyz -     '  
   
abcdefghijklmnopqrstuvwxyz       '  
   
abcdefghijklmnopqrstuvwxyz       '  
   
./cleanbook.sh: line 25: unexpected EOF while looking for matching `''
./cleanbook.sh: line 29: syntax error: unexpected end of file
Any suggestions are appreciated.
 
Old 04-27-2011, 05:01 AM   #2
troop
Member
 
Registered: Feb 2010
Distribution: gentoo, arch, fedora, freebsd
Posts: 379

Rep: Reputation: 96
Code:
sed -i "s/'/ /g" $CLEANFILENAME
works for me.
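For anyone wondering why the double quotes matter: in a POSIX shell nothing is special inside single quotes, not even a backslash. So in 's/\'/ /g' the second ' already closes the string, and the third one opens a new string that never ends, which is where the "unexpected EOF" message comes from. A minimal sketch of two ways around it (the function name strip_apostrophes is made up for illustration):

```shell
#!/bin/sh
# Option 1: wrap the sed program in double quotes, so ' needs no escaping.
strip_apostrophes() {
  sed "s/'/ /g"
}

printf '%s\n' "don't stop" | strip_apostrophes
# prints: don t stop

# Option 2: stay in single quotes, but close them, add an escaped ',
# and reopen them. sed receives exactly the same program: s/'/ /g
printf '%s\n' "don't stop" | sed 's/'\''/ /g'
# prints: don t stop
```

Either form hands sed the same three-part s command; the only difference is how the shell is told where the string ends.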
 
Old 04-27-2011, 05:01 AM   #3
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
Try:

Code:
sed -i "s/'/ /g" $CLEANFILENAME
EDIT:
troop beat me to it
 
Old 04-27-2011, 05:19 AM   #4
XXLRay
Member
 
Registered: May 2010
Posts: 126

Original Poster
Rep: Reputation: 16
Cool, the double quotes helped a lot (though I don't understand why). Unfortunately, \n is still completely ignored and ] still has to be processed separately.

The script now is:
Code:
#!/bin/sh
if [ $# -ne 1 ]
then
  echo "Usage: cleanbook.sh <filename>
  - filename: path to a file"
  exit 1
fi

FILENAME="$1"
CLEANFILENAME="$FILENAME.cleaned"

cp $FILENAME $CLEANFILENAME

cat $CLEANFILENAME
sed -i 's/[A-Z]/\L&/g' $CLEANFILENAME
cat $CLEANFILENAME
sed -i "s/[-|\.|!|?|\"|;|,|<|>|*|\[|']/ /g" $CLEANFILENAME
cat $CLEANFILENAME
sed -i 's/\]/ /g' $CLEANFILENAME
cat $CLEANFILENAME
sed -i "s/\n/ /g" $CLEANFILENAME
cat $CLEANFILENAME

exit 0
The output is (note that I appended "end" to the test file):
Code:
ABCDEFGHIJKLMNOPQRSTUVWXYZ.-!?";,'<>
*[]end
abcdefghijklmnopqrstuvwxyz.-!?";,'<>
*[]end
abcdefghijklmnopqrstuvwxyz          
  ]end
abcdefghijklmnopqrstuvwxyz          
   end
abcdefghijklmnopqrstuvwxyz          
   end
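
A note on why ] and - misbehave, as far as POSIX bracket expressions go: inside [...] the | is not an "or" and the backslash is not an escape, so both are ordinary members of the set (GNU sed additionally recognizes a few sequences such as \n and \t there). A ] is taken literally only when it is the first character of the set, and a - only when it is first or last. The \n case is different: sed reads its input line by line and removes the trailing newline before running the commands, so s/\n/ /g has nothing to match. The sketch below puts the punctuation from the first post into a single bracket expression. The function name clean_punct is invented here, and the '\'' sequence is just one way to get a literal apostrophe into a single-quoted string.

```shell
#!/bin/sh
# One sed call for all the punctuation:
#   ]  comes first in the set, so it is literal
#   [  is followed by !, not by . : or =, so it does not start a [. [: [= construct
#   -  comes last in the set, so it is literal and not a range
clean_punct() {
  sed 's/[][!?.";,<>*'\''-]/ /g'
}

printf '%s\n' 'a.b-c[d]e|f' | clean_punct
# prints: a b c d e|f
```

Note that | survives here, whereas the versions with | between the characters also turn every | (and every backslash) into a space.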
 
Old 04-27-2011, 06:10 AM   #5
XXLRay
Member
 
Registered: May 2010
Posts: 126

Original Poster
Rep: Reputation: 16
The following script is not what I originally wanted, but it is fast enough for my purposes:
Code:
#!/bin/sh
if [ $# -ne 1 ]
then
  echo "Usage: cleanbook.sh <filename>
  - filename: path to a file"
  exit 1
fi

FILENAME="$1"
CLEANFILENAME="$FILENAME.cleaned"

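# Work on a copy; quote the names so paths containing spaces survive.
cp "$FILENAME" "$CLEANFILENAME"

# Lowercase A-Z (\L is a GNU sed extension).
sed -i 's/[A-Z]/\L&/g' "$CLEANFILENAME"
# Punctuation to spaces. Inside [...] the | and \ are literal, so they are replaced too.
sed -i "s/[-|:|\.|!|?|\"|;|,|<|>|*|\[|'|(|)]/ /g" "$CLEANFILENAME"
sed -i 's/\]/ /g' "$CLEANFILENAME"
# Join all lines; the unquoted $(...) also collapses runs of spaces.
echo $(tr '\n' ' ' <"$CLEANFILENAME") > "$CLEANFILENAME"
cat "$CLEANFILENAME"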

exit 0
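
The echo $(tr ...) > file line works here because the shell expands the command substitution before it truncates the output file. It does hold the whole book in a single shell word, though, and the unquoted substitution collapses runs of spaces and would expand any leftover glob characters. A hedged alternative is to write to a temporary file and move it into place. join_lines is a hypothetical helper name, and mktemp is assumed to be available (it is common on Linux but not specified by POSIX).

```shell
#!/bin/sh
# join_lines FILE: replace every newline in FILE with a space.
# The result is written to a temporary file first, so FILE is never
# read and truncated at the same time.
join_lines() {
  tmp=$(mktemp) || return 1
  # mv replaces FILE with the temporary file only if tr succeeded.
  tr '\n' ' ' < "$1" > "$tmp" && mv "$tmp" "$1"
}

# Usage (the file name is only an example):
#   join_lines book.txt.cleaned
```

Unlike the echo version, this keeps consecutive spaces and leaves the file ending in a space rather than a newline.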
 
Old 04-27-2011, 07:03 AM   #6
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
What I would do for a problem like this is use the negation operator '^' inside a bracket expression, so that you list the characters to keep instead of the ones to remove. For example:

Code:
tr '[:upper:]' '[:lower:]' < file | tr '[:space:]' ' ' | sed 's|[^_a-z]| |g'
This converts upper case to lower case, converts white space (including \n) to spaces, and then converts every remaining character except a-z and _ to a space. The character classes are quoted so that the shell cannot expand them as glob patterns.
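
The same idea can be sketched without sed, since tr can complement a set with -c. This is only a variation on the pipeline above, not a replacement for it: clean_text is an illustrative name, LC_ALL=C is set on the assumption that the books are plain ASCII, and the one-character second set relies on tr padding it to the length of the first (GNU tr does this).

```shell
#!/bin/sh
# clean_text: read text on stdin, write it lowercased with every
# character other than a-z and _ (newlines included) turned into a space.
clean_text() {
  LC_ALL=C tr '[:upper:]' '[:lower:]' | LC_ALL=C tr -c '_a-z' ' '
}

# Usage (file names are only examples):
#   clean_text < book.txt > book.txt.cleaned
```

Because the newlines become spaces too, the output is a single line with no trailing newline.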
 
Old 04-27-2011, 08:35 AM   #7
XXLRay
Member
 
Registered: May 2010
Posts: 126

Original Poster
Rep: Reputation: 16
Cool - I like that (and it works).
 
  

