Command to delete words out of a text file.

Dazamondo · 06-30-2009, 07:19 AM

Hi

I am trying to delete whole words out of a text file that contain certain characters such as ø. I have tried many different ways such as sed and I have also been advised to use perl but I have not had any luck.

Anyone got any ideas or code that I could try to do this?

Cheers
Daz

zhjim · 06-30-2009, 07:35 AM

What about the tr command?

Code:

NAME
       tr - translate or delete characters

SYNOPSIS
       tr [OPTION]... SET1 [SET2]

DESCRIPTION
       Translate, squeeze, and/or delete characters from standard input, writing to standard output.
......

onebuck · 06-30-2009, 07:42 AM

Hi,

Welcome to LQ!

Show us what you have tried.

Dazamondo · 06-30-2009, 07:58 AM

I can't use tr, as I actually want to delete the whole word that has the character in it rather than just the character. I did use tr when I wanted single characters removed.
I have tried many different sed commands, such as cat filename | sed 's/[:allnum:]ø[:allnum:]//g' > newfilename. I have asked some other people and they say it can't be done with sed and needs perl, but I was very confused when trying perl and got nowhere, to be honest.

I'm very new to this and my commands are probably totally wrong but any help or advice will do.

zhjim · 06-30-2009, 08:09 AM

I don't see the reason why it should not be done with sed. But who knows. Can you provide a snippet of the text you want to clear?
Just as a note, the sed script you used would only remove numbers around the ø (o with slash).
Here is my version, if this is the desired behavior:

Code:

sed -e 's/[0-9]ø[0-9]//g' filename > newfile

or

Code:

sed --in-place -e 's/[0-9]ø[0-9]//g' filename

--in-place performs the substitution inside the original file, so only use it if you are sure it does what you want.

colucix · 06-30-2009, 08:22 AM

Quote:

Originally Posted by Dazamondo

cat filename | sed 's/[:allnum:]ø[:allnum:]//g' > newfilename.

Your sed command should be:

Code:

sed -i.bck 's/[[:alnum:]]ø[[:alnum:]]//g' filename

You can also use a character list in place of ø to include all the characters you want to match in one go. Note that -i.bck edits the file in place and makes a backup copy of the original file with the suffix .bck appended.

Dazamondo · 06-30-2009, 09:03 AM

Oh right, I thought [:alnum:] was all letters and digits. No, it still doesn't seem to like it. Some snippets from the text file are shown below:

able
adlød
adele
administration
administer
Aalbørg

For example I would want the command to delete adlød and Aalbørg as they have the ø symbol. Thanks for your help guys.

colucix · 06-30-2009, 09:13 AM

Ok. I forgot to put asterisks to match against any number of alpha-numeric characters. If the file is made of lines containing a single word, you can use the delete command of sed to remove the whole line:

Code:

$ cat testfile
able
adlød
adele
administration
administer
Aalbørg
$ sed '/[[:alnum:]]*ø[[:alnum:]]*/d' testfile
able
adele
administration
administer
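
Deleting the whole line only suits files with one word per line. If a line held several words, a substitution could remove just the words that contain ø. The following is a minimal sketch, assuming words are separated by whitespace; the sample file and the temporary directory are made up for illustration and are not from the original poster's data.

```shell
# Hypothetical sample: several words per line, some containing ø.
tmpdir=$(mktemp -d)
printf 'able adlød adele\nadminister Aalbørg\n' > "$tmpdir/sample.txt"

# Delete every whitespace-delimited word that contains ø.
# [^[:space:]]* matches the remainder of the word on either side of ø.
sed 's/[^[:space:]]*ø[^[:space:]]*//g' "$tmpdir/sample.txt" > "$tmpdir/cleaned.txt"

cat "$tmpdir/cleaned.txt"
```

The spaces that surrounded each removed word are left behind, so a follow-up pass may be wanted to squeeze repeated spaces or trim the ends of lines.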

xxloaf · 06-30-2009, 09:24 AM

Quote:

Originally Posted by Dazamondo

able
adlød
adele
administration
administer
Aalbørg

Does the file you are trying to edit have all the words on separate lines like this?

If so just use grep to get those words out

Code:

grep -v "ø" file.txt > newfile.txt

colucix · 06-30-2009, 09:26 AM

Quote:

Originally Posted by xxloaf

Does the file you are trying to edit have all the words on separate lines like this?

If so just use grep to get those words out

Code:

grep -v "ø" file.txt > newfile.txt

xxloaf, you hit the nail on the head! I was going to correct my post to suggest this simple solution.

Dazamondo · 06-30-2009, 09:41 AM

I have used both of those commands before. I originally thought grep would be easiest (and used the same command you suggested), but nothing works on this file. I have just tried the commands on a smaller text file and they work great; however, on the one I need them to work on, they don't. Could it be due to the number of words that the file holds, one million or so?

Cheers

colucix · 06-30-2009, 09:44 AM

Nope. These commands parse the file line by line, and the number of lines does not make any difference. What do you mean by "they don't work"? Did you get an error message? Or just not the desired result?

Dazamondo · 06-30-2009, 09:48 AM

Like I say, both of the suggested commands work fine on a smaller file (the words that need to be removed are removed). However, on the big file the command completes with no errors, but when I open the file it still has the words that should have been removed. That is why I am so confused: on the small file it works, but the same command doesn't give the desired result on the big file.

xxloaf · 06-30-2009, 10:48 AM

Hmm are you sure you are using the correct character in your command?

If you just do

Code:

grep "ø" file.txt

Note the removal of the "-v"

Does the command return any results? If not, you need to be sure you are grabbing the correct character; those weird non-ASCII characters can be tricky at times.
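
One possible explanation, which the thread does not confirm, is that the big file stores ø in a different encoding from the one the terminal uses when the command is typed. In UTF-8, ø is the two bytes C3 B8; in ISO-8859-1 it is the single byte F8. The sketch below builds two small hypothetical files to show how the raw bytes can be inspected and how a byte-level search behaves on each. The file names are illustrative, and treating the big file as ISO-8859-1 is an assumption.

```shell
# Hypothetical samples: the same word stored in two encodings.
tmpdir=$(mktemp -d)
printf 'adl\303\270d\n' > "$tmpdir/utf8.txt"    # ø as UTF-8 bytes C3 B8
printf 'adl\370d\n'     > "$tmpdir/latin1.txt"  # ø as ISO-8859-1 byte F8

# Dump the raw bytes of each file to see how ø is actually stored.
od -An -tx1 "$tmpdir/utf8.txt"
od -An -tx1 "$tmpdir/latin1.txt"

# Search for the UTF-8 byte sequence explicitly. LC_ALL=C makes grep
# compare raw bytes instead of decoding characters.
pattern=$(printf '\303\270')
utf8_hits=$(LC_ALL=C grep -c "$pattern" "$tmpdir/utf8.txt" || true)
latin1_hits=$(LC_ALL=C grep -c "$pattern" "$tmpdir/latin1.txt" || true)
echo "utf8 matches: $utf8_hits, latin1 matches: $latin1_hits"

# Assumption: the file is ISO-8859-1. Convert it to UTF-8 and search again.
iconv -f ISO-8859-1 -t UTF-8 "$tmpdir/latin1.txt" > "$tmpdir/converted.txt"
converted_hits=$(LC_ALL=C grep -c "$pattern" "$tmpdir/converted.txt" || true)
echo "converted matches: $converted_hits"
```

If the byte dump of the real file shows F8 where ø should be, converting the file first (or typing the pattern in the file's encoding) should let the earlier grep -v and sed commands match.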