LinuxQuestions.org
Old 11-07-2010, 04:59 PM   #1
hattori.hanzo
Member
 
Registered: Aug 2006
Posts: 168

Rep: Reputation: 15
Remove all lines containing extended characters


I am using 'sed -e /foo/d' to match lines which I want to delete from a file.

I discovered I have some lines which contain random (extended?) characters like 'è¨á»§äµäµ' which I would also like to delete.

The lines in the file should only contain alpha numeric characters.

Thanks & Regards
 
Old 11-07-2010, 05:19 PM   #2
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 556
Perhaps:
Code:
sed '/[^a-z0-9A-Z ]/d'
# or:
sed '/[^[:alnum:] ]/d'
?

Note I left the "space" character there too - remove it if you don't want spaces either.

Last edited by GrapefruiTgirl; 11-07-2010 at 05:26 PM.
 
Old 11-07-2010, 08:15 PM   #3
hattori.hanzo
Member
 
Registered: Aug 2006
Posts: 168

Original Poster
Rep: Reputation: 15
Thanks for the response. I probably didn't make it clear in the first post. Here is a sample:

Code:
2010-11-07 13:03:56,2347985439,27437985441,SOA,com,domain,_sites,j,_tcp,
2010-11-07 13:03:56,2347985439,27437985441,A,com,domain,host4,,,
2010-11-07 13:03:57,2787984329,27437985441,SOA,com,domain,_sites,è¨á»§äµäµ                       _tcp,
2010-11-07 13:03:57,2787444439,27437985441,A,com,domain,host2,,,                     ï¼host,
2010-11-07 13:03:57,2780005439,27437985441,A,com,host9,ab2,,,,com,הּ࡫èủâºïª·Å
2010-11-07 13:04:01,2787843905,27437985441,A,com,host,us,host6,,
Thanks & Regards,
 
Old 11-07-2010, 08:35 PM   #4
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 556
OK, so the only difference is that you may have punctuation in lines that you wish to keep?
Code:
sed '/[^[:alnum:][:punct:] ]/d'
This should be one way of working around it - so if a line contains stuff other than alphanumeric, spaces, or punctuation, it gets removed.

Does this work?
 
Old 11-09-2010, 12:38 AM   #5
hattori.hanzo
Member
 
Registered: Aug 2006
Posts: 168

Original Poster
Rep: Reputation: 15
Hi, tested it but unfortunately not...

Code:
[root@host1 tmp]$ cat test.txt
2010-11-07 13:03:56,2347985439,27437985441,SOA,com,domain,_sites,j,_tcp,
2010-11-07 13:03:56,2347985439,27437985441,A,com,domain,host4,,,
2010-11-07 13:03:57,2787984329,27437985441,SOA,com,domain,_sites,軧äµ
_tcp,
2010-11-07 13:03:57,2787444439,27437985441,A,com,domain,host2,,,ïost,
2010-11-07 13:03:57,2780005439,27437985441,A,com,host9,ab2,,,,com,הּ࡫è§âª·Ã
2010-11-07
13:04:01,2
[root@host1 tmp]$ sed '/[^[:alnum:][:punct:] ]/d' test.txt
2010-11-07 13:03:56,2347985439,27437985441,SOA,com,domain,_sites,j,_tcp,
2010-11-07 13:03:56,2347985439,27437985441,A,com,domain,host4,,,
2010-11-07 13:03:57,2787984329,27437985441,SOA,com,domain,_sites,軧äµ
_tcp,
2010-11-07 13:03:57,2787444439,27437985441,A,com,domain,host2,,,ïost,
2010-11-07 13:03:57,2780005439,27437985441,A,com,host9,ab2,,,,com,הּ࡫è§âª·Ã
2010-11-07
13:04:01,2
Thanks & Regards
 
Old 11-09-2010, 05:00 AM   #6
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Hi,

unfortunately I do not see an easy way to accomplish this task with sed.
The problem is that sed does not "see" the weird characters. Instead it sees the bytes that encode them, which can be written in octal. I do not know whether it is possible to define "octal" character ranges in sed; nothing I have tried so far has worked. So you will have to check for the octal representation of the "funny" characters and remove them like:
Code:
sed -r '/\o302/ d' file
If a line has "funny" characters whose octal value is not 302, that line probably won't get removed. You will have to list the other values as well, like:
Code:
sed -r '/\o302|\o203|\o303/ d' file
Maybe you should look for an alternative, one that lets you specify octal ranges.
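
To check which octal values actually occur in a line, one option is od. This is only a sketch: the sample line is made up, and octal_dump is a hypothetical helper name, not something from this thread.

```shell
# Hypothetical helper: print each input byte as a three-digit octal number.
octal_dump() { od -An -b "$@"; }

# Made-up sample: "h", the two UTF-8 bytes of "è" (octal 303 250), "st", newline.
printf 'h\303\250st\n' | octal_dump
# The two bytes of "è" show up as 303 250 in the output.
```

Any value of 200 or above in that dump is a non-ASCII byte and a candidate for the \oNNN patterns.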
 
Old 11-09-2010, 05:41 AM   #7
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
I think the problem is that, under the POSIX standard, character classes such as [:alnum:] and [:punct:] depend on the current locale, so they include these special symbols, which are valid characters in some languages. To restrict the classes to the ASCII table, you can set the locale to C just for the execution of the sed command, e.g.
Code:
LC_ALL=C sed 's/[^[:alnum:][:punct:][:space:]]//g' file
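
A small sketch of what this command does, assuming GNU sed. The two sample lines and the function names sample and strip_non_ascii are made up for illustration.

```shell
# Hypothetical sample: one clean line and one containing the UTF-8 bytes
# for "è" (octal 303 250).
sample() { printf 'A,com,host4,,,\nA,com,h\303\250st2,,,\n'; }

# In the C locale every byte above 177 octal falls outside the alnum, punct
# and space classes, so the substitution strips it. The line itself is kept.
strip_non_ascii() { LC_ALL=C sed 's/[^[:alnum:][:punct:][:space:]]//g' "$@"; }

sample | strip_non_ascii
# prints:
# A,com,host4,,,
# A,com,hst2,,,
```

Note that this removes the offending bytes but leaves the rest of the line in place.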

Last edited by colucix; 11-09-2010 at 05:46 AM. Reason: disabled smiles in text
 
Old 11-09-2010, 06:05 AM   #8
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,128

Rep: Reputation: 4121
Once again a fine explanation by colucix. One assumes the OP wants to delete the lines, as per the previous responses, rather than just the characters.
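
If whole lines should go, the same C-locale trick can be combined with the d command from the earlier replies. A minimal sketch, assuming GNU sed; drop_non_ascii_lines and the sample data are hypothetical.

```shell
# Hypothetical helper: delete every line that contains a byte outside the
# ASCII alnum, punct and space classes.
drop_non_ascii_lines() { LC_ALL=C sed '/[^[:alnum:][:punct:][:space:]]/d' "$@"; }

# Made-up sample: the middle line contains the UTF-8 bytes for "è".
printf 'A,com,host4,,,\nA,com,h\303\250st2,,,\nA,com,host6,,\n' | drop_non_ascii_lines
# prints:
# A,com,host4,,,
# A,com,host6,,
```

It can also be given a file name, e.g. drop_non_ascii_lines test.txt.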
 
Old 11-09-2010, 06:24 AM   #9
hattori.hanzo
Member
 
Registered: Aug 2006
Posts: 168

Original Poster
Rep: Reputation: 15
Thanks colucix for the suggestion and explanation. That worked perfectly.
 
  

