Search for unknown special characters in a text file

bad_jaye · 05-01-2008, 08:07 PM

Hi--

I need to create a shell script to search for unknown special characters in a file.

sample:
3 0O
2 SP5
1 U ASAKOU
1 MARTSSAUT ÿ¾åbÿ
2 XID CHDSC
1 STAR
2 ID75

What I need is to print out the line "1 MARTSSAUT ÿ¾åbÿ". I only need to retrieve the visible special characters like "ÿ¾åbÿ". Note that the special characters vary from time to time, so I need a flexible script that excludes [A-Z], [a-z], [0-9], white space, and any characters that can be found on a keyboard, like "\ | * ? ^ # @ ! ~" and so on.

I hope someone can help me with this. I tried using grep, awk and tr but I cannot seem to get the desired result.

Thanks in advance.

eggixyz · 05-01-2008, 09:40 PM

Hey There,

You can use od to get you started. It will show every character in the file, and then you can ignore the regular stuff. (Note that the first field of the default output is the byte offset, in octal, so you can ignore that, too.)

Ex: with your file

Code:

-bash-3.2$ od -c yourFile
0000000    3       0   O  \n   2       S   P   5  \n   1       U       A
0000020    S   A   K   O   U  \n   1       M   A   R   T   S   S   A   U
0000040    T       ÿ   ¾   å   b   ÿ  \n   2       X   I   D       C   H
0000060    D   S   C  \n   1       S   T   A   R  \n   2       I   D   7
0000100    5  \n
0000102
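If reading the dump by eye is too slow, the same od idea can be scripted. This is only a rough sketch, not a tested production script: the function name special_byte_lines is made up for illustration, and it assumes "regular" means the ASCII bytes 32 to 126 plus tab and newline. It dumps every byte as a decimal number and lets awk report the line numbers where anything else shows up.

```shell
#!/bin/sh
# Sketch: print the line numbers of a file that contain a byte outside
# printable ASCII (32-126), tab (9) and newline (10).
# special_byte_lines is a hypothetical helper name.
special_byte_lines() {
    # -An drops the offset column, -v keeps repeated rows, -t u1 prints
    # each byte as an unsigned decimal number.
    od -An -v -t u1 "$1" | awk '
        BEGIN { line = 1 }
        {
            for (i = 1; i <= NF; i++) {
                b = $i + 0
                if (b == 10) { line++; continue }   # newline: next line
                if (b != 9 && (b < 32 || b > 126) && !(line in seen)) {
                    seen[line] = 1
                    print line
                }
            }
        }'
}
```

Run it as "special_byte_lines yourFile". Note that under these assumptions a carriage return from a DOS-style file would also be reported.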

Hope that helps get you started

Let me know if you need further help

, Mike

bad_jaye · 05-02-2008, 08:11 AM

Hi eggixyz,

Thanks for the help. Really appreciate it!!!

The thing is that I need a script that will only output the special characters. The text files that I search are usually hundreds of lines long, so using your script would be like searching each line manually for special characters. That is why I need only the lines where special characters are present. In the example, I want the...

1 MARTSSAUT ÿ¾åbÿ

to be the only output of the script.

More power to you and God bless!!

This is a very tedious and excruciating job if I have to go checking each line. Please help me.. O GOD Help me!!!

pixellany · 05-02-2008, 08:17 PM

I have been struggling with this, but have not solved it.

jsurles · 05-02-2008, 09:35 PM

Quote:

Originally Posted by pixellany

I have been struggling with this, but have not solved it.

I would do this. Of course, you'll need to add in the special chars like !@#$%^&&**() etc. I'm not sure if there's an easy range like A-z or 0-9 for those chars, but this seems to work:

Code:

for each in `sed 's/\(\)/ /g' samplefile`
do
  echo $each | egrep -v "[A-z]|[0-9]"
done

Tinkster · 05-02-2008, 09:46 PM

Quote:

Originally Posted by bad_jaye

Hi--
... script that excludes [A-Z][a-z][0-9],white spaces and any characters that can be found on keyboard like "\ | * ? ^ # @ ! ~" and so on.

I hope someone can help me with this. I tried using grep, awk and tr but I cannot seem to get the desired result.

Thanks in advance.

The "^I" below were produced in vi by pressing Ctrl-v<TAB>
Save it as clean.sed

Code:

s^I[^][ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\^_`abcdefghijklmnopqrstuvwxyz{|}~]^I^Ig

sed -f clean.sed funny_text
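
The sed script above strips the unusual characters out of the text. If the goal is the opposite, to see only the unusual characters, a possible sketch is to delete everything "normal" with tr and drop the lines that end up empty. The function name show_special_bytes is made up for illustration, and "normal" is assumed to mean tab plus the ASCII range from space to ~ (octal 040 to 176).

```shell
#!/bin/sh
# Sketch: print only the bytes that are NOT tab or printable ASCII,
# one output line per input line that had any.
# show_special_bytes is a hypothetical helper name.
show_special_bytes() {
    # LC_ALL=C makes tr and sed work byte by byte, whatever the encoding.
    LC_ALL=C tr -d '\011\040-\176' < "$1" | LC_ALL=C sed '/^$/d'
}
```

Run it as "show_special_bytes funny_text". Newlines are kept on purpose so that each surviving line corresponds to one input line.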

Cheers,
Tink

eggixyz · 05-02-2008, 09:55 PM

Hey There,

This will do it for your script.

Code:

#!/usr/bin/perl

open(FILE, "<G");
while (<FILE>)  {
        if ( $_ =~ /[^A-Za-z0-9\s\t]/ ) {
               print $_
        }
}
close(FILE);

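And, even though this is ugly, I think this ignores pretty much everything that's "normal" (all 94 regular printable characters, plus space, tab and the other whitespace that \s already covers). Add anything else you don't want flagged inside the brackets.

Code:

#!/usr/bin/perl

# "G" is the name of the input file; change it to match yours.
open(FILE, "<G") or die "Cannot open G: $!";
while (<FILE>) {
    # Print any line containing a character outside letters, digits,
    # whitespace and the 32 ASCII punctuation characters. The @ and $
    # are escaped so Perl does not try to interpolate a variable.
    if ( $_ =~ /[^A-Za-z0-9\s`\-=\[\]\\;',.\/~!\@#\$%^&*()_+{}|:"<>?]/ ) {
        print $_;
    }
}
close(FILE);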

If you need the regexp outside of Perl, sed/awk should be able to make all the same matches, with an extra backslash or two.
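
For a shell-only version, something along these lines might be enough. It is a sketch under assumptions, not a drop-in answer: the function name find_special_lines is made up, and it treats "special" as any byte that is neither printable ASCII nor whitespace. Forcing the C locale makes the POSIX bracket classes byte-based, so bytes like the ones in "ÿ¾åbÿ" fall outside [:print:].

```shell
#!/bin/sh
# Sketch: print only the lines that contain at least one byte that is
# neither printable ASCII nor whitespace.
# find_special_lines is a hypothetical helper name.
find_special_lines() {
    # In the C locale, [:print:] covers space through ~ and [:space:]
    # covers tab, newline and friends; anything else is "special".
    LC_ALL=C grep '[^[:print:][:space:]]' "$1"
}
```

Run it as "find_special_lines yourFile"; add -n to the grep call if you also want line numbers. Some grep versions may report a binary file match instead of printing the line, in which case the -a option (available in GNU and BSD grep, as far as I know) should force text mode.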

Hope that helps

, Mike