LinuxQuestions.org


bad_jaye 05-01-2008 08:07 PM

search unknown special characters on a textfile
 
Hi--

I need to create a shell script to search for unknown special characters in a file.

sample:
3 0O
2 SP5
1 U ASAKOU
1 MARTSSAUT ÿ¾åbÿ
2 XID CHDSC
1 STAR
2 ID75

What I need is to print out the "1 MARTSSAUT ÿ¾åbÿ". I only need to retrieve the visible special characters like "ÿ¾åbÿ". Note that the special characters vary from time to time, so I need a flexible script that excludes [A-Z], [a-z], [0-9], white space and any characters that can be found on a keyboard, like "\ | * ? ^ # @ ! ~" and so on.

I hope someone can help me with this. I tried using grep, awk and tr but I cannot seem to get the desired result.

Thanks in advance.

eggixyz 05-01-2008 09:40 PM

Hey There,

You can use od to get started. It will pick all of those characters out, and then you can ignore the regular ones (note that the first field of the default output is the character offset, so you can ignore that, too).

Ex: with your file

Code:

-bash-3.2$ od -c yourFile
0000000    3      0  O  \n  2      S  P  5  \n  1      U      A
0000020    S  A  K  O  U  \n  1      M  A  R  T  S  S  A  U
0000040    T      ÿ  ¾  å  b  ÿ  \n  2      X  I  D      C  H
0000060    D  S  C  \n  1      S  T  A  R  \n  2      I  D  7
0000100    5  \n
0000102

Hope that helps get you started :)

Let me know if you need further help

, Mike

bad_jaye 05-02-2008 08:11 AM

Hi eggixyz,

Thanks for the help. Really appreciate it!!!

The thing is that I need a script that will only output the special characters. The text files that I search are usually composed of hundreds of lines. Using your script would be like searching each line manually for special characters. That is why I need only the lines where special characters are present. In the example, I want the...

1 MARTSSAUT ÿ¾åbÿ

to be the only output of the script.

More power to you and God bless!!

This is a very tedious and excruciating job if I have to go checking each line. Please help me.. O GOD Help me!!!

pixellany 05-02-2008 08:17 PM

I have been struggling with this, but have not solved it.

jsurles 05-02-2008 09:35 PM

Quote:

Originally Posted by pixellany (Post 3140564)
I have been struggling with this, but have not solved it.

I would do this. Of course, you'll need to add in the special chars like !@#$%^&&**() etc. I'm not sure if there's an easy range like A-Z or 0-9 for those chars, but this seems to work:

Code:

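# The sed call is meant to put a space around each character so the loop
# sees them one at a time; set -f stops a lone * or ? expanding to file names.
set -f
for each in $(sed 's/\(\)/ /g' samplefile)
do
  printf '%s\n' "$each" | grep -Ev '[A-Za-z0-9]'
done
set +f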
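
On the question of an easy range for the punctuation: POSIX bracket expressions have named classes, so [:punct:] covers the keyboard symbols much as A-Z covers the capitals. Below is a minimal sketch, assuming a grep that supports -a and -n (GNU and BSD grep do). The function name report_special is made up for the illustration, and the octal escapes in the demo assume the odd characters are single Latin-1 bytes. LC_ALL=C makes grep work byte by byte, so anything outside plain ASCII counts as special.

```shell
#!/bin/sh
# report_special is a hypothetical helper name, not from the thread.
# Prints "line number:line" for each line holding a byte that is not
# an ASCII letter, digit, punctuation mark or whitespace.
report_special() {
    LC_ALL=C grep -a -n '[^[:alnum:][:punct:][:space:]]' "$@"
}

# Demo on a cut-down copy of the sample. The octal escapes stand in for the
# odd bytes (assumed Latin-1), and cat -v renders them in plain ASCII.
printf '3 0O\n2 SP5\n1 MARTSSAUT \377\276\345b\377\n2 ID75\n' | report_special | cat -v
```

On a real file it would be called as report_special samplefile; the line number in front of each hit helps when the file runs to hundreds of lines.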


Tinkster 05-02-2008 09:46 PM

Quote:

Originally Posted by bad_jaye (Post 3139322)
Hi--
... script that excludes [A-Z][a-z][0-9],white spaces and any characters that can be found on keyboard like "\ | * ? ^ # @ ! ~" and so on.

I hope someone can help me with this. I tried using grep, awk and tr but I cannot seem to get the desired result.

Thanks in advance.

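The "^I" characters below were produced in vi by pressing Ctrl-v <TAB>; the tab serves as the delimiter of the s command.
Save the first line as clean.sed, then run the second line as the command. It deletes every character that is not in the bracketed list.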

Code:

s^I[^][ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\^_`abcdefghijklmnopqrstuvwxyz{|}~]^I^Ig
sed -f clean.sed funny_text
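
The sed script above strips the odd characters out of the text. To get the opposite, i.e. only the odd characters, one option is to delete everything printable instead. Here is a rough sketch with tr; the function name extract_special is an assumption of mine and not part of the thread. LC_ALL=C makes tr and grep work byte by byte.

```shell
#!/bin/sh
# extract_special is a hypothetical helper name.
# tr deletes printable ASCII plus tab and carriage return but keeps newlines,
# so each output line holds only the leftover bytes; grep drops empty lines.
extract_special() {
    LC_ALL=C tr -d '[:print:]\t\r' | LC_ALL=C grep -a -v '^$'
}

# Demo: the octal escapes stand in for the odd bytes (assumed Latin-1);
# od -c shows which bytes survive.
printf '2 SP5\n1 MARTSSAUT \377\276\345b\377\n1 STAR\n' | extract_special | od -c
```

It reads standard input, so a real run would be extract_special < funny_text. Note that an ordinary letter sitting between two odd characters (the "b" in the sample) is dropped along with the rest of the printable text.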



Cheers,
Tink

eggixyz 05-02-2008 09:55 PM

Hey There,

This will do it for your script.

Code:

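#!/usr/bin/perl
use strict;
use warnings;

# "G" is the name of the input file; change it to suit.
open(my $fh, '<', 'G') or die "Cannot open G: $!";
while (<$fh>) {
        # Print any line with a character that is not a letter, digit or whitespace
        if ( /[^A-Za-z0-9\s]/ ) {
                print;
        }
}
close($fh);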

And, even though this is ugly, I think this ignores pretty much everything that's "normal" (all 94 regular characters and space and tab -- just add \n, etc for whatever extra characters you want to not earmark)

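Code:

#!/usr/bin/perl
use strict;
use warnings;

# "G" is the name of the input file; change it to suit.
open(my $fh, '<', 'G') or die "Cannot open G: $!";
while (<$fh>) {
        # $ and @ are escaped so Perl does not interpolate them as variables
        if ( /[^A-Za-z0-9\s\`\-=\[\]\\;\',\.\/~!\@#\$%^&\*\(\)_+\{\}\|:\"<>\?]/ ) {
                print;
        }
}
close($fh);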


If you need the regexp outside of Perl, sed/awk should be able to make all the same matches, with an extra backslash or two.
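
For reference, the sed route might look like the sketch below. It swaps the long hand-written list for POSIX character classes, which is a substitution on my part rather than a literal translation of the Perl regexp, and print_special_lines is a made-up name. LC_ALL=C is there so that sed treats the input byte by byte, whatever encoding the file is in.

```shell
#!/bin/sh
# print_special_lines is a hypothetical helper name.
# Usage: print_special_lines FILE...   (reads stdin when no file is given)
# -n suppresses the default output; p prints only the lines that match,
# i.e. lines with a byte that is neither printable ASCII nor whitespace.
print_special_lines() {
    LC_ALL=C sed -n '/[^[:print:][:space:]]/p' "$@"
}

# Demo: the octal escapes stand in for the odd bytes (assumed Latin-1),
# and cat -v renders them in plain ASCII.
printf '3 0O\n1 MARTSSAUT \377\276\345b\377\n2 XID CHDSC\n' | print_special_lines | cat -v
```

Because the class covers every printable ASCII character, keyboard symbols such as \ | * ? ^ # @ ! ~ never trigger a match, so there is no list to keep up to date.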

Hope that helps :)

, Mike


All times are GMT -5.