LinuxQuestions.org
Old 10-27-2008, 02:49 PM   #1
Firebar
Member
 
Registered: Feb 2005
Location: Southampton (UK)
Distribution: Debian, RHEL and SuSE
Posts: 69

Rep: Reputation: 15
ASCII characters in my script...


Hi everyone,

I have a little script which manipulates some HTML files and extracts information from the pages. When I use wget to download an HTML file I end up with some special characters in there, like ^M (typed as Control-V then Control-M; octal 15 in the ASCII table). The most annoying of these characters is @ (octal 100). I'm using http://www.asciitable.com/ as a reference.

I was under the impression that dos2unix would remove these annoyances - but I was wrong on that assumption. So, using sed I've replaced the ^M with nothing. On the other hand, the @ I just cannot remove, simply because I cannot find the right character/key combo to replicate/enter it into my script.

So, an example:

sed 's/^M//g' file > tada

^ that removes the 015 carriage return character.

I can't enter the @ (100) character into my script, therefore I can't remove it. If I refer to the octal representation using tr like this:

tr -d "\100" < file > tada

It still doesn't remove the characters, and that was sorta my last resort.

Can anyone help me please? I hope I'm making sense.

EDIT - If I press ALT + 100 (numeric pad) I get a weird character on the command line, which I'm presuming is the @ (although it looks like a square box). I just can't enter that character into my script (using vi).

Last edited by Firebar; 10-27-2008 at 02:51 PM.
 
Old 10-27-2008, 03:08 PM   #2
keefaz
LQ Guru
 
Registered: Mar 2004
Distribution: Slackware
Posts: 6,230

Rep: Reputation: 724
My guess would be that although you see the @ char, it is actually not the @ char (ASCII octal 100) but another char.

Try this experiment:
Code:
echo -e '\000' > testfile
vi testfile
As you can see, the ASCII 0 character (NUL) is shown as @ in vi.
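To make the distinction concrete, here is a minimal sketch that builds a throwaway file containing both a NUL byte (octal 000) and a literal @ (octal 100), then deletes only the NUL. The file contents and variable names are made up for illustration, and it assumes a `tr` that handles NUL bytes, as GNU coreutils does.

```shell
# Hypothetical sample: one NUL byte (octal 000) and one literal @ (octal 100).
sample=$(mktemp)
printf 'a\000b@c\n' > "$sample"

# Dump the bytes: the NUL shows up as \0, the literal @ as @.
od -c "$sample"

# Delete only the NUL bytes; the literal @ is left untouched.
nul_removed=$(tr -d '\000' < "$sample")
echo "$nul_removed"

rm -f "$sample"
```

If the stray characters were NUL bytes, `tr -d '\000'` would be the command to use rather than `tr -d '\100'`.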
 
Old 10-27-2008, 03:11 PM   #3
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 21,698

Rep: Reputation: 5738
Try putting a "\" in front of the @, like this:

Code:
sed 's/\@//g' file > tada1
Also, in vi, if you want to enter a control character (like CTRL-M), you can hit a CTRL-V first, then hit CTRL-M (or whatever else you'd like).
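For scripts, the control character does not have to be typed at all. The sketch below, run against a made-up two-line sample, removes carriage returns with the `\r` escape and literal @ characters with the octal escape `\100`. It shows that `tr -d '\100'` does delete a real @ when one is present in the file.

```shell
# Hypothetical sample with DOS line endings and one literal @.
sample=$(mktemp)
printf 'line one\r\nuser@example\r\n' > "$sample"

# Remove carriage returns (octal 015) without typing a literal ^M.
no_cr=$(tr -d '\r' < "$sample")

# Remove literal @ characters (octal 100), then the carriage returns.
no_at=$(tr -d '\100' < "$sample" | tr -d '\r')

printf '%s\n' "$no_cr"
printf '%s\n' "$no_at"

rm -f "$sample"
```

If the @ symbols survive `tr -d '\100'`, that is a strong hint they are not byte 100 in the file.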
 
Old 10-27-2008, 03:12 PM   #4
Disillusionist
Senior Member
 
Registered: Aug 2004
Location: England
Distribution: Ubuntu
Posts: 1,039

Rep: Reputation: 97
Your code is correct, which suggests that the character might not be @.

Have you tried passing the file to od?

Code:
cat testfile|od
 
Old 10-27-2008, 04:18 PM   #5
Firebar
Member
 
Registered: Feb 2005
Location: Southampton (UK)
Distribution: Debian, RHEL and SuSE
Posts: 69

Original Poster
Rep: Reputation: 15
Thanks for your replies.

Quote:
as you can see the ascii 0 character (NUL) is seen as @ in vi
The NUL character has a ^ preceding it, which can be seen in vi. The characters I'm coming across just look like normal @'s :S

Quote:
Try putting a "\" in front of the @, like
Unfortunately that doesn't work, presumably because these are some crazy ass characters rather than a 'normal' @

Quote:
have you tried passing the file to od
I can't say anything jumps out at me when I do this. These @ symbols are in a column all down the left margin (when viewing the file in vi), but od shows nothing to suggest a run of the same symbol.
 
Old 10-27-2008, 04:21 PM   #6
Firebar
Member
 
Registered: Feb 2005
Location: Southampton (UK)
Distribution: Debian, RHEL and SuSE
Posts: 69

Original Poster
Rep: Reputation: 15
I should probably add that these files containing the characters are HTML pages downloaded using wget. Just to clarify. Is it perhaps some weird kind of HTML character?
 
Old 10-27-2008, 04:30 PM   #7
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1979
Please, try
Code:
od -c file
and see if the @ symbol is "translated" to anything else. Moreover, you can try
Code:
od -tx1 file
to retrieve the hexadecimal codes of every single character.
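As a rough sketch of how to read that output, the commands below dump a made-up one-line sample both ways and then count how many @ bytes (hex 40) and carriage returns (hex 0d) the file really contains. The sample text and variable names are assumptions for illustration only.

```shell
# Hypothetical sample: a DOS-terminated line with no @ in it.
sample=$(mktemp)
printf 'no at sign here\r\n' > "$sample"

# Character view: the carriage return appears as \r, the newline as \n.
od -c "$sample"

# Hex view without the offset column: a literal @ would appear as 40.
od -An -tx1 "$sample"

# Count the bytes directly: keep only the byte of interest, then count.
at_count=$(tr -cd '@' < "$sample" | wc -c | tr -d ' ')
cr_count=$(tr -cd '\r' < "$sample" | wc -c | tr -d ' ')
echo "@ bytes: $at_count, CR bytes: $cr_count"

rm -f "$sample"
```

A count of zero for @ means whatever the editor is drawing on screen is not stored in the file.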
 
Old 10-27-2008, 04:38 PM   #8
keefaz
LQ Guru
 
Registered: Mar 2004
Distribution: Slackware
Posts: 6,230

Rep: Reputation: 724
Maybe it's a vi display behaviour and the character doesn't really exist in the file.
I have seen the @ character when opening files with very long lines in vi...
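One way to check this theory without opening an editor is to measure the longest line in the file and compare it with the terminal width. The sketch below uses a made-up sample with one 500-character line; the file and variable names are illustrative.

```shell
# Hypothetical sample: one 500-character line followed by a short one.
sample=$(mktemp)
printf '%0500d\nshort\n' 0 > "$sample"

# Report the length of the longest line in the file.
longest=$(awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }' "$sample")
echo "longest line: $longest characters"

rm -f "$sample"
```

If the longest line is far wider than the screen, the @ column is most likely the editor marking screen rows where a wrapped line does not fit. In Vim specifically, `:set display=lastline` changes how such lines are drawn; other vi implementations may behave differently.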
 
Old 10-27-2008, 04:44 PM   #9
Firebar
Member
 
Registered: Feb 2005
Location: Southampton (UK)
Distribution: Debian, RHEL and SuSE
Posts: 69

Original Poster
Rep: Reputation: 15
^ yes, my file is of type ASCII with very long lines.
 
Old 10-27-2008, 04:59 PM   #10
Firebar
Member
 
Registered: Feb 2005
Location: Southampton (UK)
Distribution: Debian, RHEL and SuSE
Posts: 69

Original Poster
Rep: Reputation: 15
Well, I've got to my 'post-processing' stage and it hasn't caused an issue. So in conclusion it must be a vi symptom. Pico shows no dodgy characters.

I guess that wraps it up. As ever, thanks Linuxquestions users
 
  

