Replacing an ASCII 272 character with space

sjobs12 · 06-01-2017, 04:45 PM

I have a file with a character that translates into 272 when I do:
od -xc

The character looks like an A with a caret on top with a degree sign to the right. I have tried sed, perl and tr and none of them are working.

For example:
sed -e 's/'$(echo "272")'/ /g' input_file > output_file

This does not work.

Any suggestions?

syg00 · 06-01-2017, 06:37 PM

That's because you are feeding in a string representation: $(echo "272") just yields the three characters 2, 7 and 2. You can use hex values directly in the sed substitution by using \x.. (or octal with \o...).
There's something wrong if that (decimal) number is supposed to be a single byte, since a byte only goes up to 255. Somebody who understands the myriad code pages might be able to explain that to you.
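
A minimal sketch of the \x.. and \o... escapes mentioned above. It assumes GNU sed (both escapes are GNU extensions, not POSIX) and assumes the byte in question is octal 272, i.e. hex BA. The function names are made up for illustration. LC_ALL=C is set so that sed treats the input as single bytes rather than multi-byte characters.

```shell
#!/bin/sh
# Hypothetical helper: replace every byte 0xBA (octal 272) on stdin with a space.
# Assumes GNU sed; \xHH and \oNNN are GNU extensions.
blank_byte() {
    LC_ALL=C sed 's/\xba/ /g'
}

# The same substitution written with the octal escape.
blank_byte_octal() {
    LC_ALL=C sed 's/\o272/ /g'
}

# Demo: feed a line containing the byte and dump the result byte by byte.
printf 'a\272b\n' | blank_byte | od -c
```

On a real file this would be used as blank_byte < input_file > output_file.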

Laserbeak · 06-01-2017, 07:08 PM

Code:

$ cat 272.pl
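#!/usr/bin/perl
# Removes the literal four-character text "\272" (backslash, 2, 7, 2),
# not the single byte with octal value 272.
while (<>) {
    s/\\272//g;
    print;    # $_ still ends with its newline, so none is added
}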

$ cat infile
Hello Wor\272ld! It's a \272Small World After\272 All! Though the oceans are wide and the mountains divide, \272It's \272A Small World After \272All!
$ ./272.pl < infile > outfile
$ cat outfile
Hello World! It's a Small World After All! Though the oceans are wide and the mountains divide, It's A Small World After All!
$

NevemTeve · 06-02-2017, 12:20 AM

Note that ASCII only covers codes 0-127. For 8-bit values (0..255) there is ISO-8859-x (and many others). Above that there is Unicode.

http://www.unicode.org/charts/PDF/U0100.pdf

Code:

U+0110 Đ LATIN CAPITAL LETTER D WITH STROKE

0x0110 = 272

Or, if 272 was meant to be hexadecimal:

http://www.unicode.org/charts/PDF/U0250.pdf

Code:

0272 ɲ LATIN SMALL LETTER N WITH LEFT HOOK
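
The readings above differ only in the base used for the digits "272". A small sketch that prints the value under each reading, using POSIX shell arithmetic; the helper name show_bases is hypothetical. The octal reading is included as a third possibility because od -c prints non-printable bytes in octal.

```shell
#!/bin/sh
# Hypothetical helper: print one number in decimal, octal and hex.
show_bases() {
    printf 'dec=%d oct=%o hex=%x\n' "$1" "$1" "$1"
}

# Shell arithmetic accepts 0NNN (octal) and 0xNN (hex) constants.
show_bases $((272))     # "272" read as decimal:     dec=272 oct=420 hex=110
show_bases $((0x272))   # "272" read as hexadecimal: dec=626 oct=1162 hex=272
show_bases $((0272))    # "272" read as octal:       dec=186 oct=272 hex=ba
```

Only the octal reading (186, hex BA) fits in a single byte.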

Anyway, you should first find out the actual file encoding, e.g.:

Code:

$ printf 'szűrő' | od -tx1                   
0000000 73 7a fb 72 f5 # ISO-8859-2

$ printf 'szűrő' | iconv -f ISO-8859-2 -t UTF-8 | od -tx1
0000000 73 7a c5 b1 72 c5 91

$ printf 'szűrő' | iconv -f ISO-8859-2 -t UTF-16LE | od -tx1
0000000 73 00 7a 00 71 01 72 00 51 01

$ printf 'szűrő' | iconv -f ISO-8859-2 -t UTF-16BE | od -tx1
0000000 00 73 00 7a 01 71 00 72 01 51
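
One rough way to narrow down the encoding, sketched under the assumption that iconv is installed: ask it to convert from UTF-8 to UTF-8 and look at the exit status, which is non-zero when the input contains a byte sequence that is not valid UTF-8. The helper name is_utf8 is hypothetical. Passing the check does not prove the file is UTF-8 (plain ASCII passes too); failing it does rule UTF-8 out.

```shell
#!/bin/sh
# Hypothetical helper: succeed if stdin is well-formed UTF-8.
# iconv exits non-zero on a sequence that is invalid in the source encoding.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# The UTF-8 bytes from the example above (73 7a c5 b1 72 c5 91), in octal escapes.
if printf 'sz\305\261r\305\221' | is_utf8; then
    echo "valid UTF-8"
fi

# The ISO-8859-2 bytes (73 7a fb 72 f5) are not valid UTF-8.
if ! printf 'sz\373r\365' | is_utf8; then
    echo "not UTF-8 (perhaps ISO-8859-x)"
fi
```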

sundialsvcs · 06-02-2017, 03:06 PM

A single byte has the numeric range 0..255, and does not include characters such as the one you mention.

Therefore, we know that it is a Unicode character, and that it is being represented in the UTF-8 encoding scheme as (in this case ....) a pair of bytes.

Any UTF-aware "search and replace in a string" function should be able to accomplish this job trivially ... and, today, most of them are. (But you might in some cases have to select UTF-encoding.)

ntubski · 06-02-2017, 03:13 PM

The OP used od -xc which produces sequences of 4 hex digits, single letters, and 3 octal digits. Therefore, '272', being 3 digits, is most likely an octal number.

Code:

$ printf $'\272etc' | od -xc
0000000    65ba    6374
        272   e   t   c
0000004
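
If the octal reading is right, tr can do the replacement with an octal escape. A minimal sketch, assuming the file really contains the single byte 0xBA; the function name space_out_272 is made up for illustration, and LC_ALL=C is set so tr works on single bytes.

```shell
#!/bin/sh
# Hypothetical helper: replace byte 0xBA (octal 272) on stdin with a space.
# tr understands \NNN octal escapes.
space_out_272() {
    LC_ALL=C tr '\272' ' '
}

# Demo with the same input as above; od -c should now show a space first.
printf '\272etc\n' | space_out_272 | od -c
```

Usage on a file would be space_out_272 < input_file > output_file. If the byte is actually the second half of a two-byte UTF-8 sequence, this leaves the first byte behind, so check the surrounding bytes in the od output first.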

sundialsvcs · 06-02-2017, 08:46 PM

Quote:

Originally Posted by ntubski

The OP used od -xc which produces sequences of 4 hex digits, single letters, and 3 octal digits. Therefore, '272', being 3 digits, is most likely an octal number.

(Very respectfully ...) The OP also originally described "the character that (s)he was actually seeing" as: "The character looks like an A with a caret on top with a degree sign to the right." Thus, I must assume that the reference is to the binary value of the first of two bytes which actually comprise the character. The Wikipedia article on 'UTF-8' describes this multi-byte encoding scheme in detail.

(Best-guess as to the actual character: "Á")
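
One way to test the two-byte idea is to see what a UTF-8 pair looks like when it is misread as a single-byte encoding. The sketch below assumes, purely for illustration, that the pair is C2 BA (octal 302 272), which is the UTF-8 encoding of U+00BA; the thread never shows the byte next to 272, so this is a guess. The helper name misread_as_latin1 is hypothetical.

```shell
#!/bin/sh
# Assumption: the file holds the UTF-8 pair C2 BA (octal 302 272).
# Treat each byte as one ISO-8859-1 character and re-encode as UTF-8,
# to see which two characters a Latin-1 viewer would display.
misread_as_latin1() {
    iconv -f ISO-8859-1 -t UTF-8
}

printf '\302\272' | misread_as_latin1 | od -An -tx1
# c3 82 is U+00C2 (A with circumflex); c2 ba is U+00BA (masculine ordinal
# indicator, a small raised o that is easy to take for a degree sign).
```

If the OP's od output shows 302 immediately before 272, that would support this reading, and both bytes would need to be replaced together.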