Python: how do you replace Unicode chars in large text files?

BrianK · 12-18-2008, 10:06 PM

If I have a text file that's ~1M chars & there happen to be 4 or 5 Unicode (non-ASCII) chars in it... is there a quick way to find & replace them without trying to convert the whole thing?

I don't care what they are converted to... could be '_' for all I care. If there's a way to convert to a similar char, that would be great, but not necessary.

If it matters, I'm trying to get around this problem:

Code:

  File "/usr/lib/python2.5/smtplib.py", line 493, in data
    self.send(q)
  File "/usr/lib/python2.5/smtplib.py", line 320, in send
    self.sock.sendall(str)
  File "<string>", line 1, in sendall
UnicodeEncodeError: 'ascii' codec can't encode character u'\uf029' in position 42061: ordinal not in range(128)

atom · 12-19-2008, 12:54 AM

You can get the behaviour you want with iconv -c.

For instance, you might want to do cat largefile | iconv -f utf-8 -t ascii -c > new_file. The -c flag silently drops every character that cannot be converted to the target encoding (here, anything that is not ASCII). You can also do that from Python.
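A rough Python equivalent of iconv -c, as a minimal sketch rather than a definitive recipe. It assumes Python 3 and text that has already been decoded to str; the function name to_ascii and its fold parameter are made up for illustration. With errors="ignore" the non-ASCII characters are dropped, with errors="replace" each one becomes '?', and the optional NFKD normalization step turns accented letters into their plain base letter where Unicode defines such a decomposition.

```python
import unicodedata


def to_ascii(text, errors="ignore", fold=False):
    """Return `text` reduced to ASCII (hypothetical helper).

    errors: "ignore" drops non-ASCII characters, "replace" turns each into '?'.
    fold:   if True, decompose characters first (NFKD) so that an accented
            letter becomes a base letter plus a combining mark.
    """
    if fold:
        text = unicodedata.normalize("NFKD", text)
    # encode() applies the error handler; decode() gives back a str
    return text.encode("ascii", errors).decode("ascii")


body = "price \uf029 ok, regards Ga\u0161per"
print(to_ascii(body))                    # non-ASCII characters dropped
print(to_ascii(body, errors="replace"))  # each one replaced by '?'
print(to_ascii(body, fold=True))         # accents folded, the rest dropped
```

Characters with no decomposition, such as the private-use u'\uf029' from the traceback, are still dropped or replaced even with fold=True. Note that combining fold=True with errors="replace" leaves a '?' after each folded letter, because the combining mark itself is not ASCII.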

There is also another solution: use a regular expression to replace every character whose code is above 127. UTF-8 is identical to ASCII for codes 0 to 127, so everything in that range is left untouched.
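The regular expression approach could look something like this. It is only a sketch, assuming Python 3 and decoded text; the names NON_ASCII and replace_non_ascii are invented for the example, and '_' is used as the replacement because the original question suggested it.

```python
import re

# Matches any single character outside the ASCII range 0-127.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def replace_non_ascii(text, replacement="_"):
    """Replace every non-ASCII character in `text` (hypothetical helper)."""
    return NON_ASCII.sub(replacement, text)


print(replace_non_ascii("position \uf029 here"))
print(replace_non_ascii("Ga\u0161per", replacement=""))
```

If the same pattern is run over raw UTF-8 bytes instead of decoded text, each non-ASCII character shows up as two or more bytes above 127, so you would get several replacement characters per original character.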

Regards,

Gašper