PHP XMLReader stops at bad character
I have a PHP script which is trying to parse a huge (648,854 records) XML file to populate a MySQL database. I'm using the XMLReader PHP class. The problem I'm having is with bad character(s) at the 36,905th record.
Here's the error:
Code:
XMLReader::read(): /tmp/tmp.xml:377538: parser error : Input is not proper UTF-8, indicate encoding !
Any ideas? For testing, I'm essentially using this as the parser script (just counting the records to make sure they can all be parsed):
Code:
$count = 0;
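The script above is cut off after its first statement. As a rough sketch only (not the poster's actual code), a counting loop with XMLReader could look like the following. The element name `record` and the function name `countRecords` are assumptions made for illustration. `libxml_use_internal_errors()` is used so that the parser error can be inspected instead of being printed as a warning.

```php
<?php
// Hypothetical sketch: count <record> elements and report where parsing stopped.
// "record" is an assumed element name; the real file's element name is not shown in the thread.
function countRecords(string $path, string $elementName = 'record'): array
{
    libxml_use_internal_errors(true);   // collect libxml errors instead of emitting warnings
    libxml_clear_errors();

    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }

    $count = 0;
    // read() returns false at end of input and also when the parser hits a fatal error
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $elementName) {
            $count++;
        }
    }
    $reader->close();

    // Each LibXMLError carries the line number and message of a parser complaint
    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();

    return ['count' => $count, 'errors' => $errors];
}
```

Calling `countRecords('/tmp/tmp.xml')` would return the number of records read before the parser stopped, along with any error lines, which helps confirm which part of the file needs repair.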
The warning says the input is not valid UTF-8, which leads me to ask: is the file meant to be UTF-8? If so, why has it become corrupt? If not, can you find out what the right encoding is and convert the file from it?
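If the stray bytes turn out to come from a single-byte encoding, converting them is one option. The sketch below assumes Windows-1252 purely as a guess, since the thread never identifies the real source encoding. `toUtf8` is a hypothetical helper name, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: re-encode a string only when it is not already valid UTF-8.
// $assumedSource is a guess (Windows-1252 here); the real encoding is unknown.
function toUtf8(string $text, string $assumedSource = 'Windows-1252'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave it untouched
    }
    // Treat every byte of the string as a character in the assumed source encoding
    return mb_convert_encoding($text, 'UTF-8', $assumedSource);
}
```

One caveat: this converts the whole string, so a line that mixes valid multi-byte UTF-8 with stray single bytes would have its valid characters double-encoded. It is safest when applied to small units such as individual lines, where a mix is less likely.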
It's for a client. It seems that there are just a couple of characters in there that are in the wrong encoding. In fact, the first 36,000-odd records were fine. So my assumption is that somewhere, somehow, somebody inserted some data with the wrong encoding. However, I am tasked with converting the XML to MySQL.
Does anybody know of a way to strip all invalid (non-UTF-8) characters out of a file? Or possibly a way to do a search and replace, such as turning "34* 45' WEST" into "34 45' WEST", where the "*" in the search is a wildcard character?
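Both ideas can be sketched in PHP. The first two functions below drop invalid bytes while copying the file line by line, so the whole file never has to fit in memory. The third shows the wildcard-style replace with a byte-level regular expression. All three function names are hypothetical, the pattern is only an illustration built from the example string in the question, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: drop every byte sequence that is not valid UTF-8.
function stripInvalidUtf8(string $text): string
{
    $previous = mb_substitute_character();  // remember the current setting
    mb_substitute_character('none');        // remove invalid bytes instead of writing "?"
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);     // restore the setting
    return $clean;
}

// Hypothetical helper: copy $inPath to $outPath line by line, cleaning each line.
// Returns the number of lines that were changed.
function cleanFile(string $inPath, string $outPath): int
{
    $in = fopen($inPath, 'rb');
    $out = fopen($outPath, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('Cannot open input or output file');
    }
    $changed = 0;
    // Splitting on newlines is safe: the byte 0x0A never occurs inside a multi-byte UTF-8 character
    while (($line = fgets($in)) !== false) {
        $clean = stripInvalidUtf8($line);
        if ($clean !== $line) {
            $changed++;
        }
        fwrite($out, $clean);
    }
    fclose($in);
    fclose($out);
    return $changed;
}

// Hypothetical helper: wildcard-style replace for the example in the question.
// There is no /u modifier, so the pattern matches raw bytes and tolerates invalid UTF-8.
function dropBytesBeforeMinutes(string $text): string
{
    return preg_replace('/(\d+)[\x80-\xFF]+( \d+\' WEST)/', '$1$2', $text) ?? $text;
}
```

A call such as `cleanFile('/tmp/tmp.xml', '/tmp/tmp.clean.xml')` would write a cleaned copy that XMLReader can then parse. Note that stripping loses information: if the bad byte was meant to be a degree sign, converting it from its real encoding preserves it, whereas stripping removes it.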