I have a PHP script which is trying to parse a huge (648,854 records) XML file to populate a MySQL database. I'm using the XMLReader PHP class. The problem I'm having is with bad character(s) at the 36,905th record.
Here's the error:
XMLReader::read(): /tmp/tmp.xml:377538: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xB0 0x20 0x34 0x35 in /www/parse2.php on line 36
PHP Warning: XMLReader::read(): 34� 45' WEST in /www/parse2.php on line 36
PHP Warning: XMLReader::read(): ^ in /www/parse2 on line 36
PHP Warning: XMLReader::read(): An Error Occured while reading in /www/parse2.php on line 36
As you can see, the bad byte is being displayed as U+FFFD, the Unicode replacement character. I want to either skip the affected records during parsing or replace the bad character in the source XML file using sed or some other script. However, as I understand it, U+FFFD is not the actual character in the file, only a placeholder for bytes that could not be decoded; the error reports the offending bytes as 0xB0 0x20 0x34 0x35.
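To narrow this down, the bytes quoted in the error can be checked on their own. The following is a minimal sketch, not a confirmed diagnosis: it assumes the mbstring extension is available, and it assumes the file might really be ISO-8859-1, where 0xB0 is the degree sign (which would fit "34° 45' WEST"). The helper name `latin1_to_utf8` is made up for illustration.

```php
<?php
// Hypothetical helper, not part of the original script: re-encode a string
// under the ASSUMPTION that it is ISO-8859-1. Requires the mbstring extension.
function latin1_to_utf8(string $s): string
{
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

// The four bytes quoted in the parser error: 0xB0 0x20 0x34 0x35.
$bytes = "\xB0 45";

// 0xB0 is a UTF-8 continuation byte with no lead byte, so validation fails.
var_dump(mb_check_encoding($bytes, 'UTF-8'));   // bool(false)

// Read as ISO-8859-1, 0xB0 is U+00B0 (degree sign), encoded in UTF-8 as C2 B0.
echo bin2hex(latin1_to_utf8($bytes)), "\n";     // c2b0203435
```

If the source encoding is something other than ISO-8859-1, the second argument to the conversion would have to change accordingly.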
For testing, I'm essentially using this as the parser script (just counting the records to make sure they can all be parsed):
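$xml = new XMLReader();
$xml->open('/tmp/tmp.xml'); // path as shown in the error message above
$count = 0;
while ($xml->read()) { // read() is the call that raises the UTF-8 warning
    if ($xml->depth == 1 && $xml->name == 'Header') $xml->next(); // skip the header subtree
    else if ($xml->depth == 3 && $xml->name == 'RecordDetails')
        if (++$count % 5000 == 0) echo '.'; // progress marker every 5000 records
}
printf("# Processed %d record details.\n", $count);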
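As for the "some other script" option, one possibility is a cleanup pass that writes a repaired copy of the file before parsing. This is only a sketch under stated assumptions: `sanitize_xml_file` is a hypothetical helper, it needs the mbstring extension, and it guesses that any line which is not valid UTF-8 is ISO-8859-1, which I have not verified for this data. Lines that are already valid UTF-8 are copied unchanged.

```php
<?php
/**
 * Hypothetical helper: copy $src to $dst, re-encoding only the lines that are
 * not valid UTF-8. ASSUMES such lines are ISO-8859-1 (an unverified guess).
 * Returns the number of lines that were re-encoded.
 */
function sanitize_xml_file(string $src, string $dst): int
{
    $in  = fopen($src, 'rb');
    $out = fopen($dst, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException("Cannot open $src or $dst");
    }

    $repaired = 0;
    // Read one line at a time so the whole file never has to fit in memory.
    while (($line = fgets($in)) !== false) {
        if (!mb_check_encoding($line, 'UTF-8')) {
            $line = mb_convert_encoding($line, 'UTF-8', 'ISO-8859-1');
            $repaired++;
        }
        fwrite($out, $line);
    }

    fclose($in);
    fclose($out);
    return $repaired;
}
```

It could be called as `sanitize_xml_file('/tmp/tmp.xml', '/tmp/tmp.clean.xml')` (the second path is made up), after which the counting script would open the cleaned copy instead. Working line by line is safe here because a newline byte never occurs inside a multi-byte UTF-8 sequence.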