I have a PHP script that parses a huge XML file (648,854 records) to populate a MySQL database, using PHP's XMLReader class. The parser fails on bad character(s) in the 36,905th record.
Here's the error:
Code:
XMLReader::read(): /tmp/tmp.xml:377538: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xB0 0x20 0x34 0x35 in /www/parse2.php on line 36
PHP Warning: XMLReader::read(): 34� 45' WEST in /www/parse2.php on line 36
PHP Warning: XMLReader::read(): ^ in /www/parse2 on line 36
PHP Warning: XMLReader::read(): An Error Occured while reading in /www/parse2.php on line 36
As you can see, the bad character is displayed as U+FFFD, the Unicode replacement character. If I understand correctly, U+FFFD is not the character actually stored in the file but a placeholder for a byte that could not be decoded; the error output gives the offending bytes as 0xB0 0x20 0x34 0x35. I want to either skip the affected records during parsing or replace the bad character in the source XML file beforehand, using sed or some other script.
Any ideas?
For testing, I'm essentially using this as the parser script (just counting the records to make sure they can all be parsed):
Code:
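$xml = new XMLReader();
$xml->open('/tmp/tmp.xml'); // path taken from the error message above
$count = 0;
echo "Reading";
while ($xml->read())
{
    // Skip the Header subtree
    if ($xml->depth == 1 && $xml->name == 'Header') $xml->next();
    // FileDetail: count opening tags only, since END_ELEMENT nodes share the same name and depth
    else if ($xml->depth == 3 && $xml->name == 'RecordDetails' && $xml->nodeType == XMLReader::ELEMENT)
    {
        $count++;
        // Print a progress dot once per 5,000 records, not on every node read
        if ($count % 5000 == 0) echo '.';
    }
}
$xml->close();
printf("# Processed %d record details.\n", $count);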
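
One possible way to handle the "replace the bad character in the source file" option, as a minimal sketch rather than a definitive fix: byte 0xB0 is the degree sign in ISO-8859-1, which fits "34° 45' WEST", so the file may contain Latin-1 text in places. The function name repairEncoding and the output path in the usage note are hypothetical, and the sketch assumes the mbstring extension is available and that every line failing UTF-8 validation is really ISO-8859-1.

```php
<?php
// Hypothetical helper: copy $src to $dst line by line, re-encoding only the
// lines that are not valid UTF-8. ASSUMPTION: such lines are ISO-8859-1
// (0xB0 is the degree sign in that encoding).
function repairEncoding(string $src, string $dst): int
{
    $in  = fopen($src, 'rb');
    $out = fopen($dst, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('Could not open input or output file');
    }

    $fixed = 0; // number of lines that had to be re-encoded
    while (($line = fgets($in)) !== false) {
        if (!mb_check_encoding($line, 'UTF-8')) {
            // Reinterpret the whole line as ISO-8859-1 and convert it to UTF-8
            $line = mb_convert_encoding($line, 'UTF-8', 'ISO-8859-1');
            $fixed++;
        }
        fwrite($out, $line);
    }

    fclose($in);
    fclose($out);
    return $fixed;
}
```

It could be run once before parsing, for example repairEncoding('/tmp/tmp.xml', '/tmp/tmp.utf8.xml'), with the parser then pointed at the repaired copy. One caveat: a line that mixes valid multi-byte UTF-8 with stray Latin-1 bytes would have its valid characters double-encoded, so it is worth inspecting a few of the changed lines before loading the database.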