LinuxQuestions.org


MicahCarrick 06-05-2007 04:41 PM

PHP XMLReader stops at bad character
 
I have a PHP script which is trying to parse a huge (648,854 records) XML file to populate a MySQL database. I'm using the XMLReader PHP class. The problem I'm having is with bad character(s) at the 36,905th record.

Here's the error:

Code:

XMLReader::read(): /tmp/tmp.xml:377538: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xB0 0x20 0x34 0x35 in /www/parse2.php on line 36
PHP Warning:  XMLReader::read(): 34� 45' WEST in /www/parse2.php on line 36
PHP Warning:  XMLReader::read():  ^ in /www/parse2 on line 36
PHP Warning:  XMLReader::read(): An Error Occured while reading in /www/parse2.php on line 36

As you can see, the bad character is displayed as U+FFFD, the Unicode replacement character. I want to either skip records when this happens during parsing, or replace the bad character in the source XML file using sed or some other script. However, U+FFFD is not actually the character in the file (if I understand correctly) but rather a placeholder for a byte that could not be decoded.

Any ideas?

For testing, I'm essentially using this as the parser script (just counting the records to make sure they can all be parsed):
Code:

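// Open the file named in the error message above
$xml = new XMLReader();
$xml->open('/tmp/tmp.xml');

$count = 0;
echo "Reading";
while ($xml->read())
{
    // Skip the whole Header subtree
    if ($xml->depth == 1 && $xml->name == 'Header') $xml->next();

    // FileDetail: count each RecordDetails start tag (not its end tag)
    else if ($xml->depth == 3 && $xml->name == 'RecordDetails'
             && $xml->nodeType == XMLReader::ELEMENT)
    {
        $count++;
        // Print a progress dot once per 5000 records
        if ($count % 5000 == 0) echo '.';
    }
}
$xml->close();

printf("# Processed %d record details.\n", $count);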


graemef 06-06-2007 09:49 PM

The warning states that the input is not valid UTF-8, which leads me to ask: is it meant to be UTF-8? If so, why has it become corrupt? If not, can you find the right encoding and change the encoding to the correct one?
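
One way to check how widespread the problem is would be to scan the file for lines that do not validate as UTF-8. The following is only a minimal sketch: the function name findInvalidUtf8Lines is hypothetical, and it assumes the mbstring extension is available. Reading line by line is safe here because a newline byte never occurs inside a multi-byte UTF-8 sequence.

```php
<?php
// Hypothetical helper (not from the original thread): return the 1-based
// line numbers of every line in $path that is not valid UTF-8.
// Assumes the mbstring extension is loaded.
function findInvalidUtf8Lines(string $path): array
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $bad = [];
    $lineNo = 0;
    // fgets() streams the file, so a huge XML file is never held in memory at once
    while (($line = fgets($handle)) !== false) {
        $lineNo++;
        if (!mb_check_encoding($line, 'UTF-8')) {
            $bad[] = $lineNo;
        }
    }
    fclose($handle);

    return $bad;
}
```

Calling print_r(findInvalidUtf8Lines('/tmp/tmp.xml')) on the file from the error message should list every offending line, which would show whether it really is just a handful of records.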

MicahCarrick 06-07-2007 07:20 PM

It's for a client. It seems that just a couple of characters in there have the wrong encoding. In fact, the first 36,000 records were fine. So my assumption is that somewhere, somehow, somebody inserted some data with the wrong encoding. However, I am tasked with converting the XML to MySQL.

MicahCarrick 06-07-2007 07:25 PM

Does anybody know of a way to strip out all non-UTF-8 characters from a file? Or possibly a way to do a search and replace, such as replacing "34* 45' WEST" with "34 45' WEST", where the "*" in the search is a wildcard character?
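
A possible approach, sketched here under assumptions rather than as a tested fix: the byte 0xB0 reported in the error is the degree sign in ISO-8859-1, which fits "34° 45' WEST", so the stray bytes may simply be Latin-1. The hypothetical function repairUtf8File below copies the file and re-encodes only the lines that fail UTF-8 validation. It assumes the mbstring extension, and it assumes every invalid line is ISO-8859-1, which has not been confirmed for this data.

```php
<?php
// Hypothetical helper (not from the original thread): copy $src to $dst,
// re-encoding only the lines that are not valid UTF-8.
// ASSUMPTION: the invalid lines are ISO-8859-1 (Latin-1), where 0xB0 is
// the degree sign. Returns the number of lines that were re-encoded.
function repairUtf8File(string $src, string $dst): int
{
    $in = fopen($src, 'rb');
    if ($in === false) {
        throw new RuntimeException("Cannot open $src");
    }
    $out = fopen($dst, 'wb');
    if ($out === false) {
        fclose($in);
        throw new RuntimeException("Cannot open $dst");
    }

    $fixed = 0;
    while (($line = fgets($in)) !== false) {
        if (!mb_check_encoding($line, 'UTF-8')) {
            // Treat the whole line as Latin-1 and convert it to UTF-8
            $line = mb_convert_encoding($line, 'UTF-8', 'ISO-8859-1');
            $fixed++;
        }
        fwrite($out, $line);
    }
    fclose($in);
    fclose($out);

    return $fixed;
}
```

Something like repairUtf8File('/tmp/tmp.xml', '/tmp/tmp-fixed.xml') would then produce a copy to feed to XMLReader (the output path is just an example). One caveat: a line that mixes valid multi-byte UTF-8 with stray Latin-1 bytes would have its valid characters double-encoded, so it is worth inspecting the repaired lines before loading them into MySQL.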


All times are GMT -5.