LinuxQuestions.org
06-05-2007, 04:41 PM   #1
MicahCarrick
Member
 
Registered: Jul 2004
Distribution: Fedora
Posts: 241

Rep: Reputation: 31
PHP XMLReader stops at bad character


I have a PHP script which is trying to parse a huge (648,854 records) XML file to populate a MySQL database. I'm using the XMLReader PHP class. The problem I'm having is with bad character(s) at the 36,905th record.

Here's the error:

Code:
XMLReader::read(): /tmp/tmp.xml:377538: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xB0 0x20 0x34 0x35 in /www/parse2.php on line 36
PHP Warning:  XMLReader::read(): 34� 45' WEST in /www/parse2.php on line 36
PHP Warning:  XMLReader::read():   ^ in /www/parse2 on line 36
PHP Warning:  XMLReader::read(): An Error Occured while reading in /www/parse2.php on line 36
As you can see, the bad character is displayed as U+FFFD, the Unicode replacement character. I want to either skip records when this happens during parsing, or replace the bad character in the source XML file using sed or some other script. However, U+FFFD is not the actual character in the file (if I understand correctly) but rather a placeholder for a byte that could not be decoded.
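The `Bytes: 0xB0 0x20 0x34 0x35` part of the error shows what is really in the file: a lone 0xB0 byte, which is not valid UTF-8 on its own. A minimal sketch for listing every line that contains such bytes, assuming the mbstring extension is loaded and PHP 7.4 or later; `findInvalidUtf8Lines` is a made-up helper name, not part of XMLReader:

```php
<?php
// Hypothetical helper: returns [lineNumber => hex dump of the non-ASCII bytes]
// for every line of $path that is not valid UTF-8.
function findInvalidUtf8Lines(string $path): array
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $bad = [];
    $lineNo = 0;
    // Read line by line so a huge file never has to fit in memory.
    while (($line = fgets($handle)) !== false) {
        $lineNo++;
        if (!mb_check_encoding($line, 'UTF-8')) {
            // Collect every byte >= 0x80 on the failing line for inspection.
            preg_match_all('/[\x80-\xFF]/', $line, $m);
            $bad[$lineNo] = implode(' ', array_map(
                fn($b) => sprintf('0x%02X', ord($b)),
                $m[0]
            ));
        }
    }
    fclose($handle);

    return $bad;
}
```

Calling `print_r(findInvalidUtf8Lines('/tmp/tmp.xml'));` should show whether line 377538 is the only problem or just the first one the parser hit.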

Any ideas?

For testing, I'm essentially using this as the parser script (just counting the records to make sure they can all be parsed):
Code:
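// Open the file named in the error message.
$xml = new XMLReader();
$xml->open('/tmp/tmp.xml');

$count = 0;
echo "Reading";
while ($xml->read())
{
    // Skip the whole Header subtree.
    if ($xml->depth == 1 && $xml->name == 'Header') $xml->next();

    // Count each RecordDetails element once (start tag only, not the end tag).
    else if ($xml->depth == 3 && $xml->name == 'RecordDetails'
             && $xml->nodeType == XMLReader::ELEMENT)
    {
        $count++;
        // Print a progress dot once per 5000 records, not once per node.
        if ($count % 5000 == 0) echo '.';
    }
}
$xml->close();

printf("# Processed %d record details.\n", $count);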
 
06-06-2007, 09:49 PM   #2
graemef
Senior Member
 
Registered: Nov 2005
Location: Hanoi
Distribution: Fedora 13, Ubuntu 10.04
Posts: 2,376

Rep: Reputation: 147
The warning states that the input is not valid UTF-8, which leads me to ask: is the file meant to be UTF-8? If so, why has it become corrupt? If not, can you find out what the real encoding is and either declare it or convert the file?
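If the file turns out to be mostly UTF-8 with a few stray bytes from another encoding, converting the whole file would mangle the parts that are already correct. A rough sketch of converting only the stray bytes, assuming they are ISO-8859-1 (a guess: 0xB0 is the degree sign there, which would fit "34° 45' WEST") and that mbstring is available. `repairLine` and `repairFile` are illustrative names:

```php
<?php
// Matches either a well-formed multi-byte UTF-8 sequence (left alone) or
// a single stray byte >= 0x80 (captured in group 1). ASCII is never matched.
const UTF8_OR_STRAY = '/(?:
      [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
    ) | ([\x80-\xFF])/x';

// Re-encode stray bytes from $assumed (an assumption about the data) to UTF-8.
function repairLine(string $line, string $assumed = 'ISO-8859-1'): string
{
    $fixed = preg_replace_callback(
        UTF8_OR_STRAY,
        function (array $m) use ($assumed) {
            // Group 1 is only set when the stray-byte alternative matched.
            return isset($m[1])
                ? mb_convert_encoding($m[1], 'UTF-8', $assumed)
                : $m[0];
        },
        $line
    );
    return $fixed ?? $line;
}

// Copy $in to $out line by line; returns how many lines were changed.
function repairFile(string $in, string $out): int
{
    $src = fopen($in, 'rb');
    $dst = fopen($out, 'wb');
    if ($src === false || $dst === false) {
        throw new RuntimeException('Cannot open input or output file');
    }
    $changed = 0;
    while (($line = fgets($src)) !== false) {
        $fixed = repairLine($line);
        if ($fixed !== $line) {
            $changed++;
        }
        fwrite($dst, $fixed);
    }
    fclose($src);
    fclose($dst);
    return $changed;
}
```

Running `repairFile('/tmp/tmp.xml', '/tmp/tmp.fixed.xml')` and pointing the parser at the new file keeps the original untouched. Returning `''` from the callback instead of converting would drop the stray bytes rather than guess at their meaning.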
 
06-07-2007, 07:20 PM   #3
MicahCarrick
Member
 
Registered: Jul 2004
Distribution: Fedora
Posts: 241

Original Poster
Rep: Reputation: 31
It's for a client. It seems that there are just a couple of characters in there with the wrong encoding. In fact, roughly 36,000 records parsed fine. So my assumption is that somewhere, somehow, somebody inserted some data with the wrong encoding. Either way, I am tasked with converting the XML to MySQL.
 
06-07-2007, 07:25 PM   #4
MicahCarrick
Member
 
Registered: Jul 2004
Distribution: Fedora
Posts: 241

Original Poster
Rep: Reputation: 31
Does anybody know of a way to strip out all non-UTF-8 characters from a file? Or possibly a way to do a search and replace such as "34* 45' WEST" / "34 45' WEST", where the "*" in the search is a wildcard character?
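Both ideas can be sketched in PHP itself. The first function below leans on mbstring's substitute-character setting to drop anything that is not valid UTF-8; the second is the "wildcard" replace, done at byte level so that it only removes a single non-ASCII byte sitting between a digit and the minutes part. The function names are made up for illustration, and the pattern is only a guess at what the bad records look like:

```php
<?php
// Drop every byte sequence that is not valid UTF-8 (requires mbstring).
function stripInvalidUtf8(string $text): string
{
    $previous = mb_substitute_character();   // remember the global setting
    mb_substitute_character('none');         // invalid input produces no output
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);      // restore it for other code
    return $clean;
}

// Remove one stray high byte between a digit and " NN'", e.g. in
// "34<0xB0> 45' WEST". No /u flag: the subject is matched as raw bytes,
// so invalid UTF-8 does not make preg_replace() fail.
function dropStrayByteInBearing(string $text): string
{
    return preg_replace('/(\d)[\x80-\xFF](?= \d+\')/', '$1', $text) ?? $text;
}
```

Either function can be applied line by line with `fgets()` and `fwrite()` to produce a cleaned copy of the file. A correctly encoded degree sign (bytes 0xC2 0xB0) is left alone by both, since the regex only matches a single high byte followed directly by a space.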
 
  


All times are GMT -5.
