Filter ASCII characters PHP

action_owl · 03-08-2010, 07:02 PM

How can I filter ASCII single quotes ( ' ) and double quotes ( " ) so that I can replace them with the UTF-8 equivalents?

If I copy text from a Word document (ASCII) and upload it to a web page with PHP, the database (UTF-8) replaces these characters with incorrect character(s).

I need some function that will replace these characters but I don't know how to differentiate the ASCII quotes and the UTF-8 Quotes without (somehow) converting the string to hex, then preg_replace'ing the hex code for the symbol.

nadroj · 03-08-2010, 08:27 PM

Quote:

Originally Posted by http://en.wikipedia.org/wiki/UTF-8

UTF-8 [...] is able to represent any character in the Unicode standard, yet is backwards compatible with ASCII.

So the ASCII character set is a subset of UTF-8. So any ASCII character basically can be considered a UTF-8 character without any modification.
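
A minimal PHP sketch of that point: the straight quotes are single bytes below 0x80, so they are already valid UTF-8, whereas the curly quotes Word tends to produce are multi-byte sequences. The helper name isPlainAscii is made up for illustration.

```php
<?php
// Hypothetical helper: true when every byte of $s is below 0x80 (plain ASCII).
function isPlainAscii($s)
{
    // No /u modifier, so the character class works on raw bytes.
    return preg_match('/^[\x00-\x7F]*\z/', $s) === 1;
}

$straight = "'foo' \"bar\"";               // ASCII quotes only
$curly    = "\xE2\x80\x9Cfoo\xE2\x80\x9D"; // U+201C foo U+201D, encoded as UTF-8

echo bin2hex("'"), ' ', bin2hex('"'), "\n"; // prints: 27 22
var_dump(isPlainAscii($straight));          // bool(true)
var_dump(isPlainAscii($curly));             // bool(false)
```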

Try to copy/paste it from a plain-text editor, like "Notepad" for Windows, or GEdit for Linux (GNOME). Word might be doing something silly.

nadroj · 03-08-2010, 08:39 PM

Alternatively, save your source text file--the one with the 's and "s. Open it with a hex editor to verify that they are the expected values.
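
If a hex editor is not handy, the same check can be sketched in PHP. The function below is an illustrative helper, not something from this thread: it lists each run of non-ASCII bytes in a string together with its byte offset, so you can see exactly what arrived from the form.

```php
<?php
// Hypothetical helper: report every run of bytes >= 0x80 with its offset and hex value.
// Adjacent multi-byte characters are reported together as one run.
function findNonAscii($s)
{
    $found = array();
    preg_match_all('/[\x80-\xFF]+/', $s, $m, PREG_OFFSET_CAPTURE);
    foreach ($m[0] as $match) {
        $found[] = array('offset' => $match[1], 'hex' => bin2hex($match[0]));
    }
    return $found;
}

$pasted = "It\xE2\x80\x99s"; // "It's" with a right single quotation mark (U+2019)
foreach (findNonAscii($pasted) as $hit) {
    echo $hit['offset'], ': ', $hit['hex'], "\n"; // prints: 2: e28099
}
```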

I've never really done any database work in PHP, but PHP may be "sanitizing" your input, for example to help prevent SQL injection attacks. The ' and " characters are often used in this attack, which is why you might see them appear differently once they are in the database as raw values. When you later get (i.e. "SELECT") these values, do they appear normal? If so, then that pretty much confirms that the behaviour you're seeing is due to this automatic security precaution. Maybe there's some field or flag to turn it off, but I doubt you want to do that.

action_owl · 03-09-2010, 12:19 PM

Quote:

Quote:
Originally Posted by http://en.wikipedia.org/wiki/UTF-8
UTF-8 [...] is able to represent any character in the Unicode standard, yet is backwards compatible with ASCII.
So the ASCII character set is a subset of UTF-8. So any ASCII character basically can be considered a UTF-8 character without any modification.

Try to copy/paste it from a plain-text editor, like "Notepad" for Windows, or GEdit for Linux (GNOME). Word might be doing something silly.

That's pretty much what I'm doing now.

Beginning: "foo"

.doc/.docx file(word/open office) -->
gedit -->
clipboard -->
webbrowser -->
input box -->
database

End: (garbage)foo(garbage)

The database is set to: utf8_unicode_ci

nadroj · 03-09-2010, 03:00 PM

Try the suggestions/comments in my second post. i.e., verify that the correct hex values are in the file, check if the DBMS is automatically sanitizing/preventing SQL injection attacks of your input. Refer to that post for details.

action_owl · 04-01-2010, 03:39 PM

This is my solution to the problem in case anyone else needs it.

It converts the string to hex, looks for the offending characters, and converts them to non-offending ones.

Code:

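	// Replaces typographic punctuation pasted from MS Word with plain ASCII
	// by matching the UTF-8 byte sequences in the hex form of the string.
	function charCleaner( $s )
	{
		$patterns = array();
		$replacements = array();

		$s = bin2hex($s);

		// Right single quote U+2019 (e2 80 99) -> ' (27)
		$patterns[0] = '/e28099/i';
		$replacements[0] = '27';

		// Left double quote U+201C (e2 80 9c) -> " (22)
		$patterns[1] = '/e2809c/i';
		$replacements[1] = '22';

		// Right double quote U+201D (e2 80 9d) -> " (22)
		$patterns[2] = '/e2809d/i';
		$replacements[2] = '22';

		// Em dash U+2014 (e2 80 94) -> - (2d)
		$patterns[3] = '/e28094/i';
		$replacements[3] = '2d';

		$s = preg_replace($patterns, $replacements, $s);

		// Convert the hex string back to raw bytes
		return pack("H*", $s);
	}

	echo charCleaner($someStringPastedFromMSWord);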
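
One thing to watch with matching on the hex string: a pattern can, in rare cases, match across a byte boundary. For example, the bytes 6e 28 09 c3 contain the hex digits e2809c starting at an odd offset. A possible alternative, sketched here under the assumption that the input is already UTF-8, is to replace the byte sequences directly. The function name and the extra entries (left single quote, en dash) are additions for illustration, not part of the original solution.

```php
<?php
// Hypothetical alternative: swap UTF-8 "smart" punctuation for ASCII at the byte level.
function replaceSmartPunctuation($s)
{
    $map = array(
        "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
        "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
        "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
        "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
        "\xE2\x80\x93" => '-', // U+2013 en dash
        "\xE2\x80\x94" => '-', // U+2014 em dash
    );
    // strtr() with an array works on whole byte sequences and never
    // re-scans text it has already replaced.
    return strtr($s, $map);
}

echo replaceSmartPunctuation("\xE2\x80\x9Cfoo\xE2\x80\x9D"), "\n"; // prints: "foo"
```

Because the keys are complete byte sequences, a string such as "n(\t\xC3\xA9" (n, parenthesis, tab, é) passes through unchanged.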

graemef · 04-01-2010, 09:12 PM

I don't know what character set your original text is encoded in, but the safest way to convert it is to use the Linux iconv command. I am guessing that you have smart quotes in your text; Unicode has equivalents, U+201C and U+201D (left and right double quotation marks respectively). While I'm sure your code works, it may not pick up every special character that appears in the text, whereas iconv will. For example, smart quotes come in left and right, single and double variations, and the dash in your code could be an em dash, but there is also an en dash (double-width and single-width dashes).

What you need to do is to find out what encoding has been used in the original text.
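
As a rough sketch of that idea in PHP rather than on the command line: the helper below checks whether the input is already valid UTF-8 and otherwise converts it with PHP's iconv() function. Treating non-UTF-8 input as Windows-1252 (CP1252) is purely an assumption for the example; the real source encoding still has to be established. The function name is made up.

```php
<?php
// Hypothetical helper: return $s as UTF-8.
// ASSUMPTION: anything that is not valid UTF-8 is Windows-1252 (CP1252).
function toUtf8($s)
{
    // With the /u modifier, preg_match() fails on invalid UTF-8,
    // so an empty pattern acts as a validity check.
    if (preg_match('//u', $s) === 1) {
        return $s;
    }
    // iconv() returns false if a byte has no mapping in the source charset.
    return iconv('CP1252', 'UTF-8', $s);
}

// 0x93 and 0x94 are the curly double quotes in Windows-1252.
echo bin2hex(toUtf8("\x93foo\x94")), "\n"; // prints: e2809c666f6fe2809d
```

A short Windows-1252 string can occasionally also be valid UTF-8, so this check is a heuristic rather than a guarantee.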

mikejosh · 04-04-2010, 11:56 AM

Hey, I have a question in mind. Sorry if it's irrelevant to this topic.

Is there any way I can run a bash script (which I currently run on Cygwin) on Linux-based web hosting? Please help.

GrapefruiTgirl · 04-04-2010, 11:58 AM

Quote:

Originally Posted by mikejosh

Hey, I have a question in mind. Sorry if it's irrelevant to this topic.

Is there any way I can run a bash script (which I currently run on Cygwin) on Linux-based web hosting? Please help.

You've already hijacked your question onto one other location (here: http://www.linuxquestions.org/questi...21#post3923921 ).

Apologizing for doing this, while admirable, is not the way to go about posting questions.

Please cease doing this, and post your question in one location only, preferably by creating your own thread.

Thank you.
Sasha