LinuxQuestions.org
03-08-2010, 07:02 PM   #1
action_owl
Member
 
Registered: Jan 2009
Location: 127.0.0.1
Distribution: Fedora, CentOS, NetBSD
Posts: 115

Rep: Reputation: 17
Filter ASCII characters PHP


How can I filter ASCII quotes ( ' ) and double quotes ( " ) so that I can replace them with the UTF-8 equivalent?

If I copy text from a Word document (ASCII) and upload it to a web page with PHP, the database (UTF-8) replaces these characters with incorrect character(s).

I need some function that will replace these characters, but I don't know how to differentiate the ASCII quotes from the UTF-8 quotes without (somehow) converting the string to hex and then preg_replace'ing the hex code for the symbol.
 
03-08-2010, 08:27 PM   #2
nadroj
Senior Member
 
Registered: Jan 2005
Location: Canada
Distribution: ubuntu
Posts: 2,539

Rep: Reputation: 60
Quote:
Originally Posted by http://en.wikipedia.org/wiki/UTF-8
UTF-8 [...] is able to represent any character in the Unicode standard, yet is backwards compatible with ASCII.
So the ASCII character set is a subset of UTF-8. So any ASCII character basically can be considered a UTF-8 character without any modification.

Try to copy/paste it from a plain-text editor, like "Notepad" for Windows, or GEdit for Linux (GNOME). Word might be doing something silly.
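
A quick way to see this from PHP (just a sketch, the variable names are only for illustration): the plain ' and " are one byte each, and those bytes are the same in ASCII and UTF-8. The curly quotes that Word tends to insert are different characters, and in UTF-8 each one takes three bytes.

```php
<?php
// Plain ASCII quotes: one byte each, identical in ASCII and UTF-8.
$plainSingle = "'";
$plainDouble = '"';
echo bin2hex($plainSingle), "\n"; // 27
echo bin2hex($plainDouble), "\n"; // 22

// Curly ("smart") double quotes written as UTF-8 byte sequences
// (U+201C and U+201D).
$curlyLeft  = "\xE2\x80\x9C";
$curlyRight = "\xE2\x80\x9D";
echo bin2hex($curlyLeft), "\n";  // e2809c
echo bin2hex($curlyRight), "\n"; // e2809d
echo strlen($curlyLeft), "\n";   // 3 (strlen counts bytes, not characters)
```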
 
03-08-2010, 08:39 PM   #3
nadroj
Senior Member
 
Registered: Jan 2005
Location: Canada
Distribution: ubuntu
Posts: 2,539

Rep: Reputation: 60
Alternatively, save your source text file--the one with the 's and "s. Open it with a hex editor to verify that they are the expected values.
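
If a hex editor isn't handy, something like the following could show the same thing for the string your script actually receives. This is only a sketch: nonAsciiRuns is a made-up helper name, and the sample input is an assumption about what a pasted curly-quoted word might look like.

```php
<?php
// Hypothetical helper: return each run of non-ASCII bytes in $s as hex,
// keyed by its byte offset, so the values can be compared with what you expect.
function nonAsciiRuns($s)
{
    $runs = array();
    // No /u modifier, so the pattern works on raw bytes even when $s
    // is not valid UTF-8. \x80-\xFF is everything outside 7-bit ASCII.
    if (preg_match_all('/[\x80-\xFF]+/', $s, $m, PREG_OFFSET_CAPTURE)) {
        foreach ($m[0] as $match) {
            $runs[$match[1]] = bin2hex($match[0]);
        }
    }
    return $runs;
}

// Assumed sample: the word foo wrapped in curly double quotes, as UTF-8 bytes.
print_r(nonAsciiRuns("\xE2\x80\x9Cfoo\xE2\x80\x9D"));
```

A string that contains only plain ' and " gives an empty array, since those are ordinary ASCII bytes.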

I've never really done any database work in PHP, but PHP may be "sanitizing" your input, for example to help prevent SQL injection attacks. The ' and " characters are often used in this kind of attack, which is why you might see them appear differently once they are in the database as raw values. When you later get (i.e. "SELECT") these values, do they appear normal? If so, then that pretty much confirms that the behaviour you're seeing is due to this automatic security precaution. Maybe there's some field or flag to turn it off, but I doubt you want to do that.
 
03-09-2010, 12:19 PM   #4
action_owl
Member
 
Registered: Jan 2009
Location: 127.0.0.1
Distribution: Fedora, CentOS, NetBSD
Posts: 115

Original Poster
Rep: Reputation: 17
Quote:
Quote:
Originally Posted by http://en.wikipedia.org/wiki/UTF-8
UTF-8 [...] is able to represent any character in the Unicode standard, yet is backwards compatible with ASCII.
So the ASCII character set is a subset of UTF-8. So any ASCII character basically can be considered a UTF-8 character without any modification.

Try to copy/paste it from a plain-text editor, like "Notepad" for Windows, or GEdit for Linux (GNOME). Word might be doing something silly.
That's pretty much what I'm doing now.


Beginning: "foo"

.doc/.docx file(word/open office) -->
gedit -->
clipboard -->
webbrowser -->
input box -->
database

End: (garbage)foo(garbage)

The database is set to: utf8_unicode_ci
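
One thing that could be checked at the "input box" step is whether the bytes PHP receives are valid UTF-8 at all, since a UTF-8 column can only store them correctly if they are. A minimal sketch follows; isValidUtf8 is a made-up helper name, and the second sample assumes the text might arrive as Windows-1252, which is a guess.

```php
<?php
// Hypothetical helper: true if $s is a well-formed UTF-8 byte sequence.
// With the /u modifier, preg_match() returns false on invalid UTF-8,
// while the empty pattern matches any valid string (returns 1).
function isValidUtf8($s)
{
    return preg_match('//u', $s) === 1;
}

var_dump(isValidUtf8("\xE2\x80\x9Cfoo\xE2\x80\x9D")); // curly quotes as UTF-8
var_dump(isValidUtf8("\x93foo\x94"));                 // curly quotes as Windows-1252
```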
 
03-09-2010, 03:00 PM   #5
nadroj
Senior Member
 
Registered: Jan 2005
Location: Canada
Distribution: ubuntu
Posts: 2,539

Rep: Reputation: 60
Try the suggestions/comments in my second post, i.e., verify that the correct hex values are in the file, and check whether your input is being automatically sanitized to prevent SQL injection attacks. Refer to that post for details.
 
04-01-2010, 03:39 PM   #6
action_owl
Member
 
Registered: Jan 2009
Location: 127.0.0.1
Distribution: Fedora, CentOS, NetBSD
Posts: 115

Original Poster
Rep: Reputation: 17
This is my solution to the problem in case anyone else needs it.

It converts the string to hex, looks for the offending characters, and replaces them with non-offending ones.
Code:
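	function charCleaner( $s )
	{
		// Hex-encode with a space after every byte ("e2 80 99 ") so that a
		// pattern can only match on a byte boundary, never across two bytes.
		$s = chunk_split(bin2hex($s), 2, ' ');

		$patterns = array();
		$replacements = array();

		// Right single quote (U+2019) -> '
		$patterns[0] = '/e2 80 99 /';
		$replacements[0] = '27 ';

		// Left double quote (U+201C) -> "
		$patterns[1] = '/e2 80 9c /';
		$replacements[1] = '22 ';

		// Right double quote (U+201D) -> "
		$patterns[2] = '/e2 80 9d /';
		$replacements[2] = '22 ';

		// Em dash (U+2014) -> -
		$patterns[3] = '/e2 80 94 /';
		$replacements[3] = '2d ';

		$s = preg_replace($patterns, $replacements, $s);

		// Drop the spaces again and turn the hex back into bytes.
		return pack("H*", str_replace(' ', '', $s));
	}

	echo charCleaner($someStringPastedFromMSWord);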
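
The hex round trip is not strictly required. Assuming the input is already valid UTF-8, the same clean-up can be sketched directly on the bytes with strtr(). The name charCleanerBytes and the extra entries in the map (left single quote, en dash) are additions for illustration, not part of the function above.

```php
<?php
// Sketch: replace common "smart" punctuation with ASCII, working on the
// UTF-8 byte sequences directly. Assumes $s is valid UTF-8.
function charCleanerBytes($s)
{
    $map = array(
        "\xE2\x80\x98" => "'", // U+2018 left single quote
        "\xE2\x80\x99" => "'", // U+2019 right single quote
        "\xE2\x80\x9C" => '"', // U+201C left double quote
        "\xE2\x80\x9D" => '"', // U+201D right double quote
        "\xE2\x80\x93" => '-', // U+2013 en dash
        "\xE2\x80\x94" => '-', // U+2014 em dash
    );
    return strtr($s, $map);
}

echo charCleanerBytes("\xE2\x80\x9Cfoo\xE2\x80\x9D"), "\n"; // "foo"
```

Because no byte of a multi-byte UTF-8 sequence can be mistaken for the start of another character, a plain byte-level replace like this should not match in the middle of a different character, as long as the input really is UTF-8.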
 
04-01-2010, 09:12 PM   #7
graemef
Senior Member
 
Registered: Nov 2005
Location: Hanoi
Distribution: Fedora 13, Ubuntu 10.04
Posts: 2,379

Rep: Reputation: 148
I don't know what character set your original text is encoded in, but the safest way to convert it is to use the Linux iconv command. I am guessing that you have smart quotes in your text; Unicode has equivalents, U+201C and U+201D (left and right double quotation mark respectively). Whilst I'm sure your code works, it may not pick up all the special characters that may appear in the text, whereas iconv will. For example, if they are smart quotes then there are left and right variants of both the single and the double quotes. I also see a dash in your code, which could be an em-dash, but there is also an en-dash (double-width and single-width dashes respectively).

What you need to do is to find out what encoding has been used in the original text.
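
Once the source encoding is known, the conversion can also be done inside the script with PHP's iconv() function rather than the command-line tool. The sketch below assumes the pasted text arrives as Windows-1252 (CP1252), which is only a guess; substitute whatever encoding the text really uses.

```php
<?php
// Assumption: the pasted text is Windows-1252 (CP1252). In that encoding
// the bytes 0x93 and 0x94 are the left and right curly double quotes.
$fromWord = "\x93foo\x94";

// PHP's iconv() function converts a string from one encoding to another
// and returns false on failure.
$utf8 = iconv('CP1252', 'UTF-8', $fromWord);

if ($utf8 === false) {
    echo "conversion failed\n";
} else {
    echo bin2hex($utf8), "\n"; // e2809c666f6fe2809d
}
```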
 
04-04-2010, 11:56 AM   #8
mikejosh
LQ Newbie
 
Registered: Jan 2010
Posts: 7

Rep: Reputation: 0
Hey, I have a question. Sorry if it's irrelevant to this topic.

Can I run a bash script (which I currently run on Cygwin) on Linux-based web hosting? Is there any way? Please help.
 
04-04-2010, 11:58 AM   #9
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 556
Quote:
Originally Posted by mikejosh
Hey, I have a question. Sorry if it's irrelevant to this topic.

Can I run a bash script (which I currently run on Cygwin) on Linux-based web hosting? Is there any way? Please help.
You've already hijacked your question onto one other location (here: http://www.linuxquestions.org/questi...21#post3923921 ).

Apologizing for doing this, while admirable, is not the way to go about posting questions.

Please cease doing this, and post your question in one location only, preferably by creating your own thread.

Thank you.
Sasha
 
All times are GMT -5.