How can I filter ASCII single quotes ( ' ) and double quotes ( " ) so that I can replace them with the UTF-8 equivalent?
If I copy text from a Word document (ASCII) and upload it to a web page with PHP, the database (UTF-8) replaces these characters with incorrect character(s).
I need some function that will replace these characters but I don't know how to differentiate the ASCII quotes and the UTF-8 Quotes without (somehow) converting the string to hex, then preg_replace'ing the hex code for the symbol.
Alternatively, save your source text file (the one with the 's and "s) and open it with a hex editor to verify that they are the expected values: 0x27 for ' and 0x22 for " in ASCII.
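If no hex editor is to hand, the same check can be sketched in a few lines of Python. This is only an illustration: the sample bytes and the function name `find_quote_bytes` are made up, and the 0x91-0x94 values assume the text came from Word as Windows-1252, which the thread has not confirmed.

```python
# Hypothetical sample; in practice read the saved file in binary mode,
# e.g. data = open("sample.txt", "rb").read()
data = b'He said "it\'s fine" and \x93left\x94.'

def find_quote_bytes(data: bytes) -> list:
    """Return (offset, hex value) pairs for every byte that looks like a quote."""
    # 0x22 and 0x27 are the ASCII double and single quote.
    # 0x91-0x94 are the curly quotes in Windows-1252 (an assumed source encoding).
    candidates = {0x22, 0x27, 0x91, 0x92, 0x93, 0x94}
    return [(i, f"0x{b:02x}") for i, b in enumerate(data) if b in candidates]

for offset, hex_value in find_quote_bytes(data):
    print(offset, hex_value)
```

If only 0x22 and 0x27 show up, the quotes really are plain ASCII; anything else suggests the file is in some other encoding.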
I've never really done any database work in PHP, but PHP may be "sanitizing" your input, for example to help prevent SQL injection attacks. ' and " characters are often used in this kind of attack, which is why you might see them appear differently once in the database as raw values. When you later get (i.e. "SELECT") these values, do they appear normal? If so, then that pretty much confirms that the behaviour you're seeing is due to this automatic security precaution. Maybe there's some field or flag to turn it off, but I doubt you want to do that.
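The "store it, then SELECT it back" check can be sketched as below. This is a stand-in, not the original setup: it uses Python's built-in SQLite module rather than PHP and whatever DBMS the original poster has, and the table name `posts` and function name `round_trip` are invented for the example. A parameterized query passes the quotes through as data, so no escaping should be visible in the result.

```python
import sqlite3

def round_trip(text: str) -> str:
    """Insert text with a parameterized query and read it straight back."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE posts (body TEXT)")
        # The ? placeholder keeps the quotes out of the SQL text itself.
        conn.execute("INSERT INTO posts (body) VALUES (?)", (text,))
        (stored,) = conn.execute("SELECT body FROM posts").fetchone()
        return stored
    finally:
        conn.close()

original = 'it\'s a "quoted" string'
print(round_trip(original) == original)
```

If the value comes back identical, the quotes were stored correctly and any odd appearance is happening elsewhere, for instance when the page is displayed.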
Quote:
Originally Posted by http://en.wikipedia.org/wiki/UTF-8
UTF-8 [...] is able to represent any character in the Unicode standard, yet is backwards compatible with ASCII.
So the ASCII character set is a subset of UTF-8, and any ASCII character can be treated as a UTF-8 character without modification: its byte value is the same in both encodings.
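A quick way to convince yourself of this, sketched in Python (the function name is just for illustration): encode all 128 ASCII characters as UTF-8 and compare the bytes. The curly quote at the end is included only for contrast, since it is not an ASCII character at all.

```python
def ascii_is_unchanged_in_utf8() -> bool:
    """True if every ASCII byte 0x00-0x7F encodes to itself in UTF-8."""
    all_ascii = bytes(range(128))
    return all_ascii.decode("ascii").encode("utf-8") == all_ascii

# The straight quotes are the same single bytes in both encodings.
straight = "'\""
print(straight.encode("ascii") == straight.encode("utf-8"))

# A curly left double quote (U+201C) needs three bytes in UTF-8.
curly_utf8 = "\u201c".encode("utf-8")
print(curly_utf8.hex())
```

So if the quotes are really ASCII, there is nothing to convert; a problem only arises if they are some other character.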
Try to copy/paste it from a plain-text editor, like "Notepad" for Windows, or GEdit for Linux (GNOME). Word might be doing something silly.
Try the suggestions in my second post, i.e., verify that the correct hex values are in the file, and check whether the DBMS is automatically sanitizing your input to prevent SQL injection attacks. Refer to that post for details.
I don't know what character set your original text is encoded in, but the safest way to convert the text is to use the Linux iconv command. I am guessing that you have smart quotes in your text; Unicode has equivalents at U+201C and U+201D (left and right double quotation mark respectively). Whilst I'm sure your code works, it may not pick up all the special characters that may appear in the text, whereas iconv will. For example, if they are smart quotes, then there are left and right variations of both single and double quotes, and I see a dash in your code which could be an em-dash, but there is also an en-dash (double-width and single-width dashes respectively).
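To show what a whole-text conversion does compared with replacing characters one by one, here is a rough Python sketch. It assumes the source is Windows-1252, which is only a guess at what Word produced; the function name and sample bytes are hypothetical. On the command line the equivalent would be something like `iconv -f WINDOWS-1252 -t UTF-8` with your own input and output files.

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes from Windows-1252 (assumed source) to UTF-8."""
    return raw.decode("cp1252").encode("utf-8")

# Curly double quotes, curly single quotes, an en-dash and an em-dash,
# written as their Windows-1252 byte values.
sample = b"\x93smart\x94 \x91quotes\x92 \x96 \x97"

# Show which Unicode code point each non-ASCII byte turned into.
for ch in sample.decode("cp1252"):
    if ord(ch) > 127:
        print(f"U+{ord(ch):04X}")

print(cp1252_to_utf8(sample))
```

Every special character is handled by the codec's mapping table, so there is no list of replacements to keep up to date.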
What you need to do is to find out what encoding has been used in the original text.
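One crude way to narrow it down, sketched in Python with a made-up function name: check whether the bytes are pure ASCII, then whether they are valid UTF-8, and otherwise fall back to a guess. The Windows-1252 fallback is an assumption based on the text coming from Word; bytes that fail UTF-8 decoding could be in any 8-bit encoding, so treat the result as a hint only.

```python
def guess_encoding(raw: bytes) -> str:
    """Heuristic guess: 'ascii', 'utf-8', or 'cp1252' as an assumed fallback."""
    if all(b < 0x80 for b in raw):
        return "ascii"
    try:
        raw.decode("utf-8")  # strict decoding fails on invalid sequences
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8; Windows-1252 is assumed here, not detected.
        return "cp1252"

print(guess_encoding(b"plain 'text'"))
print(guess_encoding("\u201cquoted\u201d".encode("utf-8")))
print(guess_encoding(b"\x93quoted\x94"))
```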