LinuxQuestions.org
Old 06-29-2017, 01:10 AM   #1
rblampain
Senior Member
 
Registered: Aug 2004
Location: Western Australia
Distribution: Debian 9.2
Posts: 1,207

Rep: Reputation: 51
question about checking utf-8 textarea content



Visitors are supposed to submit text only in an HTML textarea, but the text can be in any language and needs to be checked for compliance and usability. I have built word lists from the Linux dictionaries (about 50 of them), but there are many more languages for which there is no electronic dictionary (AFAIK).

With my basic understanding of UTF-8, I assumed it would be possible to check that the contents of a textarea are valid text by comparing the words in the textarea to a list of UTF-8 words, or to a list of UTF-8 characters (letters only), in the same language. These are lists I would have to build, since they do not exist, and my first question is:
Is it realistic to build such a list from the fact that the Unicode definition of each character specifies what the character is (letter, punctuation, symbol or whatever)? Extracting the characters specified as "letter" would then give an alphabet against which textarea contents could be checked.
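
As a rough illustration of the idea above, here is a minimal Python sketch that uses the standard unicodedata module to read each character's Unicode general category. The function names letters_in_range and letter_ratio are made up for this example, and treating combining marks (categories starting with "M") as acceptable is an assumption, made because many scripts write vowels or accents as separate marks.

```python
import unicodedata


def letters_in_range(start: int, end: int) -> list[str]:
    """Return every character in [start, end] whose general category is a letter.

    Letter categories all start with "L" (Lu, Ll, Lt, Lm, Lo).
    """
    return [
        chr(cp)
        for cp in range(start, end + 1)
        if unicodedata.category(chr(cp)).startswith("L")
    ]


def letter_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are letters or combining marks."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # "L*" = letters, "M*" = combining marks (assumed acceptable, see lead-in)
    ok = sum(1 for c in chars if unicodedata.category(c)[0] in ("L", "M"))
    return ok / len(chars)


# Hypothetical usage: build the basic Latin uppercase alphabet, then score a string.
latin_upper = letters_in_range(0x0041, 0x005A)
score = letter_ratio("Привет мир")
```

This only tells letters apart from digits, punctuation and symbols; it says nothing about whether the letters form real words in a given language.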

My second question relates to CJK languages (and some others), which do not use an alphabet. There is a large list of CJK "words" or expressions (whatever the scientific name may be) between U+3400 and U+4AC9.
Is it realistic to assume that the contents of a textarea in those languages should correspond to those "words"?
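
One way to test this assumption is to measure what share of the submitted characters falls inside chosen code-point ranges. The sketch below is only illustrative: share_in_ranges and CJK_RANGES are hypothetical names, and the ranges are an assumption, namely the Unicode blocks CJK Unified Ideographs Extension A (U+3400 to U+4DBF, which contains the range mentioned above) and CJK Unified Ideographs (U+4E00 to U+9FFF). Each code point in these blocks is a single ideograph rather than a whole word or expression.

```python
def share_in_ranges(text: str, ranges: list[tuple[int, int]]) -> float:
    """Fraction of non-whitespace characters whose code point lies in any range."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(
        1 for c in chars if any(lo <= ord(c) <= hi for lo, hi in ranges)
    )
    return hits / len(chars)


# Assumed ranges (inclusive): Extension A and the main CJK Unified Ideographs block.
CJK_RANGES = [(0x3400, 0x4DBF), (0x4E00, 0x9FFF)]

# Hypothetical usage: how much of a submission is made of CJK ideographs?
ratio = share_in_ranges("中文 text", CJK_RANGES)
```

Real CJK text also contains punctuation, and Japanese and Korean mix in kana and hangul from other blocks, so a share well below 1.0 can still be perfectly valid text.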

In all cases, the contents should only have to conform within a certain percentage, to allow for misspelling (too much bad spelling would make the contents unusable). Also, CJK contents in a textarea must not be reserved for an elite. I seem to remember, if I am correct, that knowing a certain number of CJK words or expressions is considered "common" by its speakers, while knowing another, larger number is considered "high education"; perhaps those mentioned above are the "common" ones.
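
The percentage idea could be sketched like this in Python. The names known_word_ratio and is_acceptable, the sample word list, and the 0.8 threshold are all placeholders chosen for illustration, not recommended values.

```python
import re


def known_word_ratio(text: str, wordlist: set[str]) -> float:
    """Fraction of words in text that appear in wordlist (case-insensitive)."""
    # In Python 3, \w+ on a str matches runs of Unicode letters, digits and "_".
    words = [w.casefold() for w in re.findall(r"\w+", text)]
    if not words:
        return 0.0
    known = sum(1 for w in words if w in wordlist)
    return known / len(words)


def is_acceptable(text: str, wordlist: set[str], threshold: float = 0.8) -> bool:
    """Accept the text if enough of its words are known; threshold is a guess."""
    return known_word_ratio(text, wordlist) >= threshold


# Hypothetical usage with a tiny, already case-folded word list.
sample_words = {"the", "cat", "sat", "on", "mat"}
ok = is_acceptable("The cat sat on the mat", sample_words)
```

Splitting on word characters like this relies on spaces or punctuation between words, so it would not work as-is for Chinese or Japanese, where words are not separated by spaces.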

Any suggestion most welcome.

Thank you for your help.
 
Old 06-29-2017, 04:35 AM   #2
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, RPi OS, Mint & Android
Posts: 12,873

Rep: Reputation: 1678
I'm no textarea specialist, but I would point you to what I believe to be the world's most translated website and suggest you 'reverse engineer' some of its pages.
http://www.jw.org has pages in nearly 900 languages, with every alphabet type represented.

Personally, I feel you're entering a minefield, and would aim as low as you can get away with if you don't want to give your life to this.
 