LinuxQuestions.org
Old 01-20-2009, 11:19 PM   #1
murugesan
Member
 
Registered: May 2003
Posts: 149

Rep: Reputation: 28
Which function checks whether a character is ASCII or Unicode in C?


According to the following URL:
http://www.codersource.net/win32_unicode_ascii.html
the function IsTextUnicode belongs to the Windows VC++ library.
I would like to know which library or function provides an equivalent facility.
 
Old 01-21-2009, 12:58 AM   #2
PEdroArthur_JEdi
LQ Newbie
 
Registered: Jan 2008
Distribution: Slackware | Debian
Posts: 17

Rep: Reputation: 1
The code reads 80 bytes from the file and tries to determine whether it is UTF-8 encoded... I don't know how to do that, but there is a simple way to check whether the same text is not ASCII.

If you look up the ASCII table, you will see that all values lie in the range 0 to 127. So you could do something like this:

Code:
/* Cast to unsigned char: plain char may be signed, so bytes >= 0x80 could be negative. */
for (i = 0 ; i < 80 ; i++)
	if ((unsigned char)string[i] >= 0x80)
		return NON_ASCII;
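The loop above can be wrapped into a self-contained function. This is only a minimal sketch: the name is_ascii and the explicit length parameter (instead of a fixed 80 bytes) are assumptions made for illustration, not part of any standard library.

```c
#include <stddef.h>

/* Hypothetical helper (not a standard function).
 * Returns 1 if every byte in buf[0..len) is 7-bit ASCII (0x00-0x7F),
 * and 0 as soon as a byte with the high bit set is found. */
static int is_ascii(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        /* Cast first: plain char may be signed, so a byte >= 0x80
         * could otherwise compare as a negative number. */
        if ((unsigned char)buf[i] >= 0x80)
            return 0;
    }
    return 1;
}
```

Called as is_ascii("hello", 5) it should report ASCII, while a buffer holding "caf\xC3\xA9" (the UTF-8 bytes of "café") should not. A result of 0 only says the text is not pure ASCII; it does not say which encoding it is.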
Hope this helps...
 
Old 01-21-2009, 02:36 AM   #3
murugesan
Member
 
Registered: May 2003
Posts: 149

Original Poster
Rep: Reputation: 28
Hi,

I found the "UTF-8 octet sequence" description at the following URL:
http://www.faqs.org/rfcs/rfc3629.html

So I am checking ch&0xF0:
switch(ch&0xF0)
{
	case 0xC0: // UTF-8 octet sequence
	case 0xE0:
	case 0xF0:
		printf("unicode") ;
		break ;
	default:
		printf("ascii") ;
}

Thanks for the reply.

Last edited by murugesan; 01-21-2009 at 02:38 AM. Reason: missed 0xF0 in switch statement
 
Old 01-23-2009, 10:51 PM   #4
graemef
Senior Member
 
Registered: Nov 2005
Location: Hanoi
Distribution: Fedora 13, Ubuntu 10.04
Posts: 2,376

Rep: Reputation: 147
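I'm not convinced that your code snippet will work. A two-byte lead byte such as 0xD0 gives ch&0xF0 == 0xD0, which matches none of your cases, and continuation bytes (10xxxxxx) fall through to the default and are reported as ASCII. UTF-8 uses a number of different byte sequences, depending on the number of bytes required to represent the Unicode character.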

1 byte : 0xxxxxxx  This is the same as 7-bit ASCII
2 bytes: 110xxxxx  Followed by 10xxxxxx
3 bytes: 1110xxxx  Followed by 10xxxxxx 10xxxxxx
4 bytes: 11110xxx  Followed by 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes: 111110xx  Followed by 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes: 1111110x  Followed by 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Where x stands for the bits of the Unicode character in question, and the fixed ones and zeros are required for the encoding to be properly formed. Note that the 5- and 6-byte forms come from the original UTF-8 definition; RFC 3629 (linked earlier in this thread) limits UTF-8 to at most 4 bytes, so those two forms are not valid in current UTF-8.
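The bit patterns listed above can be turned into a lead-byte classifier and a simple structural check. The following is a rough sketch, not a full validator: the names utf8_seq_len and utf8_structure_ok are hypothetical, it is limited to the 1- to 4-byte forms that RFC 3629 allows, and it does not reject overlong encodings, surrogates, or code points above U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical helper: length (1-4) of the sequence announced by a lead
 * byte, or 0 if the byte cannot start a sequence. */
static int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)           return 1;  /* 0xxxxxxx: same as 7-bit ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;  /* continuation byte (10xxxxxx) or a 5/6-byte lead */
}

/* Hypothetical helper: returns 1 if buf[0..len) consists only of lead bytes
 * followed by the right number of 10xxxxxx continuation bytes, else 0.
 * Structural check only; see the caveats in the text above. */
static int utf8_structure_ok(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        int n = utf8_seq_len(buf[i]);
        if (n == 0 || (size_t)n > len - i)
            return 0;                  /* bad lead byte or truncated sequence */
        for (int k = 1; k < n; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;              /* expected a continuation byte */
        i += (size_t)n;
    }
    return 1;
}
```

With this classification, 0xD0 is recognised as a two-byte lead byte, and a stray continuation byte such as 0xA9 is rejected rather than reported as ASCII. Pure ASCII text also passes the check, so it cannot distinguish ASCII from UTF-8 on its own.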
 
  


All times are GMT -5. The time now is 10:44 PM.
