[SOLVED] What is the difference between "ASCII English text" and "ASCII text" ???
Although not strictly a Slackware issue, I thought that I would ask here because I'm going to be deploying some copies of the application onto Slackware servers and this particular application finds, for example, even /etc/hosts in "ASCII English text" format so offensive that it borks and barfs.
I tried to start a thread about this and how to convert between one and the other here: http://www.linuxquestions.org/questi...h-text-895358/ but the only answer I got so far was so vague I really didn't even understand what the poster who helped me was trying to convey.
I always get good answers here, so maybe I will again (I hope).
I've never heard of that term in any official capacity, and I'm going to assume it is contrived by the author of your application. ASCII text generally means printable ASCII-encoded text, delimited with newline characters (CR and/or LF), and formatted to be human readable. In computing, 'human-readable' doesn't necessarily mean 'English' (or any other language that can be composed with ASCII text). The hosts file would be a perfect example of such a file: most people would be able to recognize the characters in the file, and even read some host names and most of the comments. To describe any of it as English would be a stretch. Most of what you would find in a phone book would fall in the same category, in my opinion.
I doubt it would be practical to try to convert similar configuration files to something more English-like. They are made to be easily understood by computer programs, which don't know English. Moreover, the English language is so vaguely defined that you would probably have no way to verify the correctness of any such conversion.
Out of curiosity, what kind of application reads a hosts file and chokes on it due to poor English? Some kind of translation tool, such as the one that Google uses to translate web pages?
As far as I can tell, it looks like it just looks for two occurrences of the word 'the'
Code:
gazl@slack:/tmp$ echo the >file && file file
file: ASCII text
gazl@slack:/tmp$ echo the the >file && file file
file: ASCII English text
gazl@slack:/tmp$ echo the then >file && file file
file: ASCII text
So far I haven't found any other words that trigger it. Be interested to know if there are any others.
Code:
gazl@slack:/tmp$ echo "I say old chap, that's just not cricket, what what" >file
gazl@slack:/tmp$ file file
file: ASCII text
That is totally bizarre. Why would duplicating a single (and quite particular) word trigger *file* to return such a result? That's a rhetorical question I think.
Quote:
Originally Posted by GazL
Code:
gazl@slack:/tmp$ echo "I say old chap, that's just not cricket, what what" >file
gazl@slack:/tmp$ file file
file: ASCII text
If that doesn't do it, I don't know what will.
I'm not quite sure I really follow... but I kind of get an idea. Yet the interesting thing is that when I create a file with *vi* (/bin/vi, which is actually a Vim version 7.0.237 executable and NOT a symlink), the file command says it is of type "ASCII text" - like we would expect.
Yet when I create a file with *vim* (/usr/bin/vim, which is actually a completely separate Vim version 7.0.237 executable and NOT a symlink), the file command says it is of type "ASCII English text" - which causes Cadence to freak out if that file happens to be anything that it touches, including /etc/hosts.
Checksumming these two executables of the same Vim version returns different sums, but one of the Vims is 'vi' and the other one is 'vim', so maybe that has something to do with it.
I've tried using unix2dos and then the "tr" command to strip it back to a UNIX file, and that doesn't take the "English" out of the "ASCII English text" returned by the file command once I convert it back to a UNIX file type by stripping the CRs.
Like I've shown before, even doing a "file /etc/*" returns a whole list of both file types, and I would never have noticed if Cadence wasn't being used.
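On the unix2dos round trip mentioned above: converting line endings only adds or removes carriage returns, so the words in the file come out unchanged, and a check that keys on particular words would not be expected to notice. A small sketch of the "strip it back" step, with a made-up helper name (strip_cr) and assuming the DOS-style input uses CR LF line endings:

```shell
#!/bin/sh
# strip_cr: hypothetical helper, named only for this example.
# Converts DOS (CR LF) line endings back to UNIX (LF) by deleting
# every carriage return read from stdin.
strip_cr() {
    tr -d '\r'
}

# A DOS-style line goes in; the same words come out with a bare LF.
printf 'the the\r\n' | strip_cr
```

The text itself is untouched by this, which fits with the result not changing after the conversion.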
Well Gazl, because of what you've been coming up with, I've been testing this on Slackware now too (remember the problem I'm concerned with is on CentOS 5.6), and I'm getting different results than you - sort of.
1.) I tried echo'ing "the the", "the what the", "what the what" and even "the raen in spaen is moastly in the plaens" in the following ways, for example:
Code:
$ echo "the what the" > file
$ file file
file: ASCII English text
$ echo "what the what" > file2
$ file file2
file2: ASCII English text
$ echo "the raen in spaen is moastly in the plaens" > straen.txt
$ file straen.txt
straen.txt: ASCII English text
On Slackware, unlike CentOS, *vi* is /usr/bin/vi instead of /bin/vi, and it isn't Vim, it's Elvis. And on Slackware, it doesn't matter if I create a file with *vi* or *vim* - both return "ASCII text", unlike what the echo commands in my example above do.
But my original questions stand...
1.) What's the difference between "ASCII English text" and "ASCII text" ???
2.) How do I convert a file encoded as "ASCII English text" to "ASCII text" ???
One thing seems to be a common thread though, and everyone who has commented so far has pretty much agreed on it: this is rather weird.
Last edited by astanton; 08-04-2011 at 01:34 AM.
Reason: mentioned that they don't both return the same sum
From the file src/names.h in the file-5.05 source tarball (bolding added):
Code:
/*
* XXX - how should we distinguish Java from C++?
* The trick used in a Debian snapshot, of having "extends" or "implements"
* as tags for Java, doesn't work very well, given that those keywords
* are often preceded by "class", which flags it as C++.
*
* Perhaps we need to be able to say
*
*     If "class" then
*
*         if "extends" or "implements" then
*             Java
*         else
*             C++
*         endif
*
* Or should we use other keywords, such as "package" or "import"?
* Unfortunately, Ada95 uses "package", and Modula-3 uses "import",
* although I infer from the language spec at
*
* http://www.research.digital.com/SRC/m3defn/html/m3.html
*
* that Modula-3 uses "IMPORT" rather than "import", i.e. it must be
* in all caps.
*
* So, for now, we go with "import". We must put it before the C++
* stuff, so that we don't misidentify Java as C++. Not using "package"
* means we won't identify stuff that defines a package but imports
* nothing; hopefully, very little Java code imports nothing (one of the
* reasons for doing OO programming is to import as much as possible
* and write only what you need to, right?).
*
* Unfortunately, "import" may cause us to misidentify English text
* as Java, as it comes after "the" and "The". Perhaps we need a fancier
* heuristic to identify Java?
*/
The code appears to look for two instances of "[Tt]he" to decide whether it is looking at a Java program or not. Look at the names array in the same file immediately after the above quoted comment.
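For anyone who wants to check a file before feeding it to a picky application, the two-"the" observation can be approximated without *file* itself. This is only a rough sketch: the function name count_the is made up for illustration, and the tokenizing (split on anything that is not a letter) is an assumption, not what *file* does internally.

```shell
#!/bin/sh
# count_the: hypothetical helper, not part of file(1).
# Counts whole-word occurrences of "the" or "The" read from stdin.
# Assumption: words are split on any non-letter character.
count_the() {
    # tr turns every run of non-letters into a single newline (one word per line);
    # grep -c -x then counts only the lines that are exactly "the" or "The".
    tr -cs 'A-Za-z' '\n' | grep -c -x -E '[Tt]he'
}

# Usage: each command prints the number of whole-word matches.
echo "the then" | count_the
echo "the what the" | count_the
```

A count of two or more would match the cases reported in this thread as "ASCII English text", but treat that as a guess about one version of *file*, not a rule.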
Here's a thing -- ASCII (generally pronounced ask-eee) is the acronym for American Standard Code for Information Interchange. Goes back to TeleTypes (not, however, to IBM punch card codes -- those are EBCDIC, Extended Binary Coded Decimal Interchange Code). Be eternally thankful you don't have to use EBCDIC for anything but historical interest (well, sorta).
I would think that you can identify English-English versus American-English (yeah, yeah, hang in there for a second) by certain key words that are spelled differently; e.g., colour, color, flavour, flavor, stuff like that. Just a WAG, but makes sense to me.
It just sort of happens that the ASCII code set got to be "standard" because somebody was smart enough to assign characters in alphanumeric order: "control" characters first, followed by punctuation, followed by digits (and some punctuation), followed by upper case alpha (and some specials), followed by lower case alpha and followed by some more specials. Note that the entire code set is 7-bit (0 - 127 decimal); the 8th bit was used for parity checks.
No "special" European languages' characters (umlauts and the like). Oops.
So, the 8th bit started getting used for those characters.
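Since plain ASCII only uses the low 7 bits, one quick way to see whether a file strays outside it is to count the bytes that have the 8th bit set. A rough sketch; count_high_bytes is just a name picked for the example, and it assumes a tr that accepts octal ranges and works byte by byte in the C locale:

```shell
#!/bin/sh
# count_high_bytes: hypothetical helper, named only for this example.
# Deletes every 7-bit byte (octal 000-177) from stdin and counts what is left,
# i.e. the bytes that have the 8th bit set.
count_high_bytes() {
    # LC_ALL=C makes tr treat the input as raw bytes, not multibyte characters.
    # The $(( )) wrapper strips any padding wc may print around the number.
    echo $(( $(LC_ALL=C tr -d '\000-\177' | wc -c) ))
}

# Usage: a result of 0 means every byte is within the 7-bit range.
printf 'plain text\n' | count_high_bytes
```

To check a real file, redirect it in, for example `count_high_bytes < /etc/hosts`.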
From the file src/names.h in the file-5.05 source tarball
....
The code appears to look for 2 instances of "[Tt]he" to decide if it is looking at a java program or not. Look at the names array in the same file immediately after the above quoted comment.
oic...
And what GazL was saying now makes complete sense to me too.
How weird is that? Well, problem solved. It seems Cadence has a problem with a correctly formatted hosts file after all, even though they say you now can have your FQDN in there and not just the simple hostname. The regular "Redhat" way of listing the hostname on the 127 line, although incorrect, is apparently still what Cadence wants, contrary to what they're now saying about their product.
the the the the the the the LOL.
I'm marking this thread as solved, so at least there's something for Google to hit next time someone gets stumped on this [almost but not quite] non-issue.
Solved threads show up on Google. Solved threads also tend to indicate that if you follow a link to that resource it might provide you with a resolution to your questions too.
Google was almost completely devoid of any discussion on this matter, however, so a solved thread showing up in search results might be a bonus for the next person.