[SOLVED] What is the difference between "ASCII English text" and "ASCII text" ???

astanton · 08-03-2011, 06:30 PM

Although not strictly a Slackware issue, I thought that I would ask here because I'm going to be deploying some copies of the application onto Slackware servers and this particular application finds, for example, even /etc/hosts in "ASCII English text" format so offensive that it borks and barfs.

I tried to start a thread about this and how to convert between one and the other here: http://www.linuxquestions.org/questi...h-text-895358/ but the only answer I got so far was so vague I really didn't even understand what the poster who helped me was trying to convey.

I always get good answers here, so maybe I will again (I hope).

theNbomr · 08-03-2011, 06:47 PM

I've never heard of that term in any official capacity, and I'm going to assume it is contrived by the author of your application. ASCII text generally means printable ASCII-encoded text, usually delimited with newline characters (CR and/or LF), and formatted to be human readable. In computing, 'human-readable' doesn't necessarily mean 'English' (or any other language that can be composed with ASCII text). The hosts file would be a perfect example of such a file: most people would be able to recognize the characters in the file, and even read some host names and most of the comments. To describe any of it as English would be a stretch. Most of what you would find in a phone book would fall in the same category, in my opinion.
I doubt it would be practical to try to convert similar configuration files to something more English-like. They are made to be easily understood by computer programs, which don't know English. Moreover, the English language is so vaguely defined that you would probably have no way to verify the correctness of any such conversion.
Out of curiosity, what kind of application reads a hosts file and chokes on it due to poor English? Some kind of translation tool, such as the one that Google uses to translate web pages?

--- rod.

jmccue · 08-03-2011, 07:12 PM

Hi astanton

My take is "ASCII English Text" means 7-bit ASCII

John

theNbomr · 08-03-2011, 07:17 PM

Pensez-vous vraiment ce aussi simple? [Do you really think it's that simple?]

--- rod.

moxieman99 · 08-03-2011, 07:40 PM

Quote:

Originally Posted by theNbomr

Pensez-vous vraiment ce aussi simple? [Do you really think it's that simple?]

--- rod.

I love it when you talk dirty.

GazL · 08-03-2011, 07:54 PM

As far as I can tell, it just looks for two occurrences of the word 'the'.

Code:

gazl@slack:/tmp$ echo the >file && file file
file: ASCII text
gazl@slack:/tmp$ echo the the >file && file file
file: ASCII English text
gazl@slack:/tmp$ echo the then >file && file file
file: ASCII text

So far I haven't found any other words that trigger it. Be interested to know if there are any others.

Code:

gazl@slack:/tmp$ echo "I say old chap, that's just not cricket, what what" >file
gazl@slack:/tmp$ file file
file: ASCII text

If that doesn't do it, I don't know what will.

astanton · 08-04-2011, 12:52 AM

Quote:

Originally Posted by theNbomr

I've never heard of that term in any official capacity, and I'm going to assume it is contrived by the author of your application.

The application you're referring to in this instance is:

/usr/bin/file (version 4.17)

Quote:

Originally Posted by theNbomr

Out of curiosity, what kind of application reads a hosts file and chokes on it due to poor English?

I don't know about poor English, but the application[s] you're referring to now is/are the offering[s] found here: http://Cadence.com

astanton · 08-04-2011, 01:05 AM

Quote:

Originally Posted by GazL

As far as I can tell, it just looks for two occurrences of the word 'the'.

Code:

gazl@slack:/tmp$ echo the >file && file file
file: ASCII text
gazl@slack:/tmp$ echo the the >file && file file
file: ASCII English text
gazl@slack:/tmp$ echo the then >file && file file
file: ASCII text

So far I haven't found any other words that trigger it. Be interested to know if there are any others.

That is totally bizarre. Why would duplicating a single (and quite particular) word trigger *file* to return such a result? That's a rhetorical question, I think.

Quote:

Originally Posted by GazL

Code:

gazl@slack:/tmp$ echo "I say old chap, that's just not cricket, what what" >file
gazl@slack:/tmp$ file file
file: ASCII text

If that doesn't do it, I don't know what will.

I'm not quite sure I really follow... but I kind of get an idea. Yet the interesting thing is that when I create a file with *vi* (/bin/vi, which is actually a Vim version 7.0.237 executable and NOT a symlink), the file command says it is of type "ASCII text" - like we would expect.

Yet when I create a file with *vim* (/usr/bin/vim, which is actually a completely separate Vim version 7.0.237 executable and NOT a symlink), the file command says it is of type "ASCII English text" - which causes Cadence to freak out if that file happens to be anything that it touches, including /etc/hosts.

Checksumming these two executables of the same Vim version returns different sums, but one of the Vims is 'vi' and the other one is 'vim', so maybe that has something to do with it.

I've tried running unix2dos on the file and then using the "tr" command to strip it back to a UNIX file (removing the CRs), but that doesn't take the "English" out of the "ASCII English text" returned by the file command.
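For reference, here is a minimal sketch of that round trip. It rests on a few assumptions: awk stands in for unix2dos (in case it isn't installed), and the function names to_dos and to_unix are made up for illustration. It only adds and removes carriage returns, so the words in the file are left untouched.

```shell
# Sketch only: to_dos and to_unix are hypothetical helper names.
# to_dos emulates unix2dos by ending every line with CR LF.
to_dos() {
    awk '{ printf "%s\r\n", $0 }' "$1"
}

# to_unix removes every carriage return, leaving plain LF line endings.
to_unix() {
    tr -d '\r' < "$1"
}
```

Running to_dos on a file and then to_unix on the result should give back the same bytes you started with, which would explain why the file command reports the same type afterwards.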

Like I've shown before, even doing a "file /etc/*" returns a whole list of both file types, and I would never have noticed if Cadence wasn't being used.

Well GazL, because of what you've been coming up with, I've been testing this on Slackware now too (remember the problem I'm concerned with is on CentOS 5.6), and I'm getting different results than you - sort of.

I tried echo'ing "the the", "the what the", "what the what" and even "the raen in spaen is moastly in the plaens" in the following ways, for example:

Code:

$ echo "the what the" > file
$ file file
file: ASCII English text
$ echo "what the what" > file2
$ file file2
file2: ASCII English text
$ echo "the raen in spaen is moastly in the plaens" > straen.txt
$ file straen.txt
straen.txt: ASCII English text

On Slackware, unlike CentOS, *vi* is /usr/bin/vi instead of /bin/vi, and it isn't Vim, it's Elvis. And on Slackware, it doesn't matter if I create a file with *vi* or *vim* - both return "ASCII text", unlike what the echo commands in my example above do.

But my original questions stand...

1.) What's the difference between "ASCII English text" and "ASCII text" ???

2.) How do I convert a file encoded as "ASCII English text" to "ASCII text" ???

One thing seems to be a common thread though: pretty much everyone who has commented so far agrees that this is rather weird.


Richard Cranium · 08-04-2011, 02:29 AM

Well, when in doubt, look at the source.

From the file src/names.h in the file-5.05 source tarball (the last paragraph of the comment is the relevant part):

Code:

/*
 * XXX - how should we distinguish Java from C++?
 * The trick used in a Debian snapshot, of having "extends" or "implements"
 * as tags for Java, doesn't work very well, given that those keywords
 * are often preceded by "class", which flags it as C++.
 *
 * Perhaps we need to be able to say
 *
 *	If "class" then
 *
 *		if "extends" or "implements" then
 *			Java
 *		else
 *			C++
 *	endif
 *
 * Or should we use other keywords, such as "package" or "import"?
 * Unfortunately, Ada95 uses "package", and Modula-3 uses "import",
 * although I infer from the language spec at
 *
 *	http://www.research.digital.com/SRC/m3defn/html/m3.html
 *
 * that Modula-3 uses "IMPORT" rather than "import", i.e. it must be
 * in all caps.
 *
 * So, for now, we go with "import".  We must put it before the C++
 * stuff, so that we don't misidentify Java as C++.  Not using "package"
 * means we won't identify stuff that defines a package but imports
 * nothing; hopefully, very little Java code imports nothing (one of the
 * reasons for doing OO programming is to import as much as possible
 * and write only what you need to, right?).
 *
 * Unfortunately, "import" may cause us to misidentify English text
 * as Java, as it comes after "the" and "The".  Perhaps we need a fancier
 * heuristic to identify Java?
 */

The code appears to look for 2 instances of "[Tt]he" to decide if it is looking at a Java program or not. Look at the names array in the same file immediately after the above quoted comment.
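To make that reading concrete, here is a minimal sketch in shell. It is only an approximation under my own assumptions: the function name count_the is made up, and it splits on whitespace, which may not match how file tokenizes its input. It counts the tokens that are exactly "the" or "The".

```shell
# Sketch only: count_the is a hypothetical helper, not part of file(1).
# Put each whitespace-separated token on its own line, then count the
# lines that are exactly "the" or "The" (-x matches whole lines only).
count_the() {
    tr -s '[:space:]' '\n' < "$1" | grep -c -x -e 'the' -e 'The'
}
```

On GazL's three test files this gives 1, 2 and 1, which lines up with the "ASCII text", "ASCII English text" and "ASCII text" results shown earlier in the thread. It does not account for every result reported here, so treat it as a guess at the heuristic and not a reimplementation.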

tronayne · 08-04-2011, 07:25 AM

Here's a thing -- ASCII (generally pronounced ask-eee) is the acronym for American Standard Code for Information Interchange. Goes back to TeleTypes (not, however, to IBM punch card codes -- those are EBCDIC, Extended Binary Coded Decimal Interchange Code). Be eternally thankful you don't have to use EBCDIC for anything but historical interest (well, sorta).

I would think that you can identify English-English versus American-English (yeah, yeah, hang in there for a second) by certain key words that are spelled differently; e.g., colour, color, flavour, flavor, stuff like that. Just a WAG, but makes sense to me.

Just sort of happens that the ASCII code set got to be "standard" because somebody was smart enough to assign characters in alphanumeric order: "control" characters first, followed by punctuation, followed by digits (and some punctuation), followed by upper case alpha (and some specials), followed by lower case alpha and followed by some more specials. Note that the entire code set is 7-bit (0 - 127 decimal); the 8th bit was used for parity checks.

No "special" European languages' characters (umlauts and the like). Oops.

So, the 8th bit started getting used for those characters.

Anyway, here's the standard 7-bit ASCII code set.

Code:

        Dec     Hex     Octal   Binary          ASCII
        000     000     0000    00000000        NUL     (Ctrl-@)
        001     001     0001    00000001        SOH     (Ctrl-A)
        002     002     0002    00000010        STX     (Ctrl-B)
        003     003     0003    00000011        ETX     (Ctrl-C)
        004     004     0004    00000100        EOT     (Ctrl-D)
        005     005     0005    00000101        ENQ     (Ctrl-E)
        006     006     0006    00000110        ACK     (Ctrl-F)
        007     007     0007    00000111        BEL     (Ctrl-G)
        008     008     0010    00001000        BS      (Ctrl-H)
        009     009     0011    00001001        HT      (Ctrl-I)
        010     00a     0012    00001010        NL      (Ctrl-J)
        011     00b     0013    00001011        VT      (Ctrl-K)
        012     00c     0014    00001100        NP      (Ctrl-L)
        013     00d     0015    00001101        CR      (Ctrl-M)
        014     00e     0016    00001110        SO      (Ctrl-N)
        015     00f     0017    00001111        SI      (Ctrl-O)
        016     010     0020    00010000        DLE     (Ctrl-P)
        017     011     0021    00010001        DC1     (Ctrl-Q)
        018     012     0022    00010010        DC2     (Ctrl-R)
        019     013     0023    00010011        DC3     (Ctrl-S)
        020     014     0024    00010100        DC4     (Ctrl-T)
        021     015     0025    00010101        NAK     (Ctrl-U)
        022     016     0026    00010110        SYN     (Ctrl-V)
        023     017     0027    00010111        ETB     (Ctrl-W)
        024     018     0030    00011000        CAN     (Ctrl-X)
        025     019     0031    00011001        EM      (Ctrl-Y)
        026     01a     0032    00011010        SUB     (Ctrl-Z)
        027     01b     0033    00011011        ESC     (Ctrl-[)
        028     01c     0034    00011100        FS      (Ctrl-\)
        029     01d     0035    00011101        GS      (Ctrl-])
        030     01e     0036    00011110        RS      (Ctrl-^)
        031     01f     0037    00011111        US      (Ctrl-_)
        032     020     0040    00100000        SP
        033     021     0041    00100001        !
        034     022     0042    00100010        "
        035     023     0043    00100011        #
        036     024     0044    00100100        $
        037     025     0045    00100101        %
        038     026     0046    00100110        &
        039     027     0047    00100111        '
        040     028     0050    00101000        (
        041     029     0051    00101001        )
        042     02a     0052    00101010        *
        043     02b     0053    00101011        +
        044     02c     0054    00101100        ,
        045     02d     0055    00101101        -
        046     02e     0056    00101110        .
        047     02f     0057    00101111        /
        048     030     0060    00110000        0
        049     031     0061    00110001        1
        050     032     0062    00110010        2
        051     033     0063    00110011        3
        052     034     0064    00110100        4
        053     035     0065    00110101        5
        054     036     0066    00110110        6
        055     037     0067    00110111        7
        056     038     0070    00111000        8
        057     039     0071    00111001        9
        058     03a     0072    00111010        :
        059     03b     0073    00111011        ;
        060     03c     0074    00111100        <
        061     03d     0075    00111101        =
        062     03e     0076    00111110        >
        063     03f     0077    00111111        ?
        064     040     0100    01000000        @
        065     041     0101    01000001        A
        066     042     0102    01000010        B
        067     043     0103    01000011        C
        068     044     0104    01000100        D
        069     045     0105    01000101        E
        070     046     0106    01000110        F
        071     047     0107    01000111        G
        072     048     0110    01001000        H
        073     049     0111    01001001        I
        074     04a     0112    01001010        J
        075     04b     0113    01001011        K
        076     04c     0114    01001100        L
        077     04d     0115    01001101        M
        078     04e     0116    01001110        N
        079     04f     0117    01001111        O
        080     050     0120    01010000        P
        081     051     0121    01010001        Q
        082     052     0122    01010010        R
        083     053     0123    01010011        S
        084     054     0124    01010100        T
        085     055     0125    01010101        U
        086     056     0126    01010110        V
        087     057     0127    01010111        W
        088     058     0130    01011000        X
        089     059     0131    01011001        Y
        090     05a     0132    01011010        Z
        091     05b     0133    01011011        [
        092     05c     0134    01011100        \
        093     05d     0135    01011101        ]
        094     05e     0136    01011110        ^
        095     05f     0137    01011111        _
        096     060     0140    01100000        `
        097     061     0141    01100001        a
        098     062     0142    01100010        b
        099     063     0143    01100011        c
        100     064     0144    01100100        d
        101     065     0145    01100101        e
        102     066     0146    01100110        f
        103     067     0147    01100111        g
        104     068     0150    01101000        h
        105     069     0151    01101001        i
        106     06a     0152    01101010        j
        107     06b     0153    01101011        k
        108     06c     0154    01101100        l
        109     06d     0155    01101101        m
        110     06e     0156    01101110        n
        111     06f     0157    01101111        o
        112     070     0160    01110000        p
        113     071     0161    01110001        q
        114     072     0162    01110010        r
        115     073     0163    01110011        s
        116     074     0164    01110100        t
        117     075     0165    01110101        u
        118     076     0166    01110110        v
        119     077     0167    01110111        w
        120     078     0170    01111000        x
        121     079     0171    01111001        y
        122     07a     0172    01111010        z
        123     07b     0173    01111011        {
        124     07c     0174    01111100        |
        125     07d     0175    01111101        }
        126     07e     0176    01111110        ~
        127     07f     0177    01111111        DEL
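If you want to check whether a file stays inside the 7-bit range in the table above, one quick way (a sketch, not the only way) is to delete every byte from octal 000 to 177 and count what is left. The function name count_non_ascii is just something made up for illustration.

```shell
# Sketch only: count_non_ascii is a hypothetical helper name.
# LC_ALL=C makes tr work on raw bytes; every 7-bit byte (octal 000-177)
# is deleted, so wc -c counts only bytes with the 8th bit set.
count_non_ascii() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

A result of 0 means every byte in the file is 7-bit ASCII. Anything else means the 8th bit is in use somewhere, for example in an accented character.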

Hope this helps some.

astanton · 08-04-2011, 09:37 AM

Quote:

Originally Posted by Richard Cranium

Well, when in doubt, look at the source.

From the file src/names.h in the file-5.05 source tarball

....

The code appears to look for 2 instances of "[Tt]he" to decide if it is looking at a Java program or not. Look at the names array in the same file immediately after the above quoted comment.

oic...

And what GazL was saying now makes complete sense to me too.

How weird is that? Well, problem solved. It seems Cadence has a problem with a correctly formatted hosts file after all, even though they say you now can have your FQDN in there and not just the simple hostname. The regular "Redhat" way of listing the hostname on the 127 line, although incorrect, is apparently still what Cadence wants, contrary to what they're now saying about their product.

the the the the the the the LOL.

I'm marking this thread as solved, so at least there's something for Google to hit next time someone gets stumped on this [almost but not quite] non-issue.

Thanks everyone!


MTK358 · 08-04-2011, 10:22 AM

Quote:

Originally Posted by astanton

I'm marking this thread as solved, so at least there's something for Google to hit next time someone gets stumped on this [almost but not quite] non-issue.

Why would solved threads not show up on Google?

astanton · 08-04-2011, 10:21 PM

Solved threads show up on Google. Solved threads also tend to indicate that if you follow a link to that resource it might provide you with a resolution to your questions too.

Google was almost completely devoid of any discussion on this matter, however, so a solved thread showing up in search results might be a bonus for the next person.

MTK358 · 08-05-2011, 06:14 AM

I misread your previous post. I thought it said that you were not marking it as solved, so that it would still show up on Google.