LinuxQuestions.org


mcasuman 10-30-2009 03:09 AM

"file" command not working properly for UTF8 files
 
I have some UTF-8 files whose first line is blank. The "file" command reports them as data, but if I remove the blank line it reports them as UTF-8 text. I need to pick out the displayable text files from a large set of files, which is why I am using the "file" command.
Does anyone know how to do this with another command?
Note: I do not have permission to attach files, so I cannot attach a sample file here.

neonsignal 10-30-2009 07:43 AM

Quote:

so I cannot attach a sample file here.
Is it at least possible to supply the first dozen bytes (using hexdump)? What language are the files in?

Quote:

I have some UTF-8 files whose first line is blank. The "file" command reports them as data, but if I remove the blank line it reports them as UTF-8 text.
I can't replicate this. I have tried with and without blank lines, and with and without byte order marks, and file identifies the files correctly every time (i.e., it never falls back to 'data').

You haven't given any details about the system (versions, etc.). I assume this is on a Linux system?

Quote:

Does anyone know how to do this with another command?
You could make use of strings to look for printable text, or hexdump to look for a byte order mark (if there is one), but file is really the correct tool for the job.
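
As a rough sketch of the hexdump idea, the shell functions below check for a UTF-8 byte order mark and test whether a whole file decodes as UTF-8. The names has_utf8_bom and is_valid_utf8 are made up for illustration, and using iconv for the validation is an assumption, not something suggested in this thread. A clean decode does not prove the file is displayable text, since control characters also decode without error.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative only.

# Succeeds if the file starts with the UTF-8 byte order mark (EF BB BF).
has_utf8_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Succeeds if the whole file decodes as UTF-8
# (iconv exits non-zero on an invalid byte sequence).
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Example: list the files in the current directory that decode as UTF-8.
#   for f in *; do is_valid_utf8 "$f" && echo "$f"; done
```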

mcasuman 11-02-2009 12:38 AM

I am working on a Linux system. Here is the full hexdump of the file:
0000000 470a 4f4c 4547 2052 202b 4942 4b52 2045
0000010 4d47 4842 5320 5359 4554 544d 4345 4e48
0000020 4b49 450a 5058 504f 5241 204b 3131 310a
0000030 3235 3633 4a20 4341 424f 4453 524f 0a46
0000040 6547 6d72 6e61 0a79 3032 3930 312d 2d30
0000050 3732 570a 6369 7468 6769 2065 6e49 6f66
0000060 6d72 7461 6f69 656e 206e c366 72bc 7520
0000070 736e 7265 2065 754b 646e 6e65 0a3a 530a
0000080 6369 6568 6872 6965 7374 6164 6574 626e
0000090 c36c 74a4 6574 2072 6567 c36d c3a4 209f
00000a0 6556 6f72 6472 756e 676e 2820 4745 2029
00000b0 724e 202e 3931 3730 322f 3030 2e36 0a0a
00000c0 6553 7268 6720 6565 7268 6574 4420 6d61
00000d0 6e65 7520 646e 4820 7265 6572 2c6e 0a0a
00000e0 6553 7268 6720 6565 7268 6574 2072 754b
00000f0 646e 2c65 7720 7269 6420 6e61 656b 206e
0000100 6849 656e 206e 6562 7473 6e65 2073 c366
0000110 72bc 4920 7268 2065 6542 7473 6c65 756c
0000120 676e 202e 6953 2065 7265 6168 746c 6e65
0000130 6620 bcc3 0a72 656a 6564 2073 7250 646f
0000140 6b75 2074 6965 206e 6953 6863 7265 6568
0000150 7469 6473 7461 6e65 6c62 7461 2c74 6420
0000160 7361 4920 7268 6e65 6e20 7461 6f69 616e
0000170 656c 206e 6547 6573 7a74 6e65 6520 746e
0000180 7073 6972 6863 2e74 0a0a 6144 2073 6953
0000190 6863 7265 6568 7469 6473 7461 6e65 6c62
00001a0 7461 2074 6e65 6874 a4c3 746c 7720 6369
00001b0 7468 6769 2065 6e49 6f66 6d72 7461 6f69
00001c0 656e 206e 757a 206d 6547 7573 646e 6568
00001d0 7469 7373 6863 7475 207a 6e75 0a64 757a
00001e0 2072 7250 646f 6b75 7374 6369 6568 6872
00001f0 6965 2e74 5720 7269 7220 7461 6e65 202c
0000200 6f76 2072 6945 736e 7461 207a 6564 2073
0000210 614d 6574 6972 6c61 2073 6164 2073 6144
0000220 6574 626e 616c 7474 6520 6e69 6567 6568
0000230 646e 7a0a 2075 656c 6573 206e 6e75 2064
0000240 7365 6120 206e 6c61 656c 5020 7265 6f73
0000250 656e 2c6e 6420 6569 6d20 7469 6420 6d65
0000260 5020 6f72 7564 746b 7520 676d 6865 6e65
0000270 7520 646e 6620 bcc3 2072 6573 6e69 6e65
0000280 450a 6e69 6173 7a74 7620 7265 6e61 7774
0000290 726f 6c74 6369 2068 6973 646e 202c 6577
00002a0 7469 7265 757a 656c 7469 6e65 0a2e 530a
00002b0 6c6f 746c 6e65 5320 6569 7520 736e 7265
00002c0 2065 7250 646f 6b75 6574 7720 6965 6574
00002d0 7672 7265 616b 6675 6e65 202c 6973 646e
00002e0 5320 6569 7620 7265 ef70 82ac 6369 7468
00002f0 7465 202c 6849 6572 206e 754b 646e 6e65
0000300 6520 6e69 0a65 6f4b 6970 2065 6964 7365
0000310 7365 5320 6369 6568 6872 6965 7374 6164
0000320 6574 626e 616c 7474 7365 7a20 2075 bcc3
0000330 6562 6c72 7361 6573 2e6e 0a0a 6942 7474
0000340 2065 6168 6562 206e 6953 2065 6556 7372
0000350 c374 6ea4 6e64 7369 202c 6164 7373 6420
0000360 6569 6573 2073 6353 7268 6965 6562 206e
0000370 616d 6373 6968 656e 6c6c 6520 7372 6574
0000380 6c6c 2074 7577 6472 2065 6e75 0a64 6f73
0000390 696d 2074 696e 6863 2074 6e75 6574 7372
00003a0 6863 6972 6265 6e65 6920 7473 0a2e 4d0a
00003b0 7469 6620 6572 6e75 6c64 6369 6568 206e
00003c0 7247 bcc3 9fc3 6e65 490a 7268 4820 7265
00003d0 7473 6c65 656c 2f72 694c 6665 7265 6e61
00003e0 0a74 0a0a
00003e4

neonsignal 11-02-2009 02:04 AM

By coincidence, the first three bytes of the file (0Ah 47h 4Ch = "<lf>GL") are the same as the header that was used on Apple II Binary II files, which is why the file is being identified as Binary II (apple ][) data rather than as text. If I remember rightly, Apple used <cr> as the line separator in text files, so on that platform the header would not have been ambiguous (whereas Unix-like systems use <lf>).
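
To see the bytes that trigger the match, it may help to dump the start of a file one byte at a time. The sketch below assumes a POSIX shell with head and od; first_bytes_hex and starts_like_binary2 are made-up helper names. Byte-wise output also avoids a confusing aspect of the dump posted above: plain hexdump prints 16-bit words in host byte order, so on a little-endian machine the bytes 0a 47 show up swapped as 470a.

```shell
# Hypothetical helper: print the first N bytes (default 3) of a file as hex.
first_bytes_hex() {
    head -c "${2:-3}" "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

# Succeeds if the file begins with <lf>GL (0a 47 4c), the sequence that
# the Binary II magic entry matches.
starts_like_binary2() {
    [ "$(first_bytes_hex "$1")" = "0a474c" ]
}
```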

The brute-force solution is to turn off checking for all the sequences in the magic file (the -e soft option excludes the magic-file tests):
Code:

file -esoft *
That should be fine if you just want to identify text files. If it doesn't work for all the filetypes you wish to identify, then you can take the more subtle approach of modifying the magic file. Make a copy of the magic file:
Code:

cp /usr/share/file/magic magic
Then remove the line that says:
Code:

0        string                \x0aGL                        Binary II (apple ][) data
Then do the file testing using this new magic file:
Code:

file -mmagic *
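
The copy-and-edit steps above could be scripted roughly as follows. This is only a sketch: make_local_magic is a made-up name, the paths are passed in as arguments, and it assumes the Binary II rule is a single line with no continuation lines (lines beginning with >) after it. It matches on the literal description text, so the tab or space layout of the entry does not matter.

```shell
# Hypothetical helper: copy a magic file, dropping the Binary II rule.
# Usage: make_local_magic SOURCE_MAGIC OUTPUT_MAGIC
make_local_magic() {
    # -F treats the pattern as a fixed string, so "][" is not parsed
    # as a bracket expression; -v keeps every line that does NOT match.
    grep -vF 'Binary II (apple ][) data' "$1" > "$2"
}

# Example, using the path from the post (it may differ on other systems):
#   make_local_magic /usr/share/file/magic ./magic
#   file -m ./magic *
```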

mcasuman 11-03-2009 12:17 AM

Quote:

file -esoft *
This command is not working for me. Is there perhaps a typo in it? Turning off checking of all the sequences would be enough for my purposes.

neonsignal 11-03-2009 01:16 AM

Quote:

This command is not working for me. Is there perhaps a typo in it? Turning off checking of all the sequences would be enough for my purposes.
There isn't a typo, but we might have different versions of 'file'. I am running Debian Lenny, which has file version 4.26 (you can check yours with file -v).

