LinuxQuestions.org
Old 10-30-2009, 04:09 AM   #1
mcasuman
LQ Newbie
 
Registered: Aug 2008
Posts: 3

Rep: Reputation: 0
"file" command not working properly for UTF8 files


I have some UTF-8 files whose first line is blank. The "file" command reports them as data files. If I remove the blank line, it reports them as UTF-8 text files. I need to pick out the displayable text files from a large set of files, which is why I used the "file" command.
Does anyone know how to do this with another command?
Note: I do not have permission to attach files, so I cannot attach a sample file here.
 
Old 10-30-2009, 08:43 AM   #2
neonsignal
Senior Member
 
Registered: Jan 2005
Location: Melbourne, Australia
Distribution: Debian Wheezy (Fluxbox WM)
Posts: 1,368
Blog Entries: 52

Rep: Reputation: 354
Quote:
so I cannot attach a sample file here.
Is it at least possible to supply the first dozen bytes (using hexdump)? What language are the files in?

Quote:
I have some UTF-8 files whose first line is blank. The "file" command reports them as data files. If I remove the blank line, it reports them as UTF-8 text files.
I can't replicate this. I have tried with and without blank lines, and with and without byte order marks, and file identifies the file correctly each time (i.e., it never falls back to 'data').

You haven't given any details about the system (versions etc.). I assume this is on a Linux system?

Quote:
Does anyone know how to do this with another command?
You could make use of strings to look for printable text, or hexdump to look for a byte order mark (if there is one), but file is really the correct tool for the job.
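As a possible alternative when file's magic tests misfire, the sketch below checks whether a file decodes cleanly as UTF-8 by round-tripping it through iconv, and separately looks for a byte order mark with od. This swaps in a different technique from the strings and hexdump suggestions above. The helper names is_utf8, has_bom and list_utf8 are made up for illustration, and the sketch assumes an iconv that exits non-zero on invalid input (as the glibc one does).

```shell
#!/bin/sh
# Sketch only: is_utf8, has_bom and list_utf8 are hypothetical helper names.

# Succeeds if the whole file decodes as UTF-8 (plain ASCII also passes).
# Assumption: iconv exits non-zero when it meets an invalid byte sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Succeeds if the file starts with the UTF-8 byte order mark EF BB BF.
has_bom() {
    [ "$(od -An -tx1 -N 3 "$1" | tr -d ' \n')" = "efbbbf" ]
}

# Print the names of all regular files among the arguments that are valid UTF-8.
list_utf8() {
    for f in "$@"; do
        if [ -f "$f" ] && is_utf8 "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

For example, `list_utf8 *` would print the names of the files in the current directory that pass the check. A file can be valid UTF-8 and still contain control characters, so this is a looser test than file's notion of "text".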
 
Old 11-02-2009, 01:38 AM   #3
mcasuman
LQ Newbie
 
Registered: Aug 2008
Posts: 3

Original Poster
Rep: Reputation: 0
I am working on a Linux system. Here is the full hexdump of the file:
0000000 470a 4f4c 4547 2052 202b 4942 4b52 2045
0000010 4d47 4842 5320 5359 4554 544d 4345 4e48
0000020 4b49 450a 5058 504f 5241 204b 3131 310a
0000030 3235 3633 4a20 4341 424f 4453 524f 0a46
0000040 6547 6d72 6e61 0a79 3032 3930 312d 2d30
0000050 3732 570a 6369 7468 6769 2065 6e49 6f66
0000060 6d72 7461 6f69 656e 206e c366 72bc 7520
0000070 736e 7265 2065 754b 646e 6e65 0a3a 530a
0000080 6369 6568 6872 6965 7374 6164 6574 626e
0000090 c36c 74a4 6574 2072 6567 c36d c3a4 209f
00000a0 6556 6f72 6472 756e 676e 2820 4745 2029
00000b0 724e 202e 3931 3730 322f 3030 2e36 0a0a
00000c0 6553 7268 6720 6565 7268 6574 4420 6d61
00000d0 6e65 7520 646e 4820 7265 6572 2c6e 0a0a
00000e0 6553 7268 6720 6565 7268 6574 2072 754b
00000f0 646e 2c65 7720 7269 6420 6e61 656b 206e
0000100 6849 656e 206e 6562 7473 6e65 2073 c366
0000110 72bc 4920 7268 2065 6542 7473 6c65 756c
0000120 676e 202e 6953 2065 7265 6168 746c 6e65
0000130 6620 bcc3 0a72 656a 6564 2073 7250 646f
0000140 6b75 2074 6965 206e 6953 6863 7265 6568
0000150 7469 6473 7461 6e65 6c62 7461 2c74 6420
0000160 7361 4920 7268 6e65 6e20 7461 6f69 616e
0000170 656c 206e 6547 6573 7a74 6e65 6520 746e
0000180 7073 6972 6863 2e74 0a0a 6144 2073 6953
0000190 6863 7265 6568 7469 6473 7461 6e65 6c62
00001a0 7461 2074 6e65 6874 a4c3 746c 7720 6369
00001b0 7468 6769 2065 6e49 6f66 6d72 7461 6f69
00001c0 656e 206e 757a 206d 6547 7573 646e 6568
00001d0 7469 7373 6863 7475 207a 6e75 0a64 757a
00001e0 2072 7250 646f 6b75 7374 6369 6568 6872
00001f0 6965 2e74 5720 7269 7220 7461 6e65 202c
0000200 6f76 2072 6945 736e 7461 207a 6564 2073
0000210 614d 6574 6972 6c61 2073 6164 2073 6144
0000220 6574 626e 616c 7474 6520 6e69 6567 6568
0000230 646e 7a0a 2075 656c 6573 206e 6e75 2064
0000240 7365 6120 206e 6c61 656c 5020 7265 6f73
0000250 656e 2c6e 6420 6569 6d20 7469 6420 6d65
0000260 5020 6f72 7564 746b 7520 676d 6865 6e65
0000270 7520 646e 6620 bcc3 2072 6573 6e69 6e65
0000280 450a 6e69 6173 7a74 7620 7265 6e61 7774
0000290 726f 6c74 6369 2068 6973 646e 202c 6577
00002a0 7469 7265 757a 656c 7469 6e65 0a2e 530a
00002b0 6c6f 746c 6e65 5320 6569 7520 736e 7265
00002c0 2065 7250 646f 6b75 6574 7720 6965 6574
00002d0 7672 7265 616b 6675 6e65 202c 6973 646e
00002e0 5320 6569 7620 7265 ef70 82ac 6369 7468
00002f0 7465 202c 6849 6572 206e 754b 646e 6e65
0000300 6520 6e69 0a65 6f4b 6970 2065 6964 7365
0000310 7365 5320 6369 6568 6872 6965 7374 6164
0000320 6574 626e 616c 7474 7365 7a20 2075 bcc3
0000330 6562 6c72 7361 6573 2e6e 0a0a 6942 7474
0000340 2065 6168 6562 206e 6953 2065 6556 7372
0000350 c374 6ea4 6e64 7369 202c 6164 7373 6420
0000360 6569 6573 2073 6353 7268 6965 6562 206e
0000370 616d 6373 6968 656e 6c6c 6520 7372 6574
0000380 6c6c 2074 7577 6472 2065 6e75 0a64 6f73
0000390 696d 2074 696e 6863 2074 6e75 6574 7372
00003a0 6863 6972 6265 6e65 6920 7473 0a2e 4d0a
00003b0 7469 6620 6572 6e75 6c64 6369 6568 206e
00003c0 7247 bcc3 9fc3 6e65 490a 7268 4820 7265
00003d0 7473 6c65 656c 2f72 694c 6665 7265 6e61
00003e0 0a74 0a0a
00003e4
 
Old 11-02-2009, 03:04 AM   #4
neonsignal
Senior Member
 
Registered: Jan 2005
Location: Melbourne, Australia
Distribution: Debian Wheezy (Fluxbox WM)
Posts: 1,368
Blog Entries: 52

Rep: Reputation: 354
By coincidence, the first three bytes in the file (0Ah 47h 4Ch = "<lf>GL") are the same as the header that was used on Apple II Binary II files, which is why it is being identified as Apple II binary data. If I remember rightly, Apple used a <cr> as their line separator in text files, so it would not have been ambiguous there (whereas Unix-like systems use <lf>).
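A note for reading the dump in the previous post: hexdump without options prints 16-bit words, and on a little-endian machine the two bytes of each word appear swapped, so the leading 470a is the byte sequence 0Ah 47h. A minimal sketch for printing the leading bytes one at a time with od instead (first_bytes is a hypothetical helper name, not something from this thread):

```shell
#!/bin/sh
# Sketch only: first_bytes is a hypothetical helper name.
# Print the first N bytes of a file as hex, in file order.
#   $1 = file name, $2 = number of bytes (defaults to 3)
first_bytes() {
    # -An drops the offset column, -tx1 prints single bytes, -N limits the length.
    od -An -tx1 -N "${2:-3}" "$1" | tr -d ' \n'
    echo
}
```

For a file that begins with <lf>GL, `first_bytes somefile` prints 0a474c, which is the sequence the magic entry matches.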

The brute force solution is to turn off checking for all the sequences in the magic file:
Code:
file -esoft *
That should be fine if you just want to identify text files. If it doesn't work for all the filetypes you wish to identify, then you can take the more subtle approach of modifying the magic file. Make a copy of the magic file:
Code:
cp /usr/share/file/magic magic
Then remove the line that says:
Code:
0	string		\x0aGL			Binary II (apple ][) data
Then do the file testing using this new magic file:
Code:
file -mmagic *
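
The copy-and-remove steps above could also be scripted rather than done by hand in an editor. The following is only a sketch: strip_binary2_rule is a hypothetical helper name, and the magic file location is taken from the post and may differ (or be a compiled file or a directory) on other systems. If the rule is followed by continuation lines beginning with ">", those would need removing as well.

```shell
#!/bin/sh
# Sketch only: strip_binary2_rule is a hypothetical helper name.
# Copies a magic file from stdin to stdout, dropping the Binary II rule.
strip_binary2_rule() {
    # -F treats \x0aGL as literal text; -v keeps only the lines that do NOT match.
    grep -v -F '\x0aGL'
}

# Usage, with the path from the post above (an assumption for other systems):
#   strip_binary2_rule < /usr/share/file/magic > magic
#   file -mmagic *
```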

Last edited by neonsignal; 11-02-2009 at 05:39 AM.
 
Old 11-03-2009, 01:17 AM   #5
mcasuman
LQ Newbie
 
Registered: Aug 2008
Posts: 3

Original Poster
Rep: Reputation: 0
Quote:
file -esoft *
This command is not working for me. Is there a typing error, by any chance? I think turning off checking of all the sequences will be enough for me.
 
Old 11-03-2009, 02:16 AM   #6
neonsignal
Senior Member
 
Registered: Jan 2005
Location: Melbourne, Australia
Distribution: Debian Wheezy (Fluxbox WM)
Posts: 1,368
Blog Entries: 52

Rep: Reputation: 354
Quote:
This command is not working for me. Is there a typing error, by any chance? I think turning off checking of all the sequences will be enough for me.
There isn't a typo, but we might have different versions of 'file'. I am running Debian Lenny, where file is version 4.26 (as reported by file -v).
 
  


All times are GMT -5. The time now is 08:26 AM.
