LinuxQuestions.org


SilversleevesX 01-21-2010 03:58 AM

Script to rename files according to filetype
 
I get a bit lazy using cUrl, often letting downloads go without adding the proper extension to images and whatnot. Yes I know about the -O flag, but I'm the kind who likes to give my files unique names right off the bat (until they puzzle out a way to "stamp" files with originating URLs across platforms, Classic MacOS Web-browser style, curl -o is the way to go).

So I was hoping someone could help me write a script (bash or python, doesn't matter which) that uses the file command to look inside the files in a single directory and renames the ones that don't have a 3- or 4-letter extension, appending the one that corresponds to the MIME type. This action should only be performed on files with no extension.

I know Linux and Unix don't bother much with extensions. I started home computing on a Mac IIcx -- that kind of intuition is beyond natural for me and very much appreciated. But I, and a few of my friends and family, also use Windows where file extensions are an idiot-proof way, if nothing else good can be said about the two, to get file X to open in application Y and not Q, W or Z (or, worse for some, no application at all).

I've seen a few scripts in different places, but they all seem to have the flaw of renaming files that already have an extension.

Much obliged for any help/pointers in the right direction.

BZT

Agrouf 01-21-2010 04:57 AM

Code:

ls | grep -v "\....." | grep -v "\...." | while read f; do
  mv $f $f.$(file -i $f | sed "s/.*\///g")
done


jschiwal 01-21-2010 05:33 AM

Agrouf: Did you test your script? I always use "echo" before a command like mv to make sure it works ok.
Code:

ls | grep -v "\....." | grep -v "\...." | while read f; do echo  mv $f $f.$(file -i $f | sed "s/.*\///g"); done
mv 020209_christianbale 020209_christianbale.mpeg; charset=binary
mv 111-1112_IMG 111-1112_IMG.jpeg; charset=binary
mv 111-1116_IMG 111-1116_IMG.jpeg; charset=binary
mv tllts_330-12-02-09 tllts_330-12-02-09.mpeg; charset=binary
mv tllts_331-12-09-09 tllts_331-12-09-09.mpeg; charset=binary


Some items are close however. The $f variable needs double quoting.
The results for the tllts podcasts are mpeg instead of mp3 because the MIME type reported for MP3 files is the formal one, audio/mpeg.
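
Both problems in the echoed output, the unquoted $f and the trailing "; charset=binary", can be handled inside the loop. What follows is only a minimal sketch, assuming bash; the names mime_subtype and preview_renames are made up for illustration. It still prints the commands rather than running them, and the MIME subtype is not always a usable extension (MP3 files still come out as "mpeg").

```shell
#!/bin/bash
# Hypothetical helper: reduce "image/jpeg; charset=binary" to "jpeg".
mime_subtype() {
    local mime="${1%%;*}"      # drop "; charset=..." when file prints it
    printf '%s\n' "${mime##*/}"   # drop the "image/" part
}

# Hypothetical dry run: print the mv commands instead of executing them.
preview_renames() {
    local f
    for f in *; do
        [ -f "$f" ] || continue                    # regular files only
        case "$f" in *.???|*.????) continue ;; esac  # already has a 3- or 4-character extension
        printf 'mv "%s" "%s.%s"\n' "$f" "$f" "$(mime_subtype "$(file -b -i -- "$f")")"
    done
}
```

Calling preview_renames in the download directory lists what would be renamed; nothing is changed on disk until the printed commands have been reviewed and run.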

A multi-line sed or awk script would probably work out better.
file okvfi | sed '/JPEG image data/s/\(.*\):.*/mv "\1" "\1.jpg"/'
mv "okvfi" "okvfi.jpg"
file tllts_330-12-02-09 | sed '/MPEG ADTS, layer III/s/\(^[^:]*\):.*$/mv "\1" "\1.mp3"/'
mv "tllts_330-12-02-09" "tllts_330-12-02-09.mp3"

Here I found a mistake in my first line. The description could contain a ":", so I used "\(^[^:]*\):" to match the filename.

So the sed script would look like:
Code:

#n
/JPEG image data/s/\(.*\):.*$/mv "\1" "\1.jpg"/p
/ISO Media, MPEG v4/s/\(^[^:]*\):.*$/mv "\1" "\1.mp4"/p
/MPEG ADTS, layer III/s/\(^[^:]*\):.*$/mv "\1" "\1.mp3"/p

...


Create a line for each mime type and test it. Then add the sed command to a sed script.
Finally create a script and inspect it. I'd also recommend moving renamed files to another directory instead of simply renaming them.

/JPEG image data/s/\(.*\):.*/mv "\1" images\/"\1.jpg"/

Start with a list of files without extensions. I would first generate that list and then run the sed script on it:

find . -type f -not -name "*.*" -print0 | xargs -0 file >descriptions
sed -f rename.sed descriptions >add_extensions.sh

This generates a script like this:
Code:

mv "./photo_2a" "./photo_2a.jpg"
mv "./okatx" "./okatx.jpg"
mv "./Toto - Africa" "./Toto - Africa.mp4"
mv "./okvfi" "./okvfi.jpg"
...


Agrouf 01-21-2010 06:50 AM

Quote:

Originally Posted by jschiwal (Post 3834788)
Agrouf: Did you test your script?

I tested on RHEL 5.4, with file 4.17. It does not output the charset.
I posted the script without thinking too much if it always works or not though. I expect people to test it before using it. I should put that in my signature.
Quote:

I always use "echo" before a command like mv to make sure it works ok.
Very good idea.

Never ever trust a script you find on a forum on the internet. If you don't understand it, don't use it and if you do understand it, test it before using it for real. In any case, always use your brain when dealing with your data. The brain of anonymous people on the internet is not a substitute for your own, never.

neonsignal 01-21-2010 06:57 AM

Unholy one liner.

Code:

find . -type f -regex '.*/[^.]*' -exec sh -c \
'mv "{}" "{}"$(grep $(file -i -b "{}" | grep -o "^[^ ]*") /usr/share/mime/globs | tail -1 | grep -o "[^*]*$")' \;

For each file in the directory or subdirectories that doesn't have a period in the name (find . -type f -regex '.*/[^.]*'), this performs a move command (-exec sh -c 'mv "{}" "{}"ext' \;).

The script for the extension works by grabbing the mime type (file -i -b), stripping any trailing stuff after the space, and looking up the last entry in the globs database to find the extension.

Yes I know, it would all go better with Sed/Awk/Perl.

jschiwal 01-21-2010 08:33 AM

neonsignal:

Code:

mv ./photo_2a ./photo_2a
mv ./okatx ./okatx
mv ./Toto - Africa ./Toto - Africa
mv ./okvfi ./okvfi
mv ./hpr0426 ./hpr0426
mv ./111-1116_IMG ./111-1116_IMG
mv ./photo_7a ./photo_7a
...

I did look at the mime type output, but saw a lot of overlap. 451 out of 867 entries.
Perhaps using -i and separate awk or sed commands would work OK.

Code:

video/mpeg:*.m2t
video/mpeg:*.mp2
video/mpeg:*.mpe
video/mpeg:*.mpeg
video/mpeg:*.mpg
video/mpeg:*.vob
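
One plausible reason the one-liner produced no extension here is that "file -i -b" prints "image/jpeg; charset=binary", so cutting at the first space leaves "image/jpeg;" with the semicolon attached, and that string never matches a globs entry. The sketch below strips the charset part first and then lists every plain "*.ext" entry for the type. The function name exts_for_mime is hypothetical, the default path is the one used in the one-liner above, and the code assumes the "type:glob" line layout shown in the listing.

```shell
#!/bin/bash
# Hypothetical helper: print every extension a globs-style file lists for a
# MIME type, one per line, in file order.
# $1 = MIME string from "file -i -b"; $2 = globs file (default is an assumption).
exts_for_mime() {
    local mime="${1%%;*}"                       # strip "; charset=binary"
    local globs="${2:-/usr/share/mime/globs}"
    awk -F: -v m="$mime" '
        /^#/ { next }                           # skip comment lines
        $1 == m && $2 ~ /^\*\.[A-Za-z0-9]+$/ {  # keep only plain "*.ext" globs
            print substr($2, 2)                 # "*.mpg" becomes ".mpg"
        }' "$globs"
}
```

Because of the overlap, the caller still has to choose: piping the result through head -n 1 or tail -n 1 picks the first or last candidate, and for video/mpeg neither is necessarily the extension you want.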


jschiwal 01-21-2010 09:48 AM

1 Attachment(s)
neonsignal inspired me to generate sed commands from the glob file.

Run it like:
find . -not -name "*.*" -printf "%f\n" | tr '\n' '\0' | xargs -0 file -i | sed -f addext.sed >add_extensions.sh

Comment out extensions you don't want (in addext.sed), such as "jpe" and "jpeg" if you want "jpg" instead.
Reading the generated shell script will clue you in on what you may want to comment out. Look for that extension; what you want to comment out is probably right above it.

SilversleevesX 01-23-2010 02:23 AM

some code from early on the 21st worked, almost.
 
Thanks to you all for putting in the effort to get some kind of script puzzled out to help me with this.

Using jschiwal's short list of MIME-types for my rename.sed file, I executed the find command he suggested, thus:

find . -type f -not -name "*.*" -print0 | xargs -0 file >descriptions

...which gave me a descriptions file that included one line in the list that looked like this:

Code:

./ex----e-31-51407-0:  JPEG image data, JFIF standard 1.01, comment: "CREATOR: gd-jpeg v1.0 (using IJ"
...which my add_extensions.sh script could not act upon, as there was too much information to rename it with a file extension. The generated add_extensions.sh script (since corrected) had as its corresponding line one that ended with the extra "COMMENT:" info instead of a discrete name ending in ".jpg". mv gave an error saying it had no destination (or new name) to apply to the file.

It looks as though, for a few MIME types, some more precise string isolation (sed? awk?) will be necessary. When I run file -i on the command line, it returns data separated by a semicolon and one space:

>> file -i foo.jpg
foo.jpg: image/jpeg; charset=binary

...yet in the descriptions file the format was different, but only in terms of the punctuation (the CLI returned a semicolon; lines in the descriptions file used commas). Fair to call these delimiters? In any case, whichever is easier to parse, the failure-proof way to go about it would be to terminate the "file -i" return after the first entry (either JPEG image data or image/jpeg in my case) and pass that much on to the add_extensions.sh script in the form of a mv command. Can a -print0 | xargs -0 pairing "cut it any finer" than in jschiwal's code, or would we have to go with something else?

Again, thanks for all the help.

BZT
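
The JPEG rule in rename.sed captures the name with a greedy \(.*\):, which runs up to the last colon on the line, so the colons inside the comment field end up in the captured "filename". One way to avoid that is to cut each line down to the name and the first comma-separated field before the rules see it. A minimal sketch, assuming filenames that contain no colon; the function name trim_descriptions is made up for illustration.

```shell
#!/bin/bash
# Hypothetical filter: keep "name: first field" from each line of file output,
# dropping everything from the first comma onwards.
trim_descriptions() {
    sed 's/^\([^:]*\): *\([^,]*\).*$/\1: \2/'
}
```

It would slot in between the two existing commands, for example "find . -type f -not -name "*.*" -print0 | xargs -0 file | trim_descriptions >descriptions", leaving rename.sed unchanged.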

SilversleevesX 01-26-2010 09:56 PM

Reading further (and downloading along with!)...
 
...I notice that the snag described in my last post is overcome with jschiwal's very latest code (post from 21 Jan @ 09:48 AM) as well as his superbly complete addext.sed file. Many thanks for both.

But why "comment out" file extensions you don't want to use; why not simply delete them from the sed file altogether? Of course, you'll want to cp a backup or "twin" the original into another directory. If all else fails, and you lose track of said backups, log on to LQ Forum and download another copy.

But while I'm on the subject: would one use # marks as with most "commenting out" in shell scripts and other files (.bash_profile comes to mind unbidden), or something else?

BZT
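
On the # question: in a sed script a # starts a comment that runs to the end of the line, so a rule can be switched off without deleting it. The #n on the very first line is a special case that suppresses automatic printing, like the -n option. A small sketch with two made-up rules (not taken from addext.sed), one of them commented out:

```shell
#!/bin/bash
# Write a small sed script with hypothetical rules to a temporary file.
script=$(mktemp)
cat > "$script" <<'EOF'
#n
# The next rule is disabled: sed ignores lines that start with "#".
#/image\/jpeg/s/^\([^:]*\):.*$/mv "\1" "\1.jpeg"/p
/image\/jpeg/s/^\([^:]*\):.*$/mv "\1" "\1.jpg"/p
EOF

# Feed it one line in the "file -i" format and show the generated command.
result=$(printf '%s\n' 'photo_2a: image/jpeg; charset=binary' | sed -f "$script")
printf '%s\n' "$result"
rm -f "$script"
```

Only the .jpg rule fires, so a single mv command naming photo_2a.jpg is printed.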

SilversleevesX 01-28-2010 12:30 AM

Some JPEG2000 files may be a problem.
 
I observed that the file command returns two different descriptions for JPEG2000 files created by XnView.

file grut.jp2 returns
Code:

JPEG-2000 Code Stream Bitmap data
...while file -i grut.jp2 gives back
Code:

application/octet-stream; charset=binary
As the command from jschiwal's post (late on the 21st) uses the latter form of file, and "application/octet-stream" is tied to the extension .bin in addext.sed, JPEG2000s will end up with that extension instead of the correct one for that file type.

Again, to this point, I've only observed this with files saved as JPEG2000 in XnView. I'll probably come back to comment or edit this post when I've been able to see what other graphics apps' JP2s "look like" on the command line. If I don't, one can safely assume that JPEG2000s may be a problem, and should be double-checked by some means or method outside the scripts and commands we've been puzzling out in this thread.

BZT

SilversleevesX 02-28-2010 02:34 PM

Reviving the thread to ask a few questions.
 
I got something of a positive answer to the question implied (inferred? never got those right) in my previous post here.


Quote:

Originally Posted by SilversleevesX (Post 3843206)
I observed that the file command returns two different descriptions for JPEG2000 files created by XnView.

file grut.jp2 returns
Code:

JPEG-2000 Code Stream Bitmap data
...while file -i grut.jp2 gives back
Code:

application/octet-stream; charset=binary
<snip>

BZT

To quote (and save you a jump off-forum):
Quote:

Originally Posted by MItaly, IrfanView Forums
The two different outputs for me are due to the fact that the -i switch makes file output a mime string, that should be compatible with most clients; since JPEG2000 isn't widely supported, it's passed simply as a generic octet stream.

Which puts one of the script versions in a somewhat awkward place, since when I look for an entry (in my slightly-trimmed copy of jschiwal's excellent addext.sed) for /application/octet-stream/ , I see that it's matched with the extension .bin. I remember some mention was made of the fact that several MIME types returned by file -i commands returned identical strings, and it's likely this was one of those times.

For myself it's no problem. I generally don't use cUrl anywhere that I'm all too likely to find .bin files (or even .jp2s for that matter). However, I conceived of this script as one for a greater number of users than just myself, so I wanted to know:
  • How likely is it, or how often does it happen, that other L/Unix users run into ".bin" files that aren't already clearly marked as such in an archive or installer?
The less likely or less often, the better, to be blunt about it. This might leave the field open, so to say, to change that MIME entry in addext.sed permanently to ".jp2".

All right, one obstacle out of the way. Now to tackle the annoying libgd comment tag.

BZT

Tinkster 02-28-2010 03:18 PM

Quote:

Originally Posted by SilversleevesX (Post 3880220)
  • How likely is it, or how often does it happen, that other L/Unix users run into ".bin" files that aren't already clearly marked as such in an archive or installer?
The less likely or less often, the better, to be blunt about it. This might leave the field open, so to say, to change that MIME entry in addext.sed permanently to ".jp2".

For what it's worth: I don't have a single JPEG2000 file
among my downloads, but *.deb, *.exe, *.mdi and varied
other data files including passwordsafe data match the
octet-binary combo.


I think you should be taking a two-step approach, tackle
files that have clear matches using the 'file -i' method,
and fall-back to 'file' where octet-binary is reported by
the first.



Cheers,
Tink
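
The two-step idea can be sketched as a lookup that trusts the MIME type when it is specific and consults the plain description only when the type is the generic octet-stream. This is an illustration under assumptions, not a complete table: the names pick_ext and ext_for_file are hypothetical, the handful of mappings is chosen from the types mentioned in this thread, and treating audio/mpeg as MP3 is a simplification.

```shell
#!/bin/bash
# Hypothetical mapping. $1 = output of "file -b -i", $2 = output of "file -b".
# Prints an extension, or returns 1 when the file should be left alone.
pick_ext() {
    local mime="${1%%;*}"      # strip "; charset=binary"
    local desc="$2"
    case "$mime" in
        image/jpeg) echo jpg ;;
        image/png)  echo png ;;
        image/tiff) echo tif ;;
        audio/mpeg) echo mp3 ;;                      # assumption: treat as MP3
        application/octet-stream)
            # Step two: the generic type says nothing, so use the description.
            case "$desc" in
                'JPEG-2000 Code Stream'*) echo jp2 ;;
                *) return 1 ;;
            esac ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper that runs both forms of file on one name.
ext_for_file() {
    pick_ext "$(file -b -i -- "$1")" "$(file -b -- "$1")"
}
```

With this arrangement the .deb, .exe and password-safe files that also report octet-stream are simply skipped instead of being renamed to .bin or .jp2.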

SilversleevesX 02-28-2010 03:31 PM

Found: a way to 'ignore' all comment tags
 
Quote:

Originally Posted by SilversleevesX (Post 3837436)
Thanks to you all for putting in the effort to get some kind of script puzzled out to help me with this.

Using jschiwal's short list of MIME-types for my rename.sed file, I executed the find command he suggested, thus:

find . -type f -not -name "*.*" -print0 | xargs -0 file >descriptions

...which gave me a descriptions file that included one line in the list that looked like this:

Code:

./ex----e-31-51407-0:  JPEG image data, JFIF standard 1.01, comment: "CREATOR: gd-jpeg v1.0 (using IJ"
...which my add_extensions.sh script could not act upon, as there was too much information to rename it with a file extension. The generated add_extensions.sh script (since corrected) had as its corresponding line one that ended with the extra "COMMENT:" info instead of a discrete name ending in ".jpg". mv gave an error saying it had no destination (or new name) to apply to the file.

That was a puzzler, indeed.


Quote:

Originally Posted by SilversleevesX (Post 3837436)
It looks as though, for a few mime-types, some more precise string isolation (sed? awk?) will be necessary.

Neither, as it turns out. file -b and cut -d, -f1 work just as well. And this also eliminates, to some degree, the problem of duplication of mime types turned out by file -i. Except for such things as previous "descriptions" files, so far file -b gives a nice description of every file type I've tested it on. Chops to the folks who keep magic.mgc current!

So anyway, I tried file -b and cut -d, -f1 on the script that went:
Code:

find . -type f -not -name "*.*" -print0 | xargs -0 file -b | cut -d, -f1>descriptions
sed -f rename.sed descriptions >add_extensions.sh

and ended up with a four-line descriptions file, but an empty add_extensions.sh file.

Maybe the rename.sed file, edited again to look like this...
Code:

#n
/JPEG image data/s/\(.*\):.*$/mv "\1" "\1.jpg"/p
/JPEG-2000 Code Stream Bitmap data/s/\(.*\):.*$/mv "\1" "\1.jp2"/p
/PNG image/s/\(.*\):.*$/mv "\1" "\1.png"/p
/TIFF image data/s/\(.*\):.*$/mv "\1" "\1.tif"/p
/ISO Media, MPEG v4/s/\(^[^:]*\):.*$/mv "\1" "\1.mp4"/p
/MPEG ADTS, layer III/s/\(^[^:]*\):.*$/mv "\1" "\1.mp3"/p

...needs a bit of tweaking?

BZT

SilversleevesX 03-01-2010 07:42 AM

Still can't get anything in add_extensions.sh!
 
I thought it might have to do with the filename being missing. The sed file obviously needs an original name (w/o extension) for every file it finds, or else the 'mv' command prescribed in rename.sed doesn't know which file to work on.

I also wondered whether it had to do with the length of the strings in the descriptions file that the script was processing by the rename.sed rules.

So I went back to my previous idea of "fine-tuning" a vanilla "file" command. The following is what I came up with to "trim down" its output to strictly something the script could use without getting tangled by misplaced (read: libgd and some websites') comment headers in JPEG files.
Code:

OLDIFS=$IFS
IFS=$'\ \t\n'
typon=$(file -F : hv7918-861)
nameit=$(echo -ne $typon | cut -d: -f1)
relevant=$(echo -ne $typon | cut -d: -f2)
crucial=$(echo -ne $relevant | cut -d, -f1)
forwardpass=$(echo -ne "$nameit: $crucial")
echo "$forwardpass is ${#forwardpass} characters long."
IFS=$OLDIFS

The variable forwardpass is so named because I think this is as much data as the sed script needs to create commands in add_extensions.sh. In other words, what to pass forward to make that shell script.

Somewhere this code breaks down. I'm pretty sure I'm on the right track, but there's no doubt something's catching me up.

Advice, please?

BZT
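
The same trimming can be done without touching IFS or passing unquoted variables through echo, both of which are easy places for a script like the one above to go wrong (inside $'...' the sequence "\ " is not an escape, so the backslash itself ends up in IFS). A minimal sketch using bash parameter expansion; it reuses the variable names from the post, while the function name forward_pass is made up for illustration and the code assumes the filename contains no colon.

```shell
#!/bin/bash
# Hypothetical helper: turn one line of file output ("name: desc, more, ...")
# into "name: desc" with parameter expansion only.
forward_pass() {
    local typon="$1"
    local nameit="${typon%%:*}"        # text before the first colon
    local relevant="${typon#*:}"       # text after the first colon
    local crucial="${relevant%%,*}"    # text before the first comma
    # Trim the leading blanks that file pads the description with.
    crucial="${crucial#"${crucial%%[![:space:]]*}"}"
    printf '%s: %s\n' "$nameit" "$crucial"
}
```

A call such as forward_pass "$(file hv7918-861)" would then yield a single line for the descriptions file; the quotes around the command substitution are what keep the spaces in the description intact.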

SilversleevesX 03-01-2010 03:23 PM

Worked out the bugs, broke away from "find"
 
1 Attachment(s)
The working and final (I hope) script is a touch slower by reason of not using the find command, but by moving to a for-do-done loop, I was able to introduce the variables from that "play" script in my previous post. The immediate result was a descriptions file that contained the filename and the next two fields returned by file, formatted identically to what was produced by previous versions of the script. The other result was an add_extensions.sh file that actually had commands in it.

To close out this thread, I'm pasting in the contents of my edited rename.sed file and attaching the final script as a text file.

Code:

#n
/JPEG image data/s/\(.*\):.*$/mv "\1" "\1.jpg"/p
/JPEG-2000 Code Stream Bitmap data/s/\(.*\):.*$/mv "\1" "\1.jp2"/p
/PNG image/s/\(.*\):.*$/mv "\1" "\1.png"/p
/TIFF image data/s/\(.*\):.*$/mv "\1" "\1.tif"/p
/ISO Media, MPEG v4/s/\(^[^:]*\):.*$/mv "\1" "\1.mp4"/p
/MPEG ADTS, layer III/s/\(^[^:]*\):.*$/mv "\1" "\1.mp3"/p

Thanks to all the members who helped me get this to a point of (if not at then very near) completion.

BZT


All times are GMT -5.