Massively renaming numerous similar files

khermans · 06-28-2004, 07:36 PM

I am trying to read the man pages of gawk and sed to rename a whole slew of files, but I am still having trouble. The problem is that I am a little confused on the syntax. Basically, I have a whole bunch of *.r## and *.p## files, where ## are two digit numbers. I want to change just a few of the characters in the original filenames. See below for example:

This is what they look like:
--snip--
$ ls -a *.p* && ls -a *.r*
smr_DB-ts_1of2_.p01 smr_DB-ts_1of2_.p05 smr_DB-ts_2of2_.p03
smr_DB-ts_1of2_.p02 smr_DB-ts_1of2_.par smr_DB-ts_2of2_.p04
smr_DB-ts_1of2_.p03 smr_DB-ts_2of2_.p01 smr_DB-ts_2of2_.p05
smr_DB-ts_1of2_.p04 smr_DB-ts_2of2_.p02 smr_DB-ts_2of2_.par
smr_DB-ts_1of2_.r00 smr_DB-ts_1of2_.r16 smr_DB-ts_2of2_.r08
smr_DB-ts_1of2_.r01 smr_DB-ts_1of2_.r17 smr_DB-ts_2of2_.r09
smr_DB-ts_1of2_.r02 smr_DB-ts_1of2_.r18 smr_DB-ts_2of2_.r10
smr_DB-ts_1of2_.r03 smr_DB-ts_1of2_.r19 smr_DB-ts_2of2_.r11
smr_DB-ts_1of2_.r04 smr_DB-ts_1of2_.r20 smr_DB-ts_2of2_.r12
smr_DB-ts_1of2_.r05 smr_DB-ts_1of2_.r21 smr_DB-ts_2of2_.r13
smr_DB-ts_1of2_.r06 smr_DB-ts_1of2_.r22 smr_DB-ts_2of2_.r14
smr_DB-ts_1of2_.r07 smr_DB-ts_1of2_.rar smr_DB-ts_2of2_.r15
smr_DB-ts_1of2_.r08 smr_DB-ts_2of2_.r00 smr_DB-ts_2of2_.r16
smr_DB-ts_1of2_.r09 smr_DB-ts_2of2_.r01 smr_DB-ts_2of2_.r17
smr_DB-ts_1of2_.r10 smr_DB-ts_2of2_.r02 smr_DB-ts_2of2_.r18
smr_DB-ts_1of2_.r11 smr_DB-ts_2of2_.r03 smr_DB-ts_2of2_.r19
smr_DB-ts_1of2_.r12 smr_DB-ts_2of2_.r04 smr_DB-ts_2of2_.r20
smr_DB-ts_1of2_.r13 smr_DB-ts_2of2_.r05 smr_DB-ts_2of2_.r21
smr_DB-ts_1of2_.r14 smr_DB-ts_2of2_.r06 smr_DB-ts_2of2_.r22
smr_DB-ts_1of2_.r15 smr_DB-ts_2of2_.r07 smr_DB-ts_2of2_.rar
--snip--

I want the above to be transformed into this below (shortened, but you get the idea that I want to stick in the parentheses where necessary on every file and take out unnecessary characters):
--snip--
(smr)DB-ts(1of2).p01
(smr)DB-ts(1of2).p02
...
(smr)DB-ts(1of2).par
(smr)DB-ts(2of2).p01
(smr)DB-ts(2of2).p02
...
(smr)DB-ts(2of2).par
(smr)DB-ts(1of2).r00
(smr)DB-ts(1of2).r01
...
(smr)DB-ts(1of2).rar
(smr)DB-ts(2of2).r00
(smr)DB-ts(2of2).r01
...
(smr)DB-ts(2of2).rar
--snip--

Any ideas? Would gawk or sed be better for this purpose? I would like to do this directly from the command line. Thanks!

Kristian Hermansen

Dark_Helmet · 06-28-2004, 08:36 PM

You might want to check out the rename command: man rename

You would probably have to do multiple runs to adjust each piece, but it can do what you want.

mikshaw · 06-28-2004, 10:56 PM

You do realize that renaming multi-part RAR and PAR files will make the contents of the archive inaccessible?

khermans · 06-28-2004, 11:40 PM

Quote:

Originally posted by mikshaw
You do realize that renaming multi-part RAR and PAR files will make the contents of the archive inaccessible?

The problem is that they seem to have been renamed by the program that downloaded them, and thus the PAR files could not recover the RAR files since the names had changed! It looks as if somewhere along the line the parentheses "(" and ")" were not escaped correctly when writing the file, and must have defaulted to "_" because of this. I just wanted to know how to rename a whole bunch of files, and this is actually more of a general question than a specific one. I'm going to check out the rename command right now :-)

Kristian Hermansen

khermans · 06-29-2004, 12:22 AM

I don't think the rename command is what I am looking for. It needs to be a bit more complex than this, and keep the same basic filename structure with other characters interspersed, which is why I thought maybe sed or gawk might be useful here. Any ideas?

Kristian Hermansen

Dark_Helmet · 06-29-2004, 12:56 AM

Yes, rename will work:

Code:

rename smr \(smr smr*
rename smr_ smr\) *smr*
rename ts_ ts\( *ts_*
rename _.p \).p *_.p*
rename _.r \).r *_.r*

Those five commands will change every file you listed in the example to match the output you were looking to get.
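
One way to check the five steps before touching any files is to apply the same substitutions to a name and print the result. The sketch below is only an illustration, assuming a POSIX shell with sed; preview_name is a hypothetical helper (not part of rename), and it assumes each rename call above replaces only the first occurrence of its pattern.

```shell
# Hypothetical helper: print what one file name would become after the
# five substitutions, without renaming anything on disk.
preview_name() {
  printf '%s\n' "$1" | sed \
    -e 's/smr/(smr/' \
    -e 's/smr_/smr)/' \
    -e 's/ts_/ts(/' \
    -e 's/_\.p/).p/' \
    -e 's/_\.r/).r/'
}
```

For example, preview_name smr_DB-ts_1of2_.p01 prints (smr)DB-ts(1of2).p01. All five expressions run inside a single sed process.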

khermans · 06-29-2004, 01:01 AM

Quote:

Originally posted by Dark_Helmet
Yes, rename will work:

Code:

rename smr \(smr smr*
rename smr_ smr\) *smr*
rename ts_ ts\( *ts_*
rename _.p \).p *_.p*
rename _.r \).r *_.r*

Those five commands will change every file you listed in the example to match the output you were looking to get.

That's cool, thanks for the tip! Do you also know how to do it with a one line regex command?

Kristian Hermansen

Dark_Helmet · 06-29-2004, 01:09 AM

Not meaning to be overly nosy, but why does it need to be a one-line regex? You could just as easily put those rename commands into a script if you wanted a single command.

khermans · 06-29-2004, 01:27 AM

Quote:

Originally posted by Dark_Helmet
Not meaning to be overly nosy, but why does it need to be a one-line regex? You could just as easily put those rename commands into a script if you wanted a single command.

The problem is that I would be invoking the rename command once for EVERY instance of something I wanted to replace. This can be costly when you're dealing with millions of files in a database that you want to update immediately given some dynamically changing criteria. I want to be able to do this more efficiently, since calling the same program numerous times on the same file would be unnecessary. What if I needed to make one hundred changes, should I call rename 100 times? You can see that a quick regex on a vast array of files might be less expensive (although in some cases probably not, given the regex complexity). Also, the scripting will be much easier if I can say something like "sed s/foo_/foo\)+ blah blah blah" rather than "rename foo_ foo\) foo* && rename blah && rename blah && rename blah && ..." I would also like to know how the equivalent is handled with regexp, since I'm not much of a regex whiz...lol

Kristian Hermansen

Dark_Helmet · 06-29-2004, 02:39 AM

I would avoid a single regex expression at all costs (for maintainability reasons) and break it up to something like this:

Code:

for file in smr_*
do
  mv -v "${file}" "$(echo "${file}" | sed 's/^smr/(smr/' | sed 's/smr_/smr)/' | ... )"
done

If you absolutely must have a one liner, which I'm still not completely convinced is necessary, you could try this:

Code:

for file in smr_DB-ts_*
do
  mv -v "${file}" "$(echo "${file}" | sed 's/^smr_DB-ts_\([0-9]\)of\([0-9]\)_\.\([rp]\)/(smr)DB-ts(\1of\2).\3/')"
done

Like I said, trying to shove it all into one regex might create a maintenance nightmare.
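
If the single-regex loop is used, a few guards make it safer to run. This is a minimal sketch under stated assumptions, not a definitive script: rename_parts is a hypothetical function name, the optional directory argument is an addition for illustration, and file names are assumed to contain no newlines.

```shell
# Hypothetical helper: rename matching files in directory $1
# (default: the current directory) using the single regex.
rename_parts() {
  dir=${1:-.}
  for f in "$dir"/smr_DB-ts_*; do
    [ -e "$f" ] || continue              # the glob matched nothing
    old=${f##*/}                         # strip the directory part
    new=$(printf '%s\n' "$old" |
      sed 's/^smr_DB-ts_\([0-9]\)of\([0-9]\)_\.\([rp]\)/(smr)DB-ts(\1of\2).\3/')
    [ "$new" != "$old" ] || continue     # regex did not match: leave the file alone
    if [ -e "$dir/$new" ]; then          # never overwrite an existing file
      printf 'skipping %s: %s already exists\n' "$old" "$new" >&2
      continue
    fi
    mv -- "$f" "$dir/$new" && printf '%s -> %s\n' "$old" "$new"
  done
}
```

Calling rename_parts with no argument works on the current directory. Files whose names the regex does not change are skipped, so mv is never asked to move a file onto itself.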

khermans · 06-29-2004, 08:08 AM

Quote:

Originally posted by Dark_Helmet
I would avoid a single regex expression at all costs (for maintainability reasons) and break it up to something like this:

Like I said, trying to shove it all into one regex might create a maintenance nightmare.

Maybe you are correct :-) When would you choose to use regex over the multiple rename commands?

Kristian Hermansen

Dark_Helmet · 06-29-2004, 02:22 PM

For this particular situation, I don't think the speed is necessary for a couple of reasons:
1. the archives are inaccessible
2. the number of files comprising an archive
Assuming mikshaw is correct in saying you trash a multi-part rar archive by changing the names of each part, then your archives are toast right now. It doesn't matter if they're in a database or not. A user might be able to query for a list of files that comprise one archive, but they can't do anything to them until they're renamed, right? So there's no change in the situation by performing multiple renames. I might have to do five commands, but the archives can't be any more inaccessible if they're incorrectly named. What I'm getting at is "smr_DB-ts_2of2_.rar" is still just as inaccessible as "(smr_DB-ts_2of2_.rar".

Given the constraints of the filenames, you've said that an archive cannot have more than 100 constituent parts: .rar, .p01, .p02, ... , .p99. Unless there's something about the data files that I'm not aware of, then it should be possible to rename these individual archives one-at-a-time. That is, rename smr_DB-ts_1of2_.*, then rename smr_DB-ts_2of2_.*, etc. I have to assume that one entire archive is distinct and independent of the other archives, meaning that this collection of archives can sustain one archive changing its name at a time. In that case, you're no longer talking about speed performance for "millions" of files, but 100. The performance difference between issuing 5 renames and one big sed is severely diminished on a data set of 100 files versus 1,000,000.

Now, if you had a single archive of some unbelievably large number of pieces (1,000+ is a nice arbitrary number), the data is currently "live" (accessible by users), and it must stay live, then I would look into making some sort of single command to handle the name changes. Actually, I'd probably try to arrange a shutdown time first (30 minutes would probably be more than enough) before going to a complicated regex.
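
For the large-data-set case, one possible shape for such a single command is to feed every name through one sed process instead of starting sed once per file. The sketch below is an illustration only: plan_renames and apply_renames are hypothetical names, file names are assumed to contain no tabs or newlines, and mv is still started once per file, so only the regex work is batched.

```shell
# Hypothetical helper: read file names on stdin and print "old<TAB>new"
# for each name that matches. One sed process handles the whole list.
plan_renames() {
  tab=$(printf '\t')
  sed -n "s/^smr_DB-ts_\([0-9]\)of\([0-9]\)_\.\([rp]..\)\$/&${tab}(smr)DB-ts(\1of\2).\3/p"
}

# Hypothetical helper: read "old<TAB>new" pairs on stdin and rename
# each file, relative to the current directory.
apply_renames() {
  while IFS="$(printf '\t')" read -r old new; do
    mv -- "$old" "$new"
  done
}
```

Running printf '%s\n' smr_* | plan_renames prints the planned pairs for review, and appending | apply_renames performs them. Names that do not match the pattern produce no output and are left untouched.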

khermans · 06-29-2004, 02:34 PM

Good discussion. You are correct that having many more files would show the performance increase, and that for this example it is not pertinent. I did also mention at the beginning of the topic that this was a general case and that the specific example was just for clarity.

Also, the multi-part RARs still extract fine whether or not the name is changed, as long as ALL the files are changed to reflect the new format. The problem was actually that the PAR (parity) files look for specific file names and that they were not finding them! I did use multiple rename commands to test this and everything worked great! Thanks for your help :-)

Kristian Hermansen