bash: search out files by extension, remove spaces, copy elsewhere

schneidz · 02-24-2014, 11:16 AM

thanks for providing. i just tried on a directory tree with about 2,000 images and i am getting those errors too. for you it might be because of the #; not sure what my issue is (or how to correct it).

i think the problem is my command chokes if any filename (image or not) has an apostrophe ' in it... i think it gets interpreted as a close quote.
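
To illustrate the apostrophe theory: when GNU find pastes {} straight into the text of an sh -c script, an apostrophe in the name ends the surrounding single quote early. Below is a minimal sketch, assuming a scratch directory and a made-up file name (neither is from the thread). Passing the name as a positional argument keeps it out of the script text entirely.

```shell
# Scratch directory with one awkward (hypothetical) file name.
demo=$(mktemp -d)
touch "$demo/it's here.jpg"

# Fragile: {} is pasted into the script text, so the apostrophe closes the
# quote early and sh typically reports an unterminated quoted string.
find "$demo" -type f -exec sh -c "printf '%s\n' '{}'" \; 2>/dev/null

# Safer: the script text is fixed and the file name arrives as "$1".
safe=$(find "$demo" -type f -exec sh -c 'printf "%s\n" "$1"' sh {} \;)
printf '%s\n' "$safe"
```

The second form should behave the same way whatever characters the name contains, because the shell never has to re-parse the name as code.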

jamtat · 02-24-2014, 09:03 PM

Hmmm. Interesting possibility but I've scanned the file names of my images and not found any apostrophes. I really understand very poorly how tr works, but is it possible an apostrophe, under certain conditions (say, double-space), might be introduced by it? Thanks.
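
For reference, a small sketch of what tr " " "-" does, using a made-up file name and a hypothetical helper called dashify. tr translates characters one for one, so a double space simply becomes two dashes and no other character is added or changed.

```shell
# tr maps each character of the first set to the matching character of the
# second set; everything else passes through untouched.
dashify() { printf '%s\n' "$1" | tr ' ' '-'; }

dashify "holiday  photo 01.jpg"   # prints holiday--photo-01.jpg
dashify "it's mine.jpg"           # the apostrophe is left as it was
```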

jamtat · 10-30-2014, 02:55 PM

I found a better way, using $RANDOM (in place of $(date +%N)), to alter each symlink name so that files sharing the same name will not overwrite each other (credit to http://www.cyberciti.biz/faq/bash-sh...andom-numbers/). The latest incarnation of this one-liner is

Code:

find `pwd` -exec sh -c "file -i '{}' | grep image.*charset=binary$" \; -exec sh -c 'filename="${0##*/}"; ln -sf "$0" sym/`echo $((RANDOM%900+99))-$filename | tr " " "-"`' {} \;

I can't see why a 3-digit random sequence would not be sufficient for my project, despite the fact that I will be processing thousands of files; maybe a 2-digit sequence would even suffice. In any case, it's easy to increase or decrease the number of digits to one's liking by changing 900+99.
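
A quick check of the range that expression covers: bash documents $RANDOM as an integer from 0 to 32767, so RANDOM%900 spans 0 to 899 and adding 99 gives 99 to 998, which means a two-digit 99 can occasionally turn up. The sketch below assumes bash (a minimal sh may not provide $RANDOM), and the helper names rand3 and rand3_padded are hypothetical.

```shell
# Smallest and largest values of the original expression.
echo "RANDOM%900+99 spans $((0 + 99))..$((899 + 99))"

# Hypothetical helper: always exactly three digits, 100..999.
rand3() { echo $((RANDOM % 900 + 100)); }

# Alternative: zero-padded, 000..999.
rand3_padded() { printf '%03d\n' $((RANDOM % 1000)); }

rand3
rand3_padded
```

Either variant keeps the prefix at a fixed width, which matters if the digits are going to be stripped off again later.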

ntubski · 10-30-2014, 08:40 PM

Why not just use the --backup=numbered option to ln?

As to strange characters in file names, I would suggest trying to avoid the multiple layers of quoting involved when -execing sh.

Code:

find . | file --mime -0f - |
  grep --text 'image/[^;]*; charset=binary$' | 
  cut -d '' -f1 | 
  while read -r file ; do
     ln --backup=numbered -s "$file" "sym/$(basename "$file")" 
  done
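
To show what --backup=numbered does on a name clash, here is a minimal sketch, assuming GNU ln and a scratch directory with made-up file names. When the link name is already taken, the existing link is renamed with a .~1~ suffix before the new one is created, so nothing is overwritten.

```shell
# Scratch area: two different files that share a base name.
demo=$(mktemp -d)
mkdir -p "$demo/a" "$demo/b" "$demo/sym"
touch "$demo/a/cat.jpg" "$demo/b/cat.jpg"

# The second ln finds sym/cat.jpg taken and renames it to cat.jpg.~1~ first.
ln --backup=numbered -s "$demo/a/cat.jpg" "$demo/sym/cat.jpg"
ln --backup=numbered -s "$demo/b/cat.jpg" "$demo/sym/cat.jpg"

ls "$demo/sym"
```

Note that the backup suffix lands at the end of the name, after the extension.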

jamtat · 10-31-2014, 10:22 AM

Thanks for the suggestion, ntubski. I have to say I like the $RANDOM solution better, since it prepends a set and limited number of digits/characters to the front of each file name, making it a simple matter of stripping those off to find the original file name. I'll test out your iteration to see whether it addresses the error message I was occasionally seeing when using the original script. I also discovered when testing out the original script again yesterday, that the image/ part of that iteration was producing some false positives, for example if a file or directory name had the word "image" in it, or if the file was an iso. Again, your input is much appreciated.
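
The false positives can be reproduced without any real files. This sketch feeds grep three made-up lines shaped like file -i output (the paths and the sample() helper are hypothetical) and compares the loose pattern with a narrower one that requires the MIME type itself to start with image/.

```shell
# Simulated `file -i` output lines, made up for illustration.
sample() {
  printf '%s\n' \
    './pics/cat.jpg: image/jpeg; charset=binary' \
    './images/notes.bin: application/octet-stream; charset=binary' \
    './disk.iso: application/x-iso9660-image; charset=binary'
}

# Loose pattern: "image" may match anywhere, including the path.
sample | grep -c 'image.*charset=binary$'                # prints 3

# Narrower: "image/" must directly follow the ": " separator.
sample | grep -c ': *image/[^;]*; charset=binary$'       # prints 1
```

This is still a heuristic, since a file name could itself contain ": image/". Asking file for only the type with --mime-type -b and testing that per file would avoid matching against the name at all.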

ntubski · 10-31-2014, 11:14 AM

Quote:

Originally Posted by jamtat

I can't see why a 3-digit random sequence would not be sufficient for my project, despite the fact that I will be processing thousands of files; maybe a 2-digit sequence would even suffice.

It only depends on how many files with the same name you have. Just a warning: if you have several files all with the same name, the birthday "paradox" applies, so you may need more digits than you think.
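
To put rough numbers on that: if k files share a name and each gets a prefix drawn uniformly from N possibilities, the chance that at least two of them collide is 1 - (1 - 0/N)(1 - 1/N)...(1 - (k-1)/N). The sketch below evaluates this with awk. N=900 corresponds to RANDOM%900, the k values are arbitrary examples, and perfectly uniform draws are an assumption.

```shell
# collision_prob K N: chance that at least two of K same-named files
# receive the same prefix when each prefix is drawn uniformly from N values.
collision_prob() {
  awk -v k="$1" -v n="$2" 'BEGIN {
    p_unique = 1
    for (i = 0; i < k; i++) p_unique *= (1 - i / n)
    printf "%.6f\n", 1 - p_unique
  }'
}

for k in 2 10 40; do
  printf '%s same-named files, 900 prefixes: %s\n' "$k" "$(collision_prob "$k" 900)"
done
```

The total number of files does not enter into it; only the size of the largest group of identical names does.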

jamtat · 10-31-2014, 11:43 AM

Thanks for pointing out the birthday paradox, ntubski. A quick check reveals that I might initially be dealing with something like 15k-20k files (a number later to be drastically reduced once target files within that group have been identified). I'm not so proficient with mathematics, so I'll have to do some further investigation as to whether my 3-digit scheme would suffice, given the total number of files I'll initially be dealing with.

schneidz · 10-31-2014, 11:53 AM

i usually do something like this if i want something to be fairly unique:

Code:

fn=`date +%Y%j%H%M%S%N`

for shiggles: birthday heat map:
http://io9.com/how-common-is-your-birthday-512052896

how is it that july 2nd and 3rd are popular but the 4th of july is almost empty.
conversely feb 14 is popular but feb 13 and 15 not so much ?

jamtat · 10-31-2014, 12:22 PM

This looks like a good alternative as well (replace 3 with some other numeral to decrease/increase the pool):

Code:

echo $(</dev/urandom tr -dc A-Za-z0-9 | head -c3)
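
Wrapped as a function, the same idea might look like the sketch below. The name rand_prefix is hypothetical, and LC_ALL=C is added on the assumption that it keeps tr working on raw bytes regardless of locale. With 62 possible characters per position, 3 characters give 62*62*62 = 238328 combinations, against 900 for the $RANDOM version.

```shell
# rand_prefix N: N random characters drawn from A-Za-z0-9 (62 symbols).
rand_prefix() {
  LC_ALL=C tr -dc 'A-Za-z0-9' </dev/urandom | head -c "$1"
}

echo "$(rand_prefix 3)"
echo "pool size for 3 characters: $((62 * 62 * 62))"
```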

jamtat · 10-31-2014, 03:26 PM

Quote:

Originally Posted by ntubski

Why not just use the --backup=numbered option to ln?

As to strange characters in file names, I would suggest trying to avoid the multiple layers of quoting involved when -execing sh.

Code:

find . | file --mime -0f - |
  grep --text 'image/[^;]*; charset=binary$' | 
  cut -d '' -f1 | 
  while read -r file ; do
     ln --backup=numbered -s "$file" "sym/$(basename "$file")" 
  done

I've fiddled around with your script, ntubski, trying to get an idea of how it works. My first observation is that it does not produce valid symlinks, probably because it is not recording the full path. If I replace the period with `pwd`, however, I do get valid symlinks. I also managed to splice in my echo $(</dev/urandom tr -dc A-Za-z0-9 | head -c3) to see whether I could manage that and got it working. One drawback to your script, unlike a more recent iteration of the one schneidz contributed, is that it leaves extraneous spaces in file names: his latest variant replaces those spaces with dashes. I also need to get rid of hash symbols that appear in some file names, but so far I've managed to do that by running

Code:

for file in *'#'*; do mv -- "$file" "$(printf '%s\n' "$file" | sed 's/#/Num/g')"; done

in the directory where the symlinks have been placed. Anyway, thanks for helping me get a better grasp on how to execute this project.

ntubski · 10-31-2014, 09:30 PM

Quote:

Originally Posted by jamtat

I've fiddled around with your script, ntubski, trying to get an idea of how it works. My first observation is that it does not produce valid symlinks, probably because it is not recording the full path. If I replace the period with `pwd`, however, I do get valid symlinks.

Oh, I guess I didn't fully understand the layout of your files. When I tested here, the symlinks were valid.
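
One possible explanation, offered as an assumption since the thread shows neither directory layout: a relative symlink target is resolved from the directory that holds the link, not from the directory where ln was run. A path like ./pics/cat.jpg stored inside sym/ is therefore looked up as sym/pics/cat.jpg. A minimal sketch with a made-up layout:

```shell
demo=$(mktemp -d)
mkdir -p "$demo/pics" "$demo/sym"
touch "$demo/pics/cat.jpg"
cd "$demo" || exit 1

# Relative target, as produced by `find .`.
ln -s ./pics/cat.jpg sym/relative.jpg
# Absolute target, as produced by `find "$PWD"`.
ln -s "$PWD/pics/cat.jpg" sym/absolute.jpg

for link in sym/relative.jpg sym/absolute.jpg; do
  if [ -e "$link" ]; then echo "$link: ok"; else echo "$link: dangling"; fi
done
```

Here the relative link is reported as dangling and the absolute one as ok.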

Quote:

One drawback to your script, unlike a more recent iteration of the one schneidz contributed, is that it leaves extraneous spaces in file names: his latest variant replaces those spaces with dashes. I also need to get rid of hash symbols that appear in some file names,

I didn't realize that was part of the requirements; it's easily added.

Code:

find "$PWD" | file --mime -0f - |
  grep --text 'image/[^;]*; charset=binary$' |
  cut -d '' -f1 |
  while read -r file ; do
    ln --backup=numbered -s "$file" \
      "sym/$(</dev/urandom tr -dc A-Za-z0-9 | head -c3)-$(basename "$file" | tr ' ' - | sed 's/#/Num/g')"
  done

I left the --backup=numbered in because it has no effect as long as there are no collisions.
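
The renaming part of that pipeline can be tried on its own. In this sketch the helper names sanitize_name and strip_prefix and the sample path are hypothetical; the second helper shows one way to drop the random prefix again, assuming the prefix itself never contains a dash.

```shell
# sanitize_name PATH: base name with spaces -> dashes and '#' -> "Num".
sanitize_name() {
  basename "$1" | tr ' ' '-' | sed 's/#/Num/g'
}

# strip_prefix NAME: remove everything up to and including the first dash.
strip_prefix() { printf '%s\n' "${1#*-}"; }

sanitize_name "/home/user/pics/my holiday #3.jpg"   # prints my-holiday-Num3.jpg
strip_prefix "aB3-my-holiday-Num3.jpg"              # prints my-holiday-Num3.jpg
```

Stripping the prefix gives back the sanitized name, not the original one, since the space and # substitutions are not reversible.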