BASH script optimization for testing a large number of files
I don't know, but I thought it was an interesting problem, so I had a go and came up with this somewhat insane pipeline: not a single loop!
I wouldn't be surprised if there's some sort of subtle error lurking in here, but it seems to work (as long as your filenames don't contain a comma or a newline).
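GazL's actual pipeline wasn't quoted in this thread, so the following is only a hedged reconstruction of the kind of no-loop approach being described: copy every file under $SOURCE whose basename doesn't already exist under $DESTDIR. The paths are illustrative, and like the original it assumes filenames contain no commas or newlines.
Code:
#!/bin/bash
# Hypothetical sketch -- not the original code from this thread.
# Copy every file under $SOURCE whose basename doesn't already exist
# under $DESTDIR, in a single pipeline with no shell loop.
# Breaks on filenames containing a comma or a newline.
SOURCE=/path/to/source    # illustrative
DESTDIR=/path/to/dest     # illustrative

find "$SOURCE" -type f -printf '%f,%p\n' | sort -t, -k1,1 |
    join -t, -v1 - <(find "$DESTDIR" -type f -printf '%f\n' | sort) |
    cut -d, -f2- |
    xargs -d '\n' -I{} cp -- {} "$DESTDIR"
The join -v1 is what does the duplicate test: it passes through only those source lines whose basename field has no match in the sorted list of destination basenames.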
Nice code there, GazL. I'd also suggest using a tab, or some other character that is unlikely to appear in a filename (such as |), as the delimiter, so that filenames containing commas won't be parsed incorrectly. I wonder what the real intention is with the multilevel directories, though. Will the files be copied with the same directory structure or not?
My understanding is that they will all be copied to a single destination directory, and the user will then be left to organise them however they like.
I was thinking that a problem could arise when two unique files with the same filename exist in different source directories. That's fine, though, if it's what's really intended.
My understanding was also that everything was to be dumped into $DESTDIR and hand-sorted from that point.
It's a good point about the comma, konsolebox. I agree \t would have been much better, but I had my CSV head on and it didn't even occur to me. I'll remember that in future. Thanks, good tip.
As for the duplicate-files-with-different-names idea, you could do something similar with sort/uniq if you included an md5sum of each file in the data to be sorted, and sorted/uniq'd on that instead of the base filename. Of course, that would make the scanning process take much, much longer, as all the files would have to be read through to generate the hashes.
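As a rough illustration of that idea (not code from the thread), the content-based scan could hash both trees and let uniq group lines on the digest instead of the name:
Code:
# Illustrative sketch: hash every file under both trees, sort on the
# digest, and group lines whose first 32 characters (the md5 hex
# digest) repeat.  $SOURCE and $DESTDIR are assumed set as above.
find "$SOURCE" "$DESTDIR" -type f -print0 |
    xargs -0 md5sum |
    sort |
    uniq -w32 --all-repeated=separate
Every file has to be read in full to produce its digest, which is exactly why this is so much slower than comparing names.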
Hey GazL. It's always fun to make suggestions when I can.
Quote:
Originally Posted by GazL
As for the duplicate-files-with-different-names idea, you could do something similar with sort/uniq if you included an md5sum of each file in the data to be sorted, and sorted/uniq'd on that instead of the base filename. Of course, that would make the scanning process take much, much longer, as all the files would have to be read through to generate the hashes.
Or, I think it's simpler if they were put into destination directories with the same names as the source directories they came from. There's always a solution for that, but I only want to write it if it's really required. This is why I was asking whether the thread was already solved or not.
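A hedged sketch of that suggestion, using GNU cp's --parents option to recreate the source layout under the destination (paths illustrative):
Code:
# Recreate the source directory structure under $DESTDIR so that equal
# basenames from different source directories can no longer collide.
# Assumes GNU cp (for --parents and -t).
( cd "$SOURCE" &&
  find . -type f -print0 |
      xargs -0 cp --parents -t "$DESTDIR" )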
The problem seems to involve recursively finding all filenames in $DESTDIR, and an optimal solution would do this exactly once. So if the result of this step were saved in a hash table, it should provide the fastest lookup of all $SOURCE filenames in $DESTDIR. I don't think bash can play much of a role in this, but Perl likes hashes...
Hello theNbomr. Sorry, but I think bash can, and the result could be a lot simpler than the Perl equivalent. Also, if that's what's really intended, bash 4.0 already has support for associative arrays (hashes).
It's just that this time the intended implementation is still not clear, and it's not good to write code out of uncertain assumptions.
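For what it's worth, a minimal sketch of that associative-array approach in bash 4 might look like this. Variable names are illustrative, and it assumes the dump-everything-into-$DESTDIR interpretation discussed above.
Code:
#!/bin/bash
# Scan $DESTDIR exactly once, recording each basename as a key in a
# bash 4 associative array, then test every $SOURCE file against it.
declare -A seen

while IFS= read -r -d '' f; do
    seen[${f##*/}]=1
done < <(find "$DESTDIR" -type f -print0)

while IFS= read -r -d '' f; do
    # Copy only if the basename was never seen in $DESTDIR.
    [[ ${seen[${f##*/}]} ]] || cp -- "$f" "$DESTDIR"
done < <(find "$SOURCE" -type f -print0)
Because find's output is NUL-delimited here, this version also survives the comma-and-newline problem mentioned earlier.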