LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   bash: Speeding up a script? (https://www.linuxquestions.org/questions/programming-9/bash-speeding-up-a-script-856912/)

lumak 01-18-2011 03:55 PM

It may be better to have an import directory and only allow the script or database manager to muck with the final directory. This would significantly reduce the number of files you have to search just to update the database. Not to mention, when you have so many files, you just don't want people choosing for themselves where to place a file and hoping the indexer and management scripts pick it up and add it properly.

This could be managed by an HTML submission page that uploads files to a holding location. You would probably want some sort of approval process for all submissions before committing them to the final location. This may mean a delay in getting the files onto the playlist... but you are already significantly delayed. Additionally, the final location would always be correct. Even if something did make it through, you would probably want a management script instead of editing the file names directly. That way the database would be updated properly.

If you want to reindex the whole database, this is a different function entirely and deserves its own script.
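A minimal sketch of such an import step, assuming the "00-Incoming" holding directory and the "Artist - Title.mp3" naming convention mentioned elsewhere in the thread (both are assumptions, and the database insert is left as a comment):

```shell
#!/bin/bash
# Sketch: move each new file out of the incoming directory into
# <library>/<FirstLetter>/<Artist>/, the layout DJCharlie describes.
import_tracks() {
    local incoming="$1" library="$2" file base artist letter dest
    shopt -s nullglob
    for file in "$incoming"/*.mp3; do
        base=$(basename "$file")
        artist="${base%% - *}"                  # text before " - " is the artist
        letter=$(printf '%s' "${artist:0:1}" | tr '[:lower:]' '[:upper:]')
        dest="$library/$letter/$artist"
        mkdir -p "$dest"
        mv -n "$file" "$dest/"                  # -n: never overwrite an existing track
        # ...this is also the point to INSERT the new row into the database
    done
}

import_tracks "${1:-/mnt/music/00-Incoming}" "${2:-/mnt/music}"
```

Since only this script touches the final directory, the database and the filesystem can't drift apart the way manual sorting allows.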

DJCharlie 01-18-2011 05:03 PM

I think I get what you're saying, but there's a problem with that.

Right now we get all our music from one trusted source. They're just really bad at tagging the files.

Basically, the way it works (up till now) is we get the files in, do a quick check on them, and then sort them into the proper directories.

BUT, we occasionally make mistakes. For example, say we get a file in named (and tagged) as: Janis Joplin - Brand New Key.mp3

We drop it in /mnt/music/J/Janis Joplin/Janis Joplin - Brand New Key.mp3 and it gets added to the database as such.

BUT, a few days later a DJ plays that on his show, and wait! That's not Janis, that's Melanie (the ORIGINAL artist, btw)! So the DJ reports it as mistagged. I (or one of the other techs) read the report, retag the file, and go on our way.

Next time the indexer script runs, it goes "Oh! New song! Melanie - Brand New Key.mp3" and adds it to the database. So now we have TWO entries for the same song.

So basically, we NEED to do a reindex about once a month. Trouble is, we had a contract renewal first of the month, and we're getting a LOT more music lately.

But I'm definitely interested in getting a separate script up just to index NEW music.

Also, I temporarily fixed the second bottleneck just to get the database back up. I cut out the track length check, and re-ran. Took 3 hours 18 minutes. Suggestions on getting the script to modify the database entries instead of adding duplicates so we can go back later and add the track lengths?
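One way to get updates instead of duplicate rows, assuming the table gets a UNIQUE index on the file path, is MySQL's INSERT ... ON DUPLICATE KEY UPDATE. The table and column names below are hypothetical, and real input would need SQL escaping; this only sketches the statement the indexer could emit per file:

```shell
#!/bin/bash
# Build an upsert statement for one file. Requires a UNIQUE index on
# tracks.path; values are NOT escaped here, so treat this as a sketch only.
upsert_sql() {
    local path="$1" artist="$2" title="$3" length="$4"
    printf "INSERT INTO tracks (path, artist, title, length)
VALUES ('%s', '%s', '%s', %d)
ON DUPLICATE KEY UPDATE artist=VALUES(artist), title=VALUES(title), length=VALUES(length);" \
        "$path" "$artist" "$title" "$length"
}

# The indexer would pipe this into mysql instead of a plain INSERT, e.g.:
#   upsert_sql "/mnt/music/M/Melanie/Melanie - Brand New Key.mp3" \
#              Melanie "Brand New Key" 146 | mysql radio
```

With that in place, a later pass can fill in the missing track lengths by re-running the same statement with the length added.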

Quote:

Originally Posted by lumak (Post 4229239)
It may be better to have an import directory and only allow the script or database manager to muck with the final directory. This would significantly reduce the number of files you have to search just to update the database. Not to mention, when you have so many files, you just don't want people choosing for themselves where to place a file and hoping the indexer and management scripts pick it up and add it properly.

This could be managed by an HTML submission page that uploads files to a holding location. You would probably want some sort of approval process for all submissions before committing them to the final location. This may mean a delay in getting the files onto the playlist... but you are already significantly delayed. Additionally, the final location would always be correct. Even if something did make it through, you would probably want a management script instead of editing the file names directly. That way the database would be updated properly.

If you want to reindex the whole database, this is a different function entirely and deserves its own script.


DJCharlie 01-18-2011 05:04 PM

That's certainly doable, but we'd still be bogged down checking track lengths (testing today showed that to be the biggest bottleneck).
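If the length check itself is the cost, one option is to read only the stream metadata rather than decoding the file. This sketch assumes ffprobe (from FFmpeg) is available, which may not match whatever tool the script currently uses:

```shell
#!/bin/bash
# Hypothetical helper for the track-length bottleneck: ffprobe reads the
# container/header metadata, which is usually far cheaper than a full decode.
track_seconds() {
    ffprobe -v error -show_entries format=duration \
            -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Usage sketch, run as a separate low-priority pass so indexing stays fast:
#   track_seconds "/mnt/music/M/Melanie/Melanie - Brand New Key.mp3"
```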

Quote:

Originally Posted by Dark_Helmet (Post 4229233)
Well, the NAS part is probably why you got no results. You would need to run updatedb with the NAS mounted at least once, or better yet, create a separate database specifically for the files on /mnt/music.

With the NAS mounted, you can create a database specific to the NAS and store it in your own directory if you run the following:
Code:

updatedb -l 0 -o musicfiles.db -U /mnt/music
locate -d musicfiles.db --regex "^/mnt/music/.*\.mp3$"

Of course, the locate command doesn't affect the database, but just shows you the results of a search for mp3 files.

EDIT:
To keep the database current, you would need a cron job that runs updatedb periodically.
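That cron job could be a single crontab entry; the nightly time below and the choice of output path are assumptions, following the updatedb flags shown above:

```shell
# Rebuild the NAS-specific locate database every night at 04:00
0 4 * * *  updatedb -l 0 -o "$HOME/musicfiles.db" -U /mnt/music
```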


markush 01-18-2011 07:11 PM

Hello DJCharlie,
Quote:

Originally Posted by DJCharlie (Post 4229313)
I think I get what you're saying, but there's a problem with that...
...
BUT, we occasionally make mistakes. For example, say we get a file in named (and tagged) as: Janis Joplin - Brand New Key.mp3

We drop it in /mnt/music/J/Janis Joplin/Janis Joplin - Brand New Key.mp3 and it gets added to the database as such.

BUT, a few days later a DJ plays that on his show, and wait! That's not Janis, that's Melanie (the ORIGINAL artist, btw)! So the DJ reports it as mistagged. I (or one of the other techs) read the report, retag the file, and go on our way.

Next time the indexer script runs, it goes "Oh! New song! Melanie - Brand New Key.mp3" and adds it to the database. So now we have TWO entries for the same song.
...
So basically, we NEED to do a reindex about once a month...

I think there is a fatal error in this approach.

When I have a database, the database is responsible for the content of the directory/storage and also for any changes. This means lumak is right with his suggestions.

If one of you has to change an mp3 file, then he/she must fetch the file via the database interface, change it, and pass it back through that interface. And the database itself (via its interface) has to put the file back into the directory.

Every database works this way.

Imagine the storage of a big warehouse: they have a database where every product is registered, and sometimes someone runs into the storage and changes products manually... and expects a bash script to find these changes and update the database. This will never work!

You need a well-defined interface between the people who change files and the database and its storage directory.

Markus

grail 01-18-2011 09:12 PM

Quote:

For example, say we get a file in named (and tagged) as: Janis Joplin - Brand New Key.mp3

We drop it in /mnt/music/J/Janis Joplin/Janis Joplin - Brand New Key.mp3 and it gets added to the database as such.

BUT, a few days later a DJ plays that on his show, and wait! That's not Janis, that's Melanie (the ORIGINAL artist, btw)! So the DJ reports it as mistagged. I (or one of the other techs) read the report, retag the file, and go on our way.

Next time the indexer script runs, it goes "Oh! New song! Melanie - Brand New Key.mp3" and adds it to the database. So now we have TWO entries for the same song.
With this example, if you were to provide a date with each submission, or if the primary key is an increasing unique number, then the database could easily have a script run over it to find duplicates and keep the newest one.

Also, to avoid typos and the like, I would probably have a script create the path and file location instead of doing this by hand for new files.

As markus and lumak have said, if you simply place all new / changed files in a default location and then import all in that directory into the database you can
then easily create a script to produce information on what directories to create and where the file is to be located.
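The keep-the-newest idea can be sketched shell-side. Assuming an export like `mysql -B -e "SELECT id, title FROM tracks"` (the table and columns are assumptions), this picks out the ids of every duplicate except the newest, auto-increment being highest-wins:

```shell
#!/bin/bash
# Given "id<TAB>title" rows on stdin, print the ids of all but the
# newest (highest-id) row for each title.
older_duplicate_ids() {
    sort -t $'\t' -k2,2 -k1,1nr |       # group by title, newest id first
    awk -F '\t' '$2 == prev { print $1 } { prev = $2 }'
}

# Those ids could then be deleted in one statement, e.g.:
#   mysql radio -e "DELETE FROM tracks WHERE id IN (3, 17, ...)"
```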

DJCharlie 01-18-2011 10:32 PM

The database actually uses an auto-increment primary key. The typos don't (usually) come from us: when we get the files, the tags are usually blank or just a series of numbers, while the filenames are right.

I'm moving today's push into 00-Incoming right now, soon as it's done I'll start working on an import script.

lumak 01-18-2011 11:33 PM

The only other thing I could suggest is not only having the "00-Incoming" directory but maybe even having a "01-Submit" directory. This would give you the chance to retag/rename a file, move it out of your way, and allow the import script to be called by a timed job every day.

I don't know at what rate you are getting files, but that may alleviate some other management headaches.

Also, the execution time of 'find' and 'ls' is under a second when there are no subdirectories to check, so the script running on "01-Submit" would go a lot faster, at least for those commands. The only problem is that GUI file managers often lag when listing a directory that contains too many files.



Either way, these suggestions only fix the daily maintenance time of importing new files.

Reindexing the whole database takes time no matter what type of database it is. You may see improvements if you were to write the reindexer as a real program... but I'm not sure about that.

markush 01-19-2011 04:14 AM

Hello everyone,
Quote:

Originally Posted by DJCharlie
...The database actually uses an auto-increment for the primary key...

Quote:

Originally Posted by lumak (Post 4229633)
...Re indexing the whole database takes time no matter what type of database it is. You may see improvements if you were to write the reindexer as a real program... but I'm not sure about that.

For me this smells like a "directory access protocol". What I mean is: you have many records of small size (filenames and ID3 tags), but no need for complex queries. A database based on hashes and hash lookups will be much more efficient than anything with primary keys.
Quote:

Originally Posted by DJCharlie
...The typos don't (usually) come from us...

Typos are normal and have to be corrected in a well-defined way; any database must support this. In my opinion, a normal user must not be allowed to edit files in the mp3 directory manually.

Markus

GazL 01-19-2011 06:08 AM

I'd look at rewriting the script to use a single MySQL "LOAD DATA INFILE" to load all the data into your table in one operation outside of your loop, rather than what you currently do, which is to fire up a mysql process, connect to the database, insert a single row, disconnect, and exit the mysql process inside your loop for each of your 650,000 files! I'm not surprised it's slow.

The MySQL manual suggests that "LOAD DATA INFILE" will be 20x faster than using INSERT as it is, and that's before counting all the unnecessary process-creation/connection/disconnection/teardown overhead your current approach generates.


Here's an example direct from the manual:
Code:

mkfifo /mysql/data/db1/ls.dat
chmod 666 /mysql/data/db1/ls.dat
find / -ls > /mysql/data/db1/ls.dat &
mysql -e "LOAD DATA INFILE 'ls.dat' INTO TABLE t1" db1
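The same pattern applied to the music indexer might look like this sketch. The `tracks` table, its columns, and the use of the filename as a stand-in for real tag extraction are all assumptions:

```shell
#!/bin/bash
# Sketch of the batch approach: emit one tab-separated row per file,
# then load the whole batch with a single LOAD DATA statement instead
# of one mysql process per file.
build_rows() {
    local dir="$1" file
    while IFS= read -r -d '' file; do
        printf '%s\t%s\n' "$file" "$(basename "$file" .mp3)"
    done < <(find "$dir" -name '*.mp3' -print0)
}

# build_rows /mnt/music > /tmp/tracks.dat
# mysql radio -e "LOAD DATA LOCAL INFILE '/tmp/tracks.dat' INTO TABLE tracks (path, title)"
```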


Nominal Animal 01-19-2011 08:50 AM

One thing you should consider is using inotify-tools on the server to track changes to the mp3 files.

Use inotifywait to note which files in your mp3 archive have been deleted, moved, or closed after being open for writing. Just append their full paths to a log file. Do this in a loop, until the script gets the TERM signal. This script would need to always run (as a service), but it'd be extremely lightweight. It should not take more than a dozen lines of simple Bash code; most of the time the script will be sleeping. The inotify mechanism itself is extremely efficient.
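That watcher might be sketched as below; it needs inotify-tools installed, and the archive path and log location are assumptions:

```shell
#!/bin/bash
# Lightweight watcher: append one full path per event to a log file that a
# separate low-priority job consumes later.
watch_music() {
    local archive="${1:-/mnt/music}" log="${2:-/var/tmp/music-changes.log}"
    inotifywait -m -r \
        -e close_write -e moved_to -e moved_from -e delete \
        --format '%w%f' "$archive" >> "$log"
    # -m keeps inotifywait running as a monitor; the shell mostly sleeps.
    # A TERM signal ends it, matching the "run as a service" design.
}
```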

You'd also have a second service or a script in cron to process the log file at regular intervals. (A service can simply sleep for a few seconds at a time if the log file does not exist; a cron script would just exit.)
This script would run at a very low priority so your normal services take precedence (using ionice -n 5 nice -n 10, for example).
The script would take the log file, filter it through sort and uniq so each file path appears only once, then delete all of those paths from the database, and finally reindex and re-add every listed mp3 file that still exists. There are some tricks needed to make sure only one job runs at a time and that failed runs are retried on the next invocation, but those are only needed to make it extremely robust... and with a little bit of help, it's not even difficult.
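A sketch of that processing pass; the database commands are left as comments, and `reindex_one` and the table name are placeholders:

```shell
#!/bin/bash
# Low-priority log processor: rotate the log, reduce it to one line per
# path, then re-check each path against the database.
process_log() {
    local log="$1" batch="$1.processing"
    [ -s "$log" ] || return 0          # nothing to do
    mv "$log" "$batch"                 # rotate so the watcher starts a fresh log
    sort -u "$batch" | while IFS= read -r path; do
        # delete the stale row, then re-add the file if it still exists:
        #   mysql radio -e "DELETE FROM tracks WHERE path='$path'"
        #   [ -f "$path" ] && reindex_one "$path"
        printf 'would reindex: %s\n' "$path"
    done
    rm -f "$batch"
}
```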

This way, if you retag an mp3 file, replace it with the correct copy, or rename or move it, the reindexing will just happen (a bit later on). You probably want to reindex the entire collection every few months just to be sure, but there really is no need to reindex everything each time. Just the files that have changed.

When using inotify tools such as inotifywait, note that they are not synchronous: you get reliable information on events that have already happened. Further changes may have been made to the files before you receive the notification; the events have, after all, occurred some time in the past. The latencies are extremely small in real life, but it's best to keep this in mind when writing the scripts.
Nominal Animal

DJCharlie 01-19-2011 07:11 PM

Ok, I'm back. I've been dealing all day with a rather persistent attacker from Seattle who REALLY wants to get onto our servers.

ANYway... We've gotten the indexing covered now, thanks to you guys, and I'll be talking with the boss tomorrow about instituting policies for retagging existing files.

What we have now:

1. A new script that takes files from /mnt/music/01-Import/ and adds them to the database, and moves them to their proper locations.
2. A new script that searches the database and fixes any entries that don't have the track length field.
3. A new script that searches the database for duplicate entries (still working on that one).

So. Am I missing anything?

