LinuxQuestions.org
Forums > Non-*NIX Forums > Programming
Old 01-18-2011, 04:55 PM   #16
lumak
Member
 
Registered: Aug 2008
Location: Phoenix
Distribution: Arch
Posts: 799
Blog Entries: 32

Rep: Reputation: 109

It may be better to have an import directory and only allow the script or database manager to muck with the final directory. This would significantly reduce the number of files you have to search just to update the database. Not to mention, with so many files, you don't want people choosing for themselves where to place a file and hoping the indexer and management scripts pick it up and add it properly.

This could be managed by an HTML submission page that uploads files to a holding location. You would probably want some sort of approval process for all submissions before committing them to the final location. This may mean a delay in getting the files onto the playlist... but you are already significantly delayed. Additionally, the final location would always be correct. Even if something did slip through, you would probably want a management script instead of editing the file names directly; that way the database would be updated properly.

If you want to reindex the whole database, this is a different function entirely and deserves its own script.
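Something like this is what I mean by an import script. It's only a sketch: it assumes files arrive named "Artist - Title.mp3" and a /mnt/music/<letter>/<artist>/ layout, and the database insert is left as a commented placeholder since I don't know your schema:

```shell
#!/bin/sh
# Sketch of an import step. Assumes "Artist - Title.mp3" naming and a
# /mnt/music/<letter>/<artist>/ layout; all paths are illustrative.

DEST="/mnt/music"

# Compute the destination directory for a given file.
dest_for() {
    name=$(basename "$1")
    artist=${name%% - *}                              # text before " - "
    letter=$(printf '%s' "$artist" | cut -c1 | tr '[:lower:]' '[:upper:]')
    printf '%s/%s/%s\n' "$DEST" "$letter" "$artist"
}

for f in "$DEST/00-Incoming"/*.mp3; do
    [ -e "$f" ] || continue                           # no files waiting
    dir=$(dest_for "$f")
    mkdir -p "$dir"
    mv "$f" "$dir/"
    # Placeholder: record the new location in the database, e.g.
    # mysql -e "INSERT INTO tracks (path) VALUES ('$dir/$(basename "$f")')" musicdb
done
```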

Last edited by lumak; 01-18-2011 at 04:59 PM.
 
Old 01-18-2011, 06:03 PM   #17
DJCharlie
Member
 
Registered: Sep 2010
Posts: 37

Original Poster
Rep: Reputation: 4
I think I get what you're saying, but there's a problem with that.

Right now we get all our music from one trusted source. They're just really bad at tagging the files.

Basically, the way it works (up till now) is we get the files in, do a quick check on them, and then sort them into the proper directories.

BUT, we occasionally make mistakes. For example, say we get a file in named (and tagged) as: Janis Joplin - Brand New Key.mp3

We drop it in /mnt/music/J/Janis Joplin/Janis Joplin - Brand New Key.mp3 and it gets added to the database as such.

BUT, a few days later a DJ plays that on his show, and wait! That's not Janis, that's Melanie (the ORIGINAL artist, btw)! So the DJ reports it as mistagged. I (or one of the other techs) read the report, retag the file, and go on our way.

Next time the indexer script runs, it goes "Oh! New song! Melanie - Brand New Key.mp3" and adds it to the database. So now we have TWO entries for the same song.

So basically, we NEED to do a reindex about once a month. Trouble is, we had a contract renewal at the first of the month, and we're getting a LOT more music lately.

But I'm definitely interested in getting a separate script up just to index NEW music.

Also, I temporarily fixed the second bottleneck just to get the database back up. I cut out the track length check, and re-ran. Took 3 hours 18 minutes. Suggestions on getting the script to modify the database entries instead of adding duplicates so we can go back later and add the track lengths?
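For the "modify instead of adding duplicates" part, one common trick is to put a UNIQUE key on the file path and let MySQL update the existing row when the indexer sees a path it already knows. A sketch, with illustrative table and column names (not necessarily your schema):

```sql
-- Assumes a UNIQUE key on `path`; names and values are illustrative.
INSERT INTO tracks (path, artist, title, length_secs)
VALUES ('/mnt/music/M/Melanie/Melanie - Brand New Key.mp3',
        'Melanie', 'Brand New Key', NULL)
ON DUPLICATE KEY UPDATE
    artist      = VALUES(artist),
    title       = VALUES(title),
    length_secs = COALESCE(VALUES(length_secs), length_secs);
```

Re-running the indexer then corrects fields on the existing row instead of creating a second one, and a later pass can fill in the missing track lengths without touching anything else.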

Quote:
Originally Posted by lumak View Post
It may be better to have an import directory and only allow the script or database manager to muck with the final directory. This would significantly reduce the number of files you have to search just to update the database. Not to mention, with so many files, you don't want people choosing for themselves where to place a file and hoping the indexer and management scripts pick it up and add it properly.

This could be managed by an HTML submission page that uploads files to a holding location. You would probably want some sort of approval process for all submissions before committing them to the final location. This may mean a delay in getting the files onto the playlist... but you are already significantly delayed. Additionally, the final location would always be correct. Even if something did slip through, you would probably want a management script instead of editing the file names directly; that way the database would be updated properly.

If you want to reindex the whole database, this is a different function entirely and deserves its own script.
 
Old 01-18-2011, 06:04 PM   #18
DJCharlie
Member
 
Registered: Sep 2010
Posts: 37

Original Poster
Rep: Reputation: 4
That's certainly doable, but we'd still be bogged down checking track lengths (from testing today that's shown to be the biggest bottleneck).

Quote:
Originally Posted by Dark_Helmet View Post
Well, the NAS part is probably why you got no results. You would need to run updatedb with the NAS mounted at least once, or better yet, create a separate database specifically for the files on /mnt/music.

With the NAS mounted, you can create a database specific to the NAS and store it in your own directory by running the following:
Code:
updatedb -l 0 -o musicfiles.db -U /mnt/music
locate -d musicfiles.db --regex "^/mnt/music/.*\.mp3$"
Of course, the locate command doesn't affect the database; it just shows you the results of a search for mp3 files.

EDIT:
To keep the database current, you would need to run a cron job to run updatedb as necessary.
 
Old 01-18-2011, 08:11 PM   #19
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,974

Rep: Reputation: 849
Hello DJCharlie,
Quote:
Originally Posted by DJCharlie View Post
I think I get what you're saying, but there's a problem with that...
...
BUT, we occasionally make mistakes. For example, say we get a file in named (and tagged) as: Janis Joplin - Brand New Key.mp3

We drop it in /mnt/music/J/Janis Joplin/Janis Joplin - Brand New Key.mp3 and it gets added to the database as such.

BUT, a few days later a DJ plays that on his show, and wait! That's not Janis, that's Melanie (the ORIGINAL artist, btw)! So the DJ reports it as mistagged. I (or one of the other techs) read the report, retag the file, and go on our way.

Next time the indexer script runs, it goes "Oh! New song! Melanie - Brand New Key.mp3" and adds it to the database. So now we have TWO entries for the same song.
...
So basically, we NEED to do a reindex about once a month...
I think there is a fatal error in this approach.

When I have a database, the database is responsible for the content of the directory/storage and also for any changes. This means lumak is right with his suggestions.

If one of you has to change an mp3 file, then he/she must take the file via the database interface, change it, and pass it back to the database interface. The database itself (via its interface) then puts the file back into the directory.

Every database works this way.

Imagine the storage of a big warehouse: they have a database where every product is registered, and sometimes someone runs into the storage and changes products manually... and expects a bash script to find these changes and update the database. This will never work!

You need a well-defined interface between the people who change files and the database and its storage directory.

Markus

Last edited by markush; 01-18-2011 at 08:13 PM.
 
Old 01-18-2011, 10:12 PM   #20
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,655

Rep: Reputation: 1967
Quote:
For example, say we get a file in named (and tagged) as: Janis Joplin - Brand New Key.mp3

We drop it in /mnt/music/J/Janis Joplin/Janis Joplin - Brand New Key.mp3 and it gets added to the database as such.

BUT, a few days later a DJ plays that on his show, and wait! That's not Janis, that's Melanie (the ORIGINAL artist, btw)! So the DJ reports it as mistagged. I (or one of the other techs) read the report, retag the file, and go on our way.

Next time the indexer script runs, it goes "Oh! New song! Melanie - Brand New Key.mp3" and adds it to the database. So now we have TWO entries for the same song.
With this example, if you were to provide a date with each submission, or if the primary key is an increasing unique number, then the database could easily have
a script run over it to find duplicates and keep the newest one.

Also, to avoid typos and the like, I would probably have a script create the path and file location instead of doing this by hand for new files.

As markus and lumak have said, if you simply place all new / changed files in a default location and then import all in that directory into the database you can
then easily create a script to produce information on what directories to create and where the file is to be located.
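With an auto-increment id, the "keep the newest" cleanup can be a single self-join delete. A sketch with illustrative table and column names:

```sql
-- Delete every row for which a newer row (higher id) with the same
-- artist/title exists. Names are illustrative, not the actual schema.
DELETE t1 FROM tracks t1
JOIN tracks t2
  ON  t2.artist = t1.artist
  AND t2.title  = t1.title
  AND t2.id     > t1.id;
```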
 
Old 01-18-2011, 11:32 PM   #21
DJCharlie
Member
 
Registered: Sep 2010
Posts: 37

Original Poster
Rep: Reputation: 4
The database actually uses an auto-increment primary key. The typos don't (usually) come from us. When we get the files, the tags are usually blank or just a series of numbers, while the filenames are right.

I'm moving today's push into 00-Incoming right now, soon as it's done I'll start working on an import script.
 
Old 01-19-2011, 12:33 AM   #22
lumak
Member
 
Registered: Aug 2008
Location: Phoenix
Distribution: Arch
Posts: 799
Blog Entries: 32

Rep: Reputation: 109
The only other thing I could suggest is not only having the "00-Incoming" directory but maybe even having a "01-Submit" directory. This would give you the chance to retag/rename a file, move it out of your way, and allow the import script to be called by a timed job every day.
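The timed job could be a single crontab entry; the script path and schedule here are hypothetical:

```
# Run the import script nightly at 03:00 (path and time are hypothetical)
0 3 * * * /usr/local/bin/music-import.sh >> /var/log/music-import.log 2>&1
```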

I don't know at what rate you are getting files, but that may alleviate some other management headaches.

Also, the execution time of 'find' and 'ls' is less than a second when there are no subdirectories to check, so the script running on "01-Submit" would go a lot faster, at least for that command. The only problem is that GUI file managers often lag when trying to list a directory that contains too many files.



Either way, these suggestions only fix the daily maintenance time of importing new files.

Reindexing the whole database takes time no matter what type of database it is. You may see improvements if you were to write the reindexer as a real program... but I'm not sure about that.
 
Old 01-19-2011, 05:14 AM   #23
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,974

Rep: Reputation: 849
Hello together,
Quote:
Originally Posted by DJCharlie
...The database actually uses an auto-increment for the primary key...
Quote:
Originally Posted by lumak View Post
...Re indexing the whole database takes time no matter what type of database it is. You may see improvements if you were to write the reindexer as a real program... but I'm not sure about that.
For me this smells like a "directory access protocol". What I mean is: you have many records of small size (filenames and ID3 tags), but no need for complex queries. A database based on hashes and hash lookups will be much more efficient than anything with primary keys.
Quote:
Originally Posted by DJCharlie
...The typos don't (usually) come from us...
Typos are normal and have to be corrected in a well-defined way; any database must support this. In my opinion, a normal user must not be allowed to edit files in the mp3 directory manually.

Markus
 
Old 01-19-2011, 07:08 AM   #24
GazL
Senior Member
 
Registered: May 2008
Posts: 3,481

Rep: Reputation: 1016
I'd look at rewriting the script to use a single mysql "LOAD DATA INFILE" to load all the data into your table in one operation outside your loop, rather than what you currently do, which is to fire up a mysql process, connect to the database, insert a single row, disconnect, and exit the mysql process, inside your loop, for each of your 650,000 files! I'm not surprised it's slow.

The mysql manual suggests that using "load data infile" will be 20x faster than using 'insert' as it is, and that's without all the unnecessary process-creation/connection/disconnection/process-teardown overhead that your current approach will be generating.


Here's an example direct from the manual:
Code:
mkfifo /mysql/data/db1/ls.dat
chmod 666 /mysql/data/db1/ls.dat
find / -ls > /mysql/data/db1/ls.dat &
mysql -e "LOAD DATA INFILE 'ls.dat' INTO TABLE t1" db1
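Adapted to your music indexer, the same idea might look like this: one pass writes a tab-separated manifest, then a single LOAD DATA replaces all the per-file INSERTs. Table and column names here are assumptions:

```shell
# Build a tab-separated manifest (path, size in bytes) of every mp3 under
# the given directory. Requires GNU find for -printf.
build_manifest() {
    find "$1" -name '*.mp3' -printf '%p\t%s\n'
}

# Usage sketch -- table/column names are assumptions, not your schema:
#   build_manifest /mnt/music > /tmp/tracks.tsv
#   mysql -e "LOAD DATA LOCAL INFILE '/tmp/tracks.tsv' INTO TABLE tracks (path, size_bytes)" musicdb
```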

Last edited by GazL; 01-19-2011 at 07:15 AM. Reason: typo
 
Old 01-19-2011, 09:50 AM   #25
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 942
One thing you should consider is using inotify-tools on the server to track changes to the mp3 files.

Use inotifywait to note which files in your mp3 archive have been deleted, moved, or closed after being open for writing. Just append their full paths to a log file. Do this in a loop, until the script gets the TERM signal. This script would need to always run (as a service), but it'd be extremely lightweight. It should not take more than a dozen lines of simple Bash code; most of the time the script will be sleeping. The inotify mechanism itself is extremely efficient.
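A minimal sketch of that watcher, assuming inotify-tools is installed; the watch and log paths are hypothetical:

```shell
# Append the path of every file under $1 that is deleted, moved, or
# closed after being open for writing to the log file $2. Runs until
# killed (inotifywait -m monitors forever). Requires inotify-tools.
watch_archive() {
    inotifywait -m -r \
        -e close_write -e moved_to -e moved_from -e delete \
        --format '%w%f' "$1" |
    while read -r path; do
        printf '%s\n' "$path" >> "$2"
    done
}

# Run as a long-lived service, e.g.:
#   watch_archive /mnt/music /var/lib/musicdb/changes.log
```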

You'd also have a second service or a script in cron to process the log file at regular intervals. (A service can simply sleep for a few seconds at a time if the log file does not exist; a cron script would just exit.)
This script would run at a very low priority so your normal services would get priority. (Using ionice -n +5 nice -n +10 for example.)
The script would take the log file, filter it through sort and uniq to get each file path only once, then delete all of them from the database, and finally reindex and re-add all the mp3 files that still exist. There are some necessary tricks to make sure there's only one job running, and that failed runs are restarted at the next invocation, but those are only needed to make it extremely robust... and with a little bit of help, it's not even difficult.
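The periodic processor could be sketched like this; the actual delete-and-reindex step is left as a placeholder since it depends on the database layout:

```shell
# Reduce the accumulated log to unique paths and hand each one to a
# (placeholder) reindex step. The log file name is up to you.
process_log() {
    [ -f "$1" ] || return 0        # nothing to do if no log yet
    sort -u "$1" | while read -r path; do
        # Placeholder: delete the DB row for $path, re-add if the file
        # still exists on disk.
        echo "reindex: $path"
    done
}
```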

This way, if you retag an mp3 file, or replace it with the correct copy, or rename or move it, the reindexing will just happen (a bit later on). You probably want to reindex the entire collection every few months just to be sure, but there really is no need to reindex it all. Just those that have changed.

When using inotify tools such as inotifywait, note that they are not synchronous. You get reliable information on events that have already happened. It is quite possible further changes have been made to the files, but you haven't yet gotten the notification; the events have, after all, occurred some time in the past. The latencies are extremely small in real life, but it's best to keep it in mind when writing the scripts.
Nominal Animal

Last edited by Nominal Animal; 03-21-2011 at 07:17 AM.
 
Old 01-19-2011, 08:11 PM   #26
DJCharlie
Member
 
Registered: Sep 2010
Posts: 37

Original Poster
Rep: Reputation: 4
Ok, I'm back. Been dealing all day with a rather persistent attacker from Seattle who REALLY wants to get onto our servers.

ANYway... We've gotten the indexing covered now, thanks to you guys, and I'll be talking with the boss tomorrow about instituting policies for retagging existing files.

What we have now:

1. A new script that takes files from /mnt/music/01-Import/ and adds them to the database, and moves them to their proper locations.
2. A new script that searches the database and fixes any entries that don't have the track length field.
3. A new script that searches the database for duplicate entries (still working on that one).

So. Am I missing anything?
 
  



Tags
bash, mysql

