LinuxQuestions.org
Old 01-17-2011, 07:09 PM   #1
DJCharlie
Member
 
Registered: Sep 2010
Posts: 37

Rep: Reputation: 4
bash: Speeding up a script?


Ok folks, here at the station we maintain a database of all the music we have. The problem is, the bash script we're using to index the music takes FOREVER. A copy of the script is below. Any suggestions on speeding it up? 15+ hours to rebuild the database is just way too much!

We need it to update the database daily, but using find even with the new files added takes over 6 hours.

Code:
#!/bin/bash

mv /var/www/music/index.hold /var/www/music/index.html

before=$(date +%s)
tdy=$(date +%u)

sqluser="username"
sqlpass="password"
sqldb="theArchive"
sqltbl="music"

echo "TRUNCATE TABLE music;" | mysql -u$sqluser -p$sqlpass -hlocalhost $sqldb

addfiles="/usr/local/autodj/tmp/theArchive.tmp"
addfiles2="/usr/local/autodj/tmp/theArchive2.tmp"

find "/mnt/music" -type f -name "*.mp3" > $addfiles

sed -i '/Incomplete/d' $addfiles
sed -i '/00-Incoming/d' $addfiles
cp $addfiles $addfiles2

sort -f $addfiles2 > $addfiles

cat $addfiles |while true
do read LINE || break

sline=${LINE#*/}
sline=${sline#*/}
sline=${sline#*/}
sdir=$sline
sline=${sline#*/}
sline=${sline#*/}
sline=${sline%.mp3*}

sartist=${sline% - *}
stitle=${sline#* - }

spath=$LINE

sdir=${sdir%/*}
sdir=${sdir%/*}

secs=`python -c "import tagpy; f = tagpy.FileRef('$LINE'); print f.audioProperties().length"`

hours=$((secs / 3600))
seconds=$((secs % 3600))
minutes=$((secs / 60))
seconds=$((secs % 60))

if [ $seconds -lt "10" ]
  then
    seconds="0$seconds"
fi
tlen="$minutes:$seconds"

echo "INSERT INTO $sqltbl VALUES (NULL,\"${sartist}\",\"${stitle}\",\"${spath}\",\"${sdir}\",\"${tlen}\");" | mysql -u$sqluser -p$sqlpass -htranscoder1 $sqldb

done

after=$(date +%s)

elapsed_seconds=$(expr $after - $before)

hou=$(expr $elapsed_seconds / 3600)
min=$(expr $elapsed_seconds % 3600 / 60)
sec=$(expr $elapsed_seconds % 60)
echo "FINISHED!"
echo "Elapsed Time: $hou:$min:$sec"

mv /var/www/music/index.html /var/www/music/index.hold

exit
 
Old 01-17-2011, 07:19 PM   #2
lugoteehalt
Senior Member
 
Registered: Sep 2003
Location: UK
Distribution: Debian
Posts: 1,215
Blog Entries: 2

Rep: Reputation: 49
I probably shouldn't reply to this, because I've no real idea. But at first glance, is it the 'find' command that's taking up the time? Could you use 'locate' instead?
 
Old 01-17-2011, 07:26 PM   #3
DJCharlie
Member
 
Registered: Sep 2010
Posts: 37

Original Poster
Rep: Reputation: 4
There's actually 2 bottlenecks, and I have absolutely no idea how to fix either of them.

1. find is a bottleneck taking about 1 hour 18 minutes to run (on average).

2. the track length calculation takes the other 13 hours, 22 minutes. Is there ANY faster way to get the track length from an mp3?
 
Old 01-17-2011, 08:16 PM   #4
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 369
I can see why #2 is the bottleneck. You are restarting the python interpreter, reloading the tagpy module, and then "cleaning up" for every file you find.

There are a few basic approaches to take. Undoubtedly, there are others...

1. Convert your entire script to Python. Everything you do in the shell script can be done in Python. That way, you incur the overhead of starting the interpreter only once. Of course, if you have no python developers, that's a bit of a problem. I'm only starting to play with python, and I might be able to help, but I can't guarantee anything.

2. Convert the minimal python script to handle more than one filename. If the small python command is extended to handle two filenames at a time (and return, say, two space-separated values representing track length), you cut the overhead in half (note: the overhead is not necessarily equal to the total run-time you're seeing). It should scale too--as in, get the script to handle five filenames at a time, and the overhead drops to 20% of processing files one-at-a-time.

3. Consider something other than python. There are command line tools available that will read mp3 tags. One such tool is id3v2. Invoking one of those specialized tools is likely to use far fewer resources and time than invoking the python interpreter.


As for the find command, you might consider the "locate" command. For instance, start a cron job every night (or more frequently, as you need) to run "updatedb." Then run locate. In your case, it's probably something like:
Code:
locate --regex "^/mnt/music.*\.mp3$"
Try that on the command line, see what it spits out, and time it.
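
The batching idea in point 2 carries over to any command-line tool that accepts several filenames per call. Here is a minimal sketch, assuming GNU find and xargs; `batch_run` is a hypothetical helper name, and the command passed to it would be whichever tag reader you settle on:

```shell
#!/bin/bash
# batch_run DIR CMD [ARGS...]
# Hypothetical helper: runs CMD on large batches of .mp3 files under DIR
# instead of once per file, so the start-up cost is paid far less often.
batch_run() {
    local dir=$1
    shift
    # -print0 / -0 keep filenames with spaces or quotes intact.
    # -r (GNU xargs) skips running CMD when no files are found.
    find "$dir" -type f -name '*.mp3' -print0 | xargs -0 -r "$@"
}

# Example (not run here): byte counts for every file, in a handful of wc calls
#   batch_run /mnt/music wc -c
```

By default xargs packs as many filenames into each invocation as the command-line length limit allows, so hundreds of thousands of files typically turn into a few hundred process launches.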
 
Old 01-17-2011, 09:40 PM   #5
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,623

Rep: Reputation: 1944
Well I am not sure on the impact of time compared to the python script, but ffmpeg could give you the time without any further calculation:
Code:
ffmpeg -i <mp3_file> |& awk '/Duration/{print gensub(/,/,"","1",$2)}'
There are some cosmetic changes that could be made to the sed calls and maybe some of your parameter substitutions, but again I am not sure there is any net gain there.
Code:
sed -ri '/Incomplete|00-Incoming/d' $addfiles
So the following might be reducible if you would show a typical path:
Code:
sline=${LINE#*/}
sline=${sline#*/}
sline=${sline#*/}
sdir=$sline
sline=${sline#*/}
sline=${sline#*/}
sline=${sline%.mp3*}
Another minor cosmetic thing would be to simply direct your file straight into while instead of using the break:
Code:
while read -r LINE
do
    <your stuff here>
done<$addfiles
Like I said though, most of these are cosmetic and offer no real time savings that I am aware of.
 
Old 01-17-2011, 10:59 PM   #6
lumak
Member
 
Registered: Aug 2008
Location: Phoenix
Distribution: Arch
Posts: 799
Blog Entries: 32

Rep: Reputation: 109
Why does find take so long? How many files are under /mnt/music? I don't see why this would take so long.

What types of files are under the directory? Have you tried 'ls -R */*.mp3'? However, on my music collection, find appeared to be a few fractions of a second faster than 'ls'

Are they all new files or are some of them old? Is this an import only directory?

Is there a format to the file names that would let you tailor a command?

Could you restrict the find command to one file system or one directory to save time?

Also, use single quotes on the find command. I've seen weird issues when using "" and *. The other option is to use no quotes and escape the *, e.g. \*.mp3

Last edited by lumak; 01-17-2011 at 11:06 PM.
 
Old 01-17-2011, 11:41 PM   #7
DJCharlie
Member
 
Registered: Sep 2010
Posts: 37

Original Poster
Rep: Reputation: 4
Ok, a batch of replies at once...

I've gotten a bit of help from the fine folks in #bash on freenode. It's -slightly- faster, but going from 15+ hours to 12+ hours is still way too long...

the locate --regex "^/mnt/music.*\.mp3$" gave me exactly nothing. /mnt/music is actually a network share on a NAS.

As of 6am this morning, /mnt/music holds 659,318 music files (hey, we're a radio station!).

All the files are stored like this: /mnt/music/[First letter of Artist Name]/[Artist Name]/[Artist Name - Title].mp3
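
With that layout, the artist, title, and top-level directory can all be cut out of the path with parameter expansion alone, with no external commands. Here is a small sketch for checking the splitting on a sample path; `parse_track` is a hypothetical name, and the sample artist and title are made up:

```shell
#!/bin/bash
# parse_track PATH [ROOT]
# Prints "dir|artist|title" for ROOT/Letter/Artist/Artist - Title.mp3.
parse_track() {
    local path=$1 root=${2:-/mnt/music}
    local base=${path##*/}        # strip directories -> "Artist - Title.mp3"
    base=${base%.mp3}             # strip the extension
    local artist=${base% - *}     # everything before the last " - "
    local title=${base#* - }      # everything after the first " - "
    local dir=${path#"$root"/}    # drop the root prefix
    dir=${dir%%/*}                # keep only the first-letter directory
    printf '%s|%s|%s\n' "$dir" "$artist" "$title"
}

parse_track "/mnt/music/E/Example Artist/Example Artist - Example Song.mp3"
# prints: E|Example Artist|Example Song
```

Names that contain " - " more than once will split ambiguously, so those are worth spot-checking.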

Here's the code as-is:

Code:
#!/bin/bash

mv /var/www/music/index.hold /var/www/music/index.html

before=$(date +%s)
tdy=$(date +%u)

sqluser="username"
sqlpass="password"
sqldb="theArchive"
sqltbl="music"

echo "TRUNCATE TABLE music;" | mysql -u$sqluser -p$sqlpass -hlocalhost $sqldb

addfiles="/usr/local/autodj/tmp/theArchive.tmp"
addfiles2="/usr/local/autodj/tmp/theArchive2.tmp"

while read line; do

sline=${line#*/}
sline=${sline#*/}
sline=${sline#*/}
sdir=$sline
sline=${sline#*/}
sline=${sline#*/}
sline=${sline%.mp3*}

sartist=${sline% - *}
stitle=${sline#* - }

spath=$line

sdir=${sdir%/*}
sdir=${sdir%/*}

secs=`mp3info -p "%S" "$line"`

# secs=`python -c "import tagpy; f = tagpy.FileRef('$line'); print f.audioProperties().length"`

hours=$((secs / 3600))
seconds=$((secs % 3600))
minutes=$((secs / 60))
seconds=$((secs % 60))

if [ $seconds -lt "10" ]
  then
    seconds="0$seconds"
fi
tlen="$minutes:$seconds"

echo "INSERT INTO $sqltbl VALUES (NULL,\"${sartist}\",\"${stitle}\",\"${spath}\",\"${sdir}\",\"${tlen}\");" | mysql -u$sqluser -p$sqlpass -hlocalhost $sqldb

done < <(find /mnt/music -name '00-Incoming' -prune -o -name 'Incomplete' -prune -o -type f -name "*.mp3" -print)

after=$(date +%s)

elapsed_seconds=$(expr $after - $before)

hou=$(expr $elapsed_seconds / 3600)
min=$(expr $elapsed_seconds % 3600 / 60)
sec=$(expr $elapsed_seconds % 60)
echo "FINISHED!"
echo "Elapsed Time: $hou:$min:$sec"

mv /var/www/music/index.html /var/www/music/index.hold

exit
 
Old 01-18-2011, 01:47 AM   #8
DJCharlie
Member
 
Registered: Sep 2010
Posts: 37

Original Poster
Rep: Reputation: 4
Sorry, missed a couple of your questions...

It's a mix of new and old files. Part of the issue with that is when a DJ finds a mislabelled song, and fixes it, the indexer would register it as "new" again.

It's all one filesystem, with about 38 directories off the root, several thousand under that.

00-Incoming is the import-only "holding area," where files stay until we check them off the lists for licensing. Incomplete is for files currently being transferred.

We tried single quotes on find, but since the filenames have spaces, it didn't work well.

Quote:
Originally Posted by lumak View Post
Why does find take so long? How many files are under /mnt/music? I don't see why this would take so long.

What types of files are under the directory? Have you tried 'ls -R */*.mp3'? However, on my music collection, find appeared to be a few fractions of a second faster than 'ls'

Are they all new files or are some of them old? Is this an import only directory?

Is there a format to the file names that would let you tailor a command?

Could you restrict the find command to one file system or one directory to save time?

Also, use single quotes on the find command. I've seen weird issues when using "" and *. The other option is to use no quotes and escape the *, e.g. \*.mp3
 
Old 01-18-2011, 02:13 AM   #9
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,623

Rep: Reputation: 1944
Well here are some of those cosmetic changes for you, along with adjustment to mp3info use:
Code:
#!/bin/bash

mv /var/www/music/index.hold /var/www/music/index.html

MUSIC=/mnt/music

before=$(date +%s)
tdy=$(date +%u)

sqluser="username"
sqlpass="password"
sqldb="theArchive"
sqltbl="music"

echo "TRUNCATE TABLE music;" | mysql -u$sqluser -p$sqlpass -hlocalhost $sqldb

addfiles="/usr/local/autodj/tmp/theArchive.tmp"
addfiles2="/usr/local/autodj/tmp/theArchive2.tmp"

while read line; do

sline=${line##*/}
sline=${sline%.mp3*}

sartist=${sline% - *}
stitle=${sline#* - }

sdir=${line#$MUSIC/}
sdir=${sdir%%/*}

# %02s zero-pads the seconds, matching the manual padding in the earlier script
tlen=`mp3info -p "%m:%02s" "$line"`

echo "INSERT INTO $sqltbl VALUES (NULL,\"${sartist}\",\"${stitle}\",\"${line}\",\"${sdir}\",\"${tlen}\");" | mysql -u$sqluser -p$sqlpass -hlocalhost $sqldb

done < <(find $MUSIC -name '00-Incoming' -prune -o -name 'Incomplete' -prune -o -type f -name "*.mp3" -print)

after=$(date +%s)

elapsed_seconds=$(expr $after - $before)

hou=$(expr $elapsed_seconds / 3600)
min=$(expr $elapsed_seconds % 3600 / 60)
sec=$(expr $elapsed_seconds % 60)
echo "FINISHED!"
echo "Elapsed Time: $hou:$min:$sec"

mv /var/www/music/index.html /var/www/music/index.hold

exit
Another program I saw in another thread here is called parallel, which could be of use to you; check out the details in the links included there.
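
One way a parallel runner could be applied here is to spread the per-file tag reads over several worker processes. The sketch below uses `xargs -P` as a stand-in rather than GNU parallel itself; `parallel_run`, the job count, and the batch size are illustrative assumptions. With the files on a NAS, the network may well be the limit rather than the CPU, so it is worth timing with a few different job counts:

```shell
#!/bin/bash
# parallel_run DIR JOBS BATCH CMD [ARGS...]
# Hypothetical helper: runs up to JOBS copies of CMD at once, each on at
# most BATCH .mp3 files found under DIR.
parallel_run() {
    local dir=$1 jobs=$2 batch=$3
    shift 3
    find "$dir" -type f -name '*.mp3' -print0 |
        xargs -0 -r -P "$jobs" -n "$batch" "$@"
}

# Example (not run here; check mp3info's man page for the format specifiers):
#   parallel_run /mnt/music 4 200 mp3info -p '%S\t%F\n' > lengths.tsv
```

Output lines from different workers arrive in no particular order, so each line should carry the filename it belongs to.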
 
Old 01-18-2011, 04:34 AM   #10
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,623

Rep: Reputation: 1944
Something to add here ... are we going about this all wrong? I am guessing that you do not receive 600K+ new files every day, yet we are truncating the database each time.
Perhaps we should be looking at a method to add only new and/or updated files, and perhaps have the database do some of the processing?

Just a thought.
 
Old 01-18-2011, 10:15 AM   #11
DJCharlie
Member
 
Registered: Sep 2010
Posts: 37

Original Poster
Rep: Reputation: 4
While you're right, we don't get 600k+ per day, we DO need to reindex from scratch occasionally (with this many files there's always duplicates and mistagged files being corrected). Up until recently, it's been easier to wipe out the database and start fresh.

The parallel utility looks interesting, but I don't see how it could be applied.

Quote:
Originally Posted by grail View Post
Something to add here ... are we going about this all wrong? I am guessing that you do not receive 600K+ new files every day, yet we are truncating the database each time.
Perhaps we should be looking at a method to add only new and/or updated files, and perhaps have the database do some of the processing?

Just a thought.
 
Old 01-18-2011, 10:51 AM   #12
DJCharlie
Member
 
Registered: Sep 2010
Posts: 37

Original Poster
Rep: Reputation: 4
After having a few more minutes to wake up... Would it be possible to have the script check if an entry already exists before trying to add it? Or would that slow it down even more?
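
Checking from the script would mean an extra query per file, so it is probably cheaper to let MySQL decide. Here is a hedged sketch: it assumes a UNIQUE index on the path column and uses made-up column names (`artist`, `title`, `path`, `dir`, `length`), since the real table layout isn't shown in this thread. The functions only print SQL, so a single mysql process can read the whole stream:

```shell
#!/bin/bash
# sql_quote STRING -> STRING as a single-quoted MySQL literal.
# Assumes the default sql_mode, where backslash is an escape character.
sql_quote() {
    local s=$1 out='' c i
    for ((i = 0; i < ${#s}; i++)); do
        c=${s:i:1}
        case $c in
            "'") out+="''" ;;     # double every single quote
            '\') out+='\\' ;;     # double every backslash
            *)   out+=$c ;;
        esac
    done
    printf "'%s'" "$out"
}

# emit_upsert TABLE ARTIST TITLE PATH DIR LENGTH
# Column names are hypothetical; a UNIQUE index on path is assumed.
emit_upsert() {
    printf 'INSERT INTO %s (artist, title, path, dir, length) VALUES (%s, %s, %s, %s, %s)\n' \
        "$1" "$(sql_quote "$2")" "$(sql_quote "$3")" "$(sql_quote "$4")" \
        "$(sql_quote "$5")" "$(sql_quote "$6")"
    printf '  ON DUPLICATE KEY UPDATE artist = VALUES(artist), title = VALUES(title), length = VALUES(length);\n'
}

# Usage idea (not run here): call emit_upsert inside the while loop and pipe
# the loop as a whole into mysql:   done < <(find ...) | mysql ... "$sqldb"
```

A path that is already in the table is updated in place and a new path is inserted, so the TRUNCATE is only needed for a full rebuild. Quoting the values also keeps titles that contain quote characters from breaking the statement.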
 
Old 01-18-2011, 11:17 AM   #13
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
Why not record the time that you last ran find, and only find files changed after that time? Also, as said before, there are better alternatives to find.
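
Recording the last run can be done with a timestamp file and find's `-newer` test. Here is a minimal sketch; the stamp-file location and the `changed_since_last_run` name are hypothetical. Note that `-newer` compares modification times, so a retagged file (whose contents are rewritten) shows up again, while a file that was only renamed may not:

```shell
#!/bin/bash
# changed_since_last_run DIR STAMP
# Lists .mp3 files under DIR modified after STAMP's timestamp, then
# refreshes STAMP. On the first run (no STAMP yet) every file is listed.
changed_since_last_run() {
    local dir=$1 stamp=$2 next
    next=$(mktemp) || return 1      # marks the moment this scan started
    if [ -e "$stamp" ]; then
        find "$dir" -type f -name '*.mp3' -newer "$stamp" -print
    else
        find "$dir" -type f -name '*.mp3' -print
    fi
    # Use the scan's start time as the new stamp, so files changed while
    # the scan was running are picked up on the next run.
    mv -- "$next" "$stamp"
}

# Example (not run here):
#   changed_since_last_run /mnt/music /usr/local/autodj/tmp/last_index.stamp
```

The database still has to be built once with a full scan, and entries for deleted or renamed files need a separate cleanup pass.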
 
Old 01-18-2011, 11:59 AM   #14
DJCharlie
Member
 
Registered: Sep 2010
Posts: 37

Original Poster
Rep: Reputation: 4
Good idea, but we have to actually build the database first.

Plus, that still doesn't prevent duplicate/changed entries.
 
Old 01-18-2011, 04:50 PM   #15
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 369
Quote:
Originally Posted by DJCharlie
the locate --regex "^/mnt/music.*\.mp3$" gave me exactly nothing. /mnt/music is actually a network share on a NAS.
Well, the NAS part is probably why you got no results. You would need to run updatedb with the NAS mounted at least once, or better yet, create a separate database just for the files on /mnt/music.

With the NAS mounted, you can create a database specific to the NAS and store it in your own directory if you run the following:
Code:
updatedb -l 0 -o musicfiles.db -U /mnt/music
locate -d musicfiles.db --regex "^/mnt/music/.*\.mp3$"
Of course, the locate command doesn't affect the database, but just shows you the results of a search for mp3 files.

EDIT:
To keep the database current, you would need to run a cron job to run updatedb as necessary.

Last edited by Dark_Helmet; 01-18-2011 at 04:56 PM.
 
  


Tags
bash, mysql

