LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 10-29-2008, 03:33 AM   #1
robel
Member
 
Registered: Oct 2008
Location: Norway
Distribution: Slackware
Posts: 77

Rep: Reputation: 19
MYSQL: SQL query takes 10 minutes to execute. Need help optimizing.


Using MySQL version 5.0.67 running on a 2.40GHz dual core Intel CPU.

I have the following table:

Code:
CREATE TABLE `files` (
  `name` varchar(250) NOT NULL,
  `type` char(1) NOT NULL,
  `size` bigint(20) NOT NULL,
  `date` datetime NOT NULL,
  `md5` char(32) default NULL,
  `diskid` int(11) default NULL,
  KEY `md5index` (`md5`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
It contains 28573 rows.

md5 is the md5sum of the files.

Now, some of the files are identical.

Code:
SELECT count(*), md5 
FROM files 
WHERE md5 IS NOT NULL
GROUP BY md5 
HAVING count(*) > 1;
Mysql says 212 rows in set (0.03 sec)

Say I want to retrieve the name of all files which has more than 3 duplicate md5

Code:
SELECT name, date, md5 
FROM files 
WHERE md5 IN 
  (SELECT md5 
   FROM files 
   WHERE md5 IS NOT NULL 
   GROUP BY md5 
   HAVING count(*) > 3
  );
This query takes 10 minutes to execute. The subquery returns 33 rows in set (0.02 sec) and the entire query returns 334 rows in set (10 min 4.01 sec)

This is obviously a bad query. Reading the entire table takes 0.10 sec.

In Sybase this is rather easy because I can avoid the subquery, but as far as I know MySQL would not work this way. The same applies to MSSQL.

Any suggestion on how to improve this query?
 
Old 10-29-2008, 07:41 AM   #2
harry2006
Member
 
Registered: Aug 2006
Location: /hawaii/honolulu/downtown
Distribution: Fedora 10[Cambridge] and Ubuntu 9.04[Jaunty]
Posts: 201

Rep: Reputation: 30
Quote:
Originally Posted by robel View Post
Using MySQL version 5.0.67 running on a 2.40GHz dual core Intel CPU.

I have the following table:

Code:
CREATE TABLE `files` (
  `name` varchar(250) NOT NULL,
  `type` char(1) NOT NULL,
  `size` bigint(20) NOT NULL,
  `date` datetime NOT NULL,
  `md5` char(32) default NULL,
  `diskid` int(11) default NULL,
  KEY `md5index` (`md5`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
It contains 28573 rows.

md5 is the md5sum of the files.

Now, some of the files are identical.

Code:
SELECT count(*), md5 
FROM files 
WHERE md5 IS NOT NULL
GROUP BY md5 
HAVING count(*) > 1;
Mysql says 212 rows in set (0.03 sec)

Say I want to retrieve the name of all files which has more than 3 duplicate md5

Code:
SELECT name, date, md5 
FROM files 
WHERE md5 IN 
  (SELECT md5 
   FROM files 
   WHERE md5 IS NOT NULL 
   GROUP BY md5 
   HAVING count(*) > 3
  );
This query takes 10 minutes to execute. The subquery returns 33 rows in set (0.02 sec) and the entire query returns 334 rows in set (10 min 4.01 sec)

This is obviously a bad query. Reading the entire table takes 0.10 sec.

In Sybase this is rather easy because I can avoid the subquery, but as far as I know MySQL would not work this way. The same applies to MSSQL.

Any suggestion on how to improve this query?
one suggestion that might help, though i'm not sure, :-(
as far as my knowlegde is concerned, at teh lower level the query gets optimized and what i can say in your case is that this optimization is the default optimization available, so it might help a bit if you can provide your own procedures thereby asking the underlying "query manager"
to do a customised optimization. do think in this direction and let us know your views/issues etc. thank you.
lets see what other ppl/experts in this field have to say.

Last edited by harry2006; 10-29-2008 at 07:43 AM.
 
Old 10-29-2008, 08:31 AM   #3
CRC123
Member
 
Registered: Aug 2008
Distribution: opensuse, RHEL
Posts: 374
Blog Entries: 1

Rep: Reputation: 32
Hello, this is odd but it would make sense that it is executing that same sub-query 28,573 times. Here is my reasoning:
0.02s * 28,573 = 571.46s
570.46s * (1m/60s) = 9.52m
Note: 's' is seconds and 'm' is minutes

This is odd because normally databases will see that a sub-query is not dependent on any value in the outer query and will internally optimize the query so that it only runs the sub-query once. The only thing I can think of is that MySQL is seeing the same table and column names in the outer and inner queries and thinks it must run the inner query ever time because it may depend on the outer one. To test this, give specific names to the inner and outer query tables like so:
Code:
SELECT outer.name, outer.date, outer.md5 
FROM files AS outer
WHERE outer.md5 IN 
  (SELECT inner.md5 
   FROM files AS inner
   WHERE inner.md5 IS NOT NULL 
   GROUP BY inner.md5 
   HAVING count(*) > 3
  );
If the above query drastically reduces the query time, then I would speculate that what I said above is the problem.

Otherwise, we will have to play around with indexes and/or possibly find a better way to write the query.

What version of MySQL are you using?
 
Old 10-29-2008, 12:31 PM   #4
robel
Member
 
Registered: Oct 2008
Location: Norway
Distribution: Slackware
Posts: 77

Original Poster
Rep: Reputation: 19
Quote:
Originally Posted by CRC123 View Post
Hello, this is odd but it would make sense that it is executing that same sub-query 28,573 times. Here is my reasoning:
0.02s * 28,573 = 571.46s
570.46s * (1m/60s) = 9.52m
Note: 's' is seconds and 'm' is minutes
Yeah, that makes sense even though it's odd.

Code:
mysql> SELECT outer.name, outer.date, outer.md5
    -> FROM files AS outer
    -> WHERE outer.md5 IN
    ->   (SELECT inner.md5
    ->    FROM files AS inner
    ->    WHERE inner.md5 IS NOT NULL
    ->    GROUP BY inner.md5
    ->    HAVING count(*) > 3
    ->   );
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'outer
WHERE outer.md5 IN
  (SELECT inner.md5
   FROM files AS inner
   WHERE i' at line 2

Guess I have to check the manual.

Quote:
Originally Posted by CRC123 View Post
What version of MySQL are you using?
5.0.67
 
Old 10-29-2008, 01:18 PM   #5
CRC123
Member
 
Registered: Aug 2008
Distribution: opensuse, RHEL
Posts: 374
Blog Entries: 1

Rep: Reputation: 32
Quote:
Originally Posted by robel View Post
Yeah, that makes sense even though it's odd.

Code:
mysql> SELECT outer.name, outer.date, outer.md5
    -> FROM files AS outer
    -> WHERE outer.md5 IN
    ->   (SELECT inner.md5
    ->    FROM files AS inner
    ->    WHERE inner.md5 IS NOT NULL
    ->    GROUP BY inner.md5
    ->    HAVING count(*) > 3
    ->   );
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'outer
WHERE outer.md5 IN
  (SELECT inner.md5
   FROM files AS inner
   WHERE i' at line 2

Guess I have to check the manual.



5.0.67
LOL , both 'outer' and 'inner' are SQL reserved words. They are used in join syntax's. Switch those keyboards to something else. Ex:
Code:
SELECT t1.name, t1.date, t1.md5 
FROM files AS t1
WHERE t1.md5 IN 
  (SELECT t2.md5 
   FROM files AS t2
   WHERE t2.md5 IS NOT NULL 
   GROUP BY t2.md5 
   HAVING count(*) > 3
  );
Sorry about that, the above syntax should work as long as you don't use reserved words as Aliases.
 
Old 10-29-2008, 04:12 PM   #6
robel
Member
 
Registered: Oct 2008
Location: Norway
Distribution: Slackware
Posts: 77

Original Poster
Rep: Reputation: 19
Quote:
Originally Posted by CRC123 View Post
LOL , both 'outer' and 'inner' are SQL reserved words. They are used in join syntax's.
Yes, it was my first thought. Seems to work now.

That is, the syntax is ok, but still it takes long time to execute. I have waited a few minutes now. Still no results. Oh, and the mysqld process is in 100% CPU utilization. It was a nice try and a plausible explanation. I still think you're explanation is close to true, but insinuating a MySQL bug is probably not wise...

Is there an alternative query to get the same results? I was pondering about a JOIN but couldn't figure out how to do it.

Edit: Here is the result: 334 rows in set (10 min 4.24 sec)
 
Old 10-29-2008, 04:18 PM   #7
robel
Member
 
Registered: Oct 2008
Location: Norway
Distribution: Slackware
Posts: 77

Original Poster
Rep: Reputation: 19
Quote:
Originally Posted by harry2006 View Post
one suggestion that might help, though i'm not sure, :-(
as far as my knowlegde is concerned, at teh lower level the query gets optimized and what i can say in your case is that this optimization is the default optimization available, so it might help a bit if you can provide your own procedures thereby asking the underlying "query manager"
to do a customised optimization. do think in this direction and let us know your views/issues etc. thank you.
lets see what other ppl/experts in this field have to say.
Perhaps by saving the subquery in a temporary table and join the two may help? I'll give it a try... Later. Now I have to get som sleep. Local time in Norway is 22:18 and I have to get up early.
 
Old 10-29-2008, 04:44 PM   #8
robel
Member
 
Registered: Oct 2008
Location: Norway
Distribution: Slackware
Posts: 77

Original Poster
Rep: Reputation: 19
Problem solved!

Got it! Thanks to all for guiding me to the answer. I've never come up with this alone.

Code:
SELECT files.name, files.date, files.md5  FROM files
INNER JOIN  
(SELECT md5 FROM files 
 WHERE md5 IS NOT NULL
 GROUP BY md5
 HAVING count(*) > 3 ) AS x
ON files.md5=x.md5;
334 rows in set (0.02 sec)

It's really cooking with gas! (It's all electrical of course, but then it's just a saying right? http://en.wiktionary.org/wiki/cooking_with_gas)

The clue was using a JOIN in the query.
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
mysql console execute query in background kpachopoulos Programming 1 01-05-2008 09:22 AM
help optimizing this SQL statement hedpe Programming 1 05-30-2007 07:06 AM
php+mysql: how to execute a sql batch file prabhatsoni Linux - Software 7 04-19-2006 06:11 AM
Problem with SQL query using MySQL gigglesnorter Programming 2 04-22-2004 04:11 PM
ColdFusion - Using a script to build and execute SQL query when a button is clicked Locura Programming 1 02-25-2004 09:59 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 05:25 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration