LinuxQuestions.org
Latest LQ Deal: Latest LQ Deals
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 02-17-2014, 02:43 PM   #1
stateless
Member
 
Registered: Jan 2013
Distribution: Debian
Posts: 166
Blog Entries: 1

Rep: Reputation: 4
mysql optimization question for a large query comparing on a specific field


Hi. I'm not going to pretend to be a MySQL guru, and I would appreciate some guidance. I'm working with a MySQL table (MyISAM) in which the records the represent "Jobs" in our system. Each job has a jobKey primary key, a date field (for when the job was done) and a bunch of other data. There are over 700,000 records in the table.

This table is used for a lot of different things. However, for the Big Bosses I have to provide this Web-based report that needs to fetch all the jobs for any particular range of dates they select. For the typical range they will select, this will require fetching around 60,000 jobs. The probem is that the query has to go through the whole table to find everything that has an appropriate date, so the query will take like 5 seconds, even though only a fraction of the jobs are requested, and we know the range of dates we want.

I heard about "indexes", so I created a B-tree index on the date field. But this didn't help performance. (Which makes sense to me, since there are so many records and they would have to be read from multiple places in the table.)

I heard it was possible to do a (one-time) reordering the table by some field. My thought was to reorder the table once a day after the new jobs have been inserted. However, this doesn't help because apparently MySQL doesn't remember that the table has been ordered or use that to speed up the query.

My last thought was to create a separate "cache" table which contains data copied over from the Jobs table (once a day) but organized in some way by the date field. But I'm rather vague on the proper table syntax to get that organization I want, with the ideal performance improvements.
 
Old 02-17-2014, 04:24 PM   #2
Sydney
Member
 
Registered: Mar 2012
Distribution: Scientific Linux
Posts: 147

Rep: Reputation: 36
I have found that turning on slow query logs and then using the scripts at http://hackmysql.com/case1 to make indexes that are targeted to why the query is slow has solved issues like these for me in the past.

Last edited by Sydney; 02-17-2014 at 04:26 PM.
 
Old 02-17-2014, 04:32 PM   #3
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,198

Rep: Reputation: 5307Reputation: 5307Reputation: 5307Reputation: 5307Reputation: 5307Reputation: 5307Reputation: 5307Reputation: 5307Reputation: 5307Reputation: 5307Reputation: 5307
MySQL's EXPLAIN statement is your query-optimizing friend.

However, I'm really unclear as to why the obvious solution isn't helping:

Code:
CREATE INDEX date_idx ON Jobs (date);

EXPLAIN SELECT * FROM Jobs WHERE date > [start_date] and date < [end_date];

Last edited by dugan; 02-17-2014 at 04:40 PM.
 
Old 02-17-2014, 04:43 PM   #4
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,774

Rep: Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081
I think you want a "clustered index":

Quote:
Clustering alters the data block into a certain distinct order to match the index, resulting in the row data being stored in order. Therefore, only one clustered index can be created on a given database table.
 
Old 02-17-2014, 05:05 PM   #5
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}
Posts: 6,258
Blog Entries: 24

Rep: Reputation: 4193Reputation: 4193Reputation: 4193Reputation: 4193Reputation: 4193Reputation: 4193Reputation: 4193Reputation: 4193Reputation: 4193Reputation: 4193Reputation: 4193
Usually creating an index on a table that size should be sufficient, but we don't know whether there are any additional JOINs or WHERE terms in the query either.

In a more general sense you are asking how to access arbitrary subsets of the data on demand, which is more in the realm of data warehousing. In fact, you allude to that by mention of separate "cache" tables, often called facts and dimensions (although there is more to that idea).

Here is a simple way that you might do that for your specific example (impromptu, view as pseudo code!)...

Code:
CREATE TABLE doneDateDimension(doneDate INT(11) unsigned PK);
CREATE TABLE jobFact(
   jobKey (jobKey type) PK,
   doneDate INT(11) unsigned
   ... other dimension keys here...
);

CREATE INDEX 'job_done_key' ON jobFact (doneDate);

You should have a process to populate these tables periodically, search for ETL, but for simple case...

REPLACE INTO jobFact(SELECT jobKey, UNIX_TIMESTAMP(DATE(doneDate)) FROM jobs); 
/* Force date only, not time, assumes datetime type in jobs table */

REPLACE INTO doneDateDimension(SELECT DISTINCT doneDate FROM jobFact);
At this point you have instant access to jobs by date...

Code:
SELECT [job columns] 
FROM dateDimension JOIN jobFact USING(doneDate) JOIN jobs USING(jobKey)
WHERE doneDate>=UNIX_TIMESTAMP(startDate) AND doneDate<UNIX_TIMESTAMP(endDate)
ORDER BY doneDate, jobKey;
The use of integer dates as keys in the dimension greatly speeds access so that you rather instantly arrive at the subset of jobKeys, everything after that is simple retreival.

With a suitable ETL process and additional dimensions you will have instant access to jobs by various slices through the data.

Last edited by astrogeek; 02-17-2014 at 05:17 PM.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] MySQL: Comparing values in diff tables SQL Query and BASH watice Programming 4 12-24-2012 12:56 PM
Kernel Question and Ivy Bridge Optimization Query KellKheraptis Linux - General 1 06-10-2012 04:54 PM
[SOLVED] using awk and comparing a field's value oreka18 Programming 3 05-13-2012 07:04 AM
[SOLVED] Comparing two files and looking for the same line based on field - awk sopier Programming 8 12-26-2011 02:53 PM
Mysql/PHP query problem with datetime field. Pcghost Programming 2 11-11-2003 12:24 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 03:05 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration