LinuxQuestions.org - mysql optimization question for a large query comparing on a specific field

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - mysql optimization question for a large query comparing on a specific field (https://www.linuxquestions.org/questions/programming-9/mysql-optimization-question-for-a-large-query-comparing-on-a-specific-field-4175495267/)

mysql optimization question for a large query comparing on a specific field

Hi. I'm not going to pretend to be a MySQL guru, and I would appreciate some guidance. I'm working with a MySQL table (MyISAM) in which the records the represent "Jobs" in our system. Each job has a jobKey primary key, a date field (for when the job was done) and a bunch of other data. There are over 700,000 records in the table.

This table is used for a lot of different things. However, for the Big Bosses I have to provide this Web-based report that needs to fetch all the jobs for any particular range of dates they select. For the typical range they will select, this will require fetching around 60,000 jobs. The probem is that the query has to go through the whole table to find everything that has an appropriate date, so the query will take like 5 seconds, even though only a fraction of the jobs are requested, and we know the range of dates we want.

I heard about "indexes", so I created a B-tree index on the date field. But this didn't help performance. (Which makes sense to me, since there are so many records and they would have to be read from multiple places in the table.)

I heard it was possible to do a (one-time) reordering the table by some field. My thought was to reorder the table once a day after the new jobs have been inserted. However, this doesn't help because apparently MySQL doesn't remember that the table has been ordered or use that to speed up the query.

My last thought was to create a separate "cache" table which contains data copied over from the Jobs table (once a day) but organized in some way by the date field. But I'm rather vague on the proper table syntax to get that organization I want, with the ideal performance improvements.

I have found that turning on slow query logs and then using the scripts at http://hackmysql.com/case1 to make indexes that are targeted to why the query is slow has solved issues like these for me in the past.

MySQL's EXPLAIN statement is your query-optimizing friend.

However, I'm really unclear as to why the obvious solution isn't helping:

Code:

CREATE INDEX date_idx ON Jobs (date);



EXPLAIN SELECT * FROM Jobs WHERE date > [start_date] and date < [end_date];

I think you want a "clustered index":

Quote:

Clustering alters the data block into a certain distinct order to match the index, resulting in the row data being stored in order. Therefore, only one clustered index can be created on a given database table.

Usually creating an index on a table that size should be sufficient, but we don't know whether there are any additional JOINs or WHERE terms in the query either.

In a more general sense you are asking how to access arbitrary subsets of the data on demand, which is more in the realm of data warehousing. In fact, you allude to that by mention of separate "cache" tables, often called facts and dimensions (although there is more to that idea).

Here is a simple way that you might do that for your specific example (impromptu, view as pseudo code!)...

Code:

CREATE TABLE doneDateDimension(doneDate INT(11) unsigned PK);

CREATE TABLE jobFact(

  jobKey (jobKey type) PK,

  doneDate INT(11) unsigned

  ... other dimension keys here...

);



CREATE INDEX 'job_done_key' ON jobFact (doneDate);



You should have a process to populate these tables periodically, search for ETL, but for simple case...



REPLACE INTO jobFact(SELECT jobKey, UNIX_TIMESTAMP(DATE(doneDate)) FROM jobs); 

/* Force date only, not time, assumes datetime type in jobs table */



REPLACE INTO doneDateDimension(SELECT DISTINCT doneDate FROM jobFact);

At this point you have instant access to jobs by date...

Code:

SELECT [job columns] 

FROM dateDimension JOIN jobFact USING(doneDate) JOIN jobs USING(jobKey)

WHERE doneDate>=UNIX_TIMESTAMP(startDate) AND doneDate<UNIX_TIMESTAMP(endDate)

ORDER BY doneDate, jobKey;

The use of integer dates as keys in the dimension greatly speeds access so that you rather instantly arrive at the subset of jobKeys, everything after that is simple retreival.

With a suitable ETL process and additional dimensions you will have instant access to jobs by various slices through the data.