mysql optimization question for a large query comparing on a specific field
Hi. I'm not going to pretend to be a MySQL guru, and I would appreciate some guidance. I'm working with a MySQL table (MyISAM) in which the records the represent "Jobs" in our system. Each job has a jobKey primary key, a date field (for when the job was done) and a bunch of other data. There are over 700,000 records in the table.
This table is used for a lot of different things. However, for the Big Bosses I have to provide this Web-based report that needs to fetch all the jobs for any particular range of dates they select. For the typical range they will select, this will require fetching around 60,000 jobs. The probem is that the query has to go through the whole table to find everything that has an appropriate date, so the query will take like 5 seconds, even though only a fraction of the jobs are requested, and we know the range of dates we want. I heard about "indexes", so I created a B-tree index on the date field. But this didn't help performance. (Which makes sense to me, since there are so many records and they would have to be read from multiple places in the table.) I heard it was possible to do a (one-time) reordering the table by some field. My thought was to reorder the table once a day after the new jobs have been inserted. However, this doesn't help because apparently MySQL doesn't remember that the table has been ordered or use that to speed up the query. My last thought was to create a separate "cache" table which contains data copied over from the Jobs table (once a day) but organized in some way by the date field. But I'm rather vague on the proper table syntax to get that organization I want, with the ideal performance improvements. |
I have found that turning on slow query logs and then using the scripts at http://hackmysql.com/case1 to make indexes that are targeted to why the query is slow has solved issues like these for me in the past.
|
MySQL's EXPLAIN statement is your query-optimizing friend.
However, I'm really unclear as to why the obvious solution isn't helping: Code:
CREATE INDEX date_idx ON Jobs (date); |
I think you want a "clustered index":
Quote:
|
Usually creating an index on a table that size should be sufficient, but we don't know whether there are any additional JOINs or WHERE terms in the query either.
In a more general sense you are asking how to access arbitrary subsets of the data on demand, which is more in the realm of data warehousing. In fact, you allude to that by mention of separate "cache" tables, often called facts and dimensions (although there is more to that idea). Here is a simple way that you might do that for your specific example (impromptu, view as pseudo code!)... Code:
CREATE TABLE doneDateDimension(doneDate INT(11) unsigned PK); Code:
SELECT [job columns] With a suitable ETL process and additional dimensions you will have instant access to jobs by various slices through the data. |
All times are GMT -5. The time now is 05:33 AM. |