Partitioning with Dates in MySQL 5.1
Although I am still busy with writing my thesis paper, I do need a distraction from time to time. Since I started getting phpOpenTracker questions again recently, I decided to (slowly) start working on phpOpenTracker 2.0 again. phpOpenTracker 2.0 will be a complete rewrite, utilizing the features of PHP 5.2 and MySQL 5.1.
The biggest problem with phpOpenTracker 1.x is that its database schema is normalized. Even simple phpOpenTracker API calls result in (possibly multiple) SQL queries that involve multiple tables. These (more often than not rather complicated) JOINs degrade in performance the more data is stored in the database. The database schema for phpOpenTracker 2.0 is denormalized and only consists of two tables (current draft).
While the denormalization should eliminate the JOIN-related performance problems, temporal partitioning of the data can further increase the performance. The backend for MySQL in phpOpenTracker 1.x optionally uses MyISAM Merge Tables for this: Rows for table X that are INSERTed on day D of month M of year Y are not INSERTed into table X directly, but into the MyISAM table X_yyyymmdd. Table X_yyyymm is a MyISAM Merge Table that aggregates all X_yyyymmdd tables of that month, X_yyyy is a MyISAM Merge Table that aggregates all X_yyyymmdd tables of that year. X finally aggregates all X_yyyymmdd tables. SELECT queries that operate on this table X and contain BETWEEN expressions that operate on the timestamp column of the table are transparently rewritten to use the smallest table possible (X_yyyymmdd, X_yyyymm, X_yyyy, X), thus reducing the amount of data that needs to be looked at to process the query.
The Partitioning feature of MySQL 5.1 should make this application-level partitioning with dates obsolete.
Another issue I have with the partitioning feature in MySQL 5.1 is the rather verbose syntax for defining such a partition:
The biggest problem with phpOpenTracker 1.x is that its database schema is normalized. Even simple phpOpenTracker API calls result in (possibly multiple) SQL queries that involve multiple tables. These (more often than not rather complicated) JOINs degrade in performance the more data is stored in the database. The database schema for phpOpenTracker 2.0 is denormalized and only consists of two tables (current draft).
While the denormalization should eliminate the JOIN-related performance problems, temporal partitioning of the data can further increase the performance. The backend for MySQL in phpOpenTracker 1.x optionally uses MyISAM Merge Tables for this: Rows for table X that are INSERTed on day D of month M of year Y are not INSERTed into table X directly, but into the MyISAM table X_yyyymmdd. Table X_yyyymm is a MyISAM Merge Table that aggregates all X_yyyymmdd tables of that month, X_yyyy is a MyISAM Merge Table that aggregates all X_yyyymmdd tables of that year. X finally aggregates all X_yyyymmdd tables. SELECT queries that operate on this table X and contain BETWEEN expressions that operate on the timestamp column of the table are transparently rewritten to use the smallest table possible (X_yyyymmdd, X_yyyymm, X_yyyy, X), thus reducing the amount of data that needs to be looked at to process the query.
The Partitioning feature of MySQL 5.1 should make this application-level partitioning with dates obsolete.
- [Partitioning] enables distributing portions of individual tables across a filesystem, according to rules which can be set when the table is created. In effect, different portions of a table are stored as separate tables in different locations, but from the user point of view, the partitioned table is still a single table. [...] Queries against partitioned tables can take advantage of partition pruning. In some cases, this can result in query execution that is an order of magnitude faster than the same query against a non-partitioned version of the same table.
Another issue I have with the partitioning feature in MySQL 5.1 is the rather verbose syntax for defining such a partition:
mysql> CREATE TABLE part_date3 -> ( c1 int default NULL, -> c2 varchar(30) default NULL, -> c3 date default NULL) engine=myisam -> partition by range (to_days(c3)) -> (PARTITION p0 VALUES LESS THAN (to_days('1995-01-01')), -> PARTITION p1 VALUES LESS THAN (to_days('1996-01-01')) , -> PARTITION p2 VALUES LESS THAN (to_days('1997-01-01')) , -> PARTITION p3 VALUES LESS THAN (to_days('1998-01-01')) , -> PARTITION p4 VALUES LESS THAN (to_days('1999-01-01')) , -> PARTITION p5 VALUES LESS THAN (to_days('2000-01-01')) , -> PARTITION p6 VALUES LESS THAN (to_days('2001-01-01')) , -> PARTITION p7 VALUES LESS THAN (to_days('2002-01-01')) , -> PARTITION p8 VALUES LESS THAN (to_days('2003-01-01')) , -> PARTITION p9 VALUES LESS THAN (to_days('2004-01-01')) , -> PARTITION p10 VALUES LESS THAN (to_days('2010-01-01')), -> PARTITION p11 VALUES LESS THAN MAXVALUE ); Query OK, 0 rows affected (0.00 sec)
31/01/2007 at 16:59 Permalink
--------------------------------------------------------------------
I'm not following how that 1024 limitation translates into less than a year... looks like your calculation implies it would cover 2.8 years. I'm obviously missing something, which is why I'm asking.
On another note, I'd have to have really hard-hit table, getting lots of inserts every day, to need to partition it out on a daily level. It seems to me this partitioning just saves time on how big a full table scan ultimately is, based on how many partitions are lumped together to make the "full table". I'd lean more towards setting a partition scheme based on how long on average it took my table to get several hundred new rows.
Then again, I'm used to Oracle rather than MySQL, so maybe full table scans over a few hundred rows are more of a performance hit than I'm used to?
Reply
02/02/2007 at 16:57 Permalink
Oracle optimization is often quite different than MySQL optimization. Although this temporal partitioning makes some issues similar.
I agree that for Oracle it would be better to know how many rows are inserted over a period of time before decided how to split up the data. I haven't tried this feature in MySQL.
Reply
01/02/2007 at 16:52 Permalink
Reply
01/02/2007 at 16:55 Permalink
Reply
01/02/2007 at 17:14 Permalink
Reply
06/02/2007 at 03:02 Permalink
Reply
25/02/2007 at 21:13 Permalink
By the way: Since we've talked last time I've hade some thoughts about how to implement a well performing logging engine. Sebastian Nohn suggested to log at first into files and then store that data into the database, which is the best solution I think.
Nevertheless this doesn't solve your reading problem, therefore a denormalized schema is in fact the correct approach, however I think yours isn't the best yet. I'm going to test a different version, which I could mail you afterwards if you want.
Reply
22/11/2010 at 18:19 Permalink
http://www.tritux.com/blog/2010/11/19/partitioning-mysql-database-with-high-load-solutions/11/1
It demonstrate how partitioning can be a good solution for scalable architectures.
Reply