The other day I tweeted that MySQL unions are my new best friend. I just want to take that back. I just spent the past six hours (it's now 5 am) overhauling a MySQL query and removing one of the unions actually sped up performance. Dazah is really my first foray into the world of big data. I've found the hardest problem so far has been writing queries that will work regardless of the subset of the data size they are working against. For example, there are multiple bubbles and audience segments on Dazah, of all sizes and each favoring a different behavior type (some bubbles are more chatty than others), and I've found that every time I optimize a query for one dataset it fubars with a different one. Just a rant.

Remember that UNION (and its cousins INTERSECT and EXCEPT) essentially require each of the provided queries to be run separately and the results collated. In this example (using PostgreSQL, but the same applies in MySQL), we can see exactly what's happening.

I have a small table with some users and their favourite colours:

peter=# select * from users;
│ id │  name   │ favourite_colour │
│  3 │ Francis │ blue             │
│  4 │ Toby    │ blue             │
│  1 │ Joey    │ red              │
│  2 │ Dwayne  │ purple           │
(4 rows)

Time: 0.400 ms

Now, let's say we want all users with a favourite colour of blue or red; if we use a UNION the following happens:

peter=# explain select name from users where favourite_colour = 'blue' union select name from users where favourite_colour = 'red' ;
│                                QUERY PLAN                                 │
│ HashAggregate  (cost=35.33..35.39 rows=6 width=58)                        │
│   Group Key:                                                   │
│   ->  Append  (cost=0.00..35.31 rows=6 width=58)                          │
│         ->  Seq Scan on users  (cost=0.00..17.62 rows=3 width=58)         │
│               Filter: ((favourite_colour)::text = 'blue'::text)           │
│         ->  Seq Scan on users users_1  (cost=0.00..17.62 rows=3 width=58) │
│               Filter: ((favourite_colour)::text = 'red'::text)            │
(7 rows)

Time: 3.868 ms

As you can see, the query plan involves

  • two sequental scans (Seq Scan) that each perform a Filter,
  • an Append operation which is actually performing the UNION,
  • plus Group Key and HashAggregate steps, from which the resulting recordset can be returned.

Now, if we perform the same query in another fashion, we can see how much simpler the query plan is:

peter=# explain select name from users where favourite_colour in ('blue', 'red') ;
│                            QUERY PLAN                             │
│ Seq Scan on users  (cost=0.00..17.62 rows=6 width=58)             │
│   Filter: ((favourite_colour)::text = ANY ('{blue,red}'::text[])) │
(2 rows)

Time: 0.510 ms

Unions, along with the other set operations, can be incredibly useful when combining data from different sources but they are unlikely to optimise queries from a single table or relation.

Additionally, two things I noticed while typing this reply:

  • the editor isn't resizable, this makes it difficult to refer to what I've typed while formulating a response. This is more prevelant in posts with big code snippets, tables, etc.
  • there appears to be a bug that occurs if a multi-line code snippet follows a list directly; the code snippet isn't recognised until there's a line of text between them

This is very well written. It highlights some of the issues I have as well. Great article to explain how to go about cleaning up code. I've spent years doing just that but it's often unappreciated.