Thoughts on MPP databases and Tableau

As the recent post on Vertica brings to light, sometimes really highly performing systems need a little configuration to perform optimally with Tableau. There’s a particular set of systems that require some extra thought and care to use with Tableau, because if you set off without any planning and expect to combine Tableau’s ease of use with the speed of these systems and end up staring at the “query executing” screen for 10 minutes, you may start to doubt everyone’s claims.

The systems I’m talking about are the Massive Parallel Processing (MPP) databases. There’s already a great explanation of them here so I’m not going to go too deeply into how they work, other than what is relevant for Tableau. Which systems that Tableau supports are MPP (don’t get too angry if I get this a bit wrong) (in no particular order):

  • Teradata
  • Vertica
  • Redshift
  • Neteeza
  • Greenplum
  • Aster (although there is some Hadoop going on in the backend)
  • ParAccel

System Configuration

The most basic description of an MPP is that that data is split up (“partitioned”) across multiple machines in a cluster. When a query comes in, some process in the cluster analyses the query, figures out how to split that query into requests to the workers, then sets them off working. When the workers are finished, their results are stitched together and sent back as a single response.

What makes this sort of system fast?

  1. Enough nodes to actually get benefits. If you only have one or two nodes, the benefits of having parallel processing may not be greater than the cost of managing the partitioning and putting it back together. Most of the “Redshift is slow” complaints we hear result from underpowered systems.
  2. Correct partitioning at the right granularity. MPPs require a lot of design choices, particularly in how you split up the data across the nodes. Granularity should match between tables that will be JOINed, and the partitioning should match. For example, if you are partitioning on YEAR, all of the data for that year should be on the same node.
  3. Smaller tables that provide dimensional information should be replicated across nodes. If nodes are constantly having to retrieve information across the network from other nodes, things will be SLOW. Network saturation from data transfer can wreck MPP performance.
  4. Views (or their equivalent) that are optimized

Tableau Configuration

The two biggest things that can improve performance are

  1. Choosing “Assume Referential Integrity” on the Data Connection. This allows Tableau to drop unnecessary JOIN clauses in many cases, which can help the query optimizer in the DB make the most efficient requests.
  2. Get rid of Custom SQL! Do your JOINs using the Tableau connection screen, then add filters and let Tableau do the rest of the work. If you definitely need Custom SQL for business logic reasons, make a View in the database then connect to it like it is a table in Tableau. When you use Custom SQL in Tableau for a live connection, Tableau uses that whole query as a subselect table in every subsequent query, which can result in overly complex queries that confuse the query planner/ optimizer in the database. Particularly with MPPs, the optimizer needs to figure things out correctly for things work at the right speed.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s