Referential Integrity, Join Culling, and Performance

Unlike some traditional BI tools, Tableau dynamically writes the simplest query it can based on what the user drags and drops into the view, versus specifying a different unique optimized query per view. It also uses simple queries that take into account the full JOIN structure like (simplified pseudo-SQL here):

TOP 1 FROM (SELECT fields FROM table t1 JOIN t2 JOIN t3 JOIN t4) t0

to pull metadata information.

This strategy tends to work well for many of the systems Tableau runs on top of, but occasionally runs into trouble if the RDBMS’s query optimizer doesn’t cull out unnecessary JOINS before trying to work out the query. I’d say particularly for systems intended to be DWs the issue does arise (Redshift performance, for example, can be immensely affected if JOINs are not culled out, particularly when users try and include every possible table in the schema).

Join Culling and Referential Integrity

 

Tableau does have a feature called Join Culling which will remove unnecessary JOINs from any query, but it only works under the following conditions:

  1. The “Assume Referential Integrity” option has been selected for the data source (you right-click on the data source to turn this on) OR The database must have primary key / foreign key relationships actually established so Tableau can know when referential integrity can be assumed. Tableau will join cull automatically if the relationships are defined in the database, but when you choose “Assume Referential Integrity” it will, well, assume they exist and act accordingly
  2. All JOINs are INNER JOINs. Only INNER JOINs can be culled. I’ve been informed that there are times when LEFT and RIGHT JOINs are culled, but from experience, your best shot at culling is using an INNER. My initial read is that “Assume Referential Integrity” only can assume on an INNER JOIN at this time.
  3. When using “Assume Referential Integrity”, the JOINs should be as close to a “pure” star schema as possible. Use only a single Fact table, with one level of Dimension tables attached directly to the Fact table.

This is the situation to ensure join culling:

cullable star

 

If any of the JOINs are LEFT OUTER JOINs, they won’t cull. So in this situation, only DimCustomer and DimShipMode will be culled, and DimProduct and DimTime will always be included in the JOIN clauses of EVERY query:

cullable inner with lefts

I have tested with customers the following situation where one dimension table that had an additional dimension table attached to it, with an INNER JOIN between the two outer tables, and in this case, join culling seemed to work just fine as well. :

 

cullable multi star

 

When they changed the join between DimCustomer and DimCustomerInfo to a LEFT JOIN, that full set of JOINs was always included:

cullable multi star no cull

So if you do need a LEFT join on a dimension but want Join Culling to still work, you need to make a View in the database of this combination, and then use that View in Tableau joins dialog so that Tableau believes it is just a table.

Major hat tip to Russell Christopher who found all this stuff out three years ago. I’m just restating it because it’s even more relevant today than ever.

“Semantic Layers”

Tableau data sources aren’t intended to be a single “semantic layer” like many other tools. Because you can have multiple data sources in a single workbook (and these can filter from one to another using dashboard actions), it is recommended that you only make a data source as complex as needed to do a particular analysis. You can always duplicate the data source and add additional tables quickly if more data is needed for an analysis. Tableau’s philosophy is get started answering your questions, and go back to add more if you need it.

Particularly if you don’t have a data warehouse set up in a full star schema with referential integrity, you’ll get much better performance by only building your data sources to be as complex as needed, rather than overbuilding them at the beginning to handle everything. This advice will even help when building extracts, because the less fields you bring in, the quicker the extract will generate.

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s