Defining a Tableau Data Source Programmatically

While the most preferred method for making variants of a Tableau Data Source is building the data source in Desktop then modifying the XML, there are some situations that warrant generating a full TDS file from scratch. At minimum, working through this process gives a much clearer view on what parts of the XML are responsible for various definitions and how they can be safely changed programmatically. Things like security calculations, filter variations, creating arbitrary data sources from any table, building infinite extract variations etc. should all be possible knowing how this works.

I’m implementing this all in actual code that is functioning and works right along as part of the tableau_rest_api library; I should have that available sometime soon. Read below the fold if you want to know how it all works (or at least what I’ve discovered so far):

Basic Structure of a TDS file

The Tableau Data Source file is an XML file, which means it is human readable, and it is valid, so it can be generated and read via standard XML libraries. However, it is not a standard, was added to by many people over the years, and is written and interpreted only by Tableau Desktop and Server, so it’s fairly idiosyncratic.

The root node is a datasource tag. Everything is contained within there. Actually, there is an XML doctype declaration at the top; all TDS files are in UTF-8 encoding. So the start of a TDS would look like:


<?xml version='1.0' encoding='utf-8' ?>

<datasource formatted-name='sqlserver.1che7bg0l7puol1bjjfeu02cm4az' inline='true' source-platform='win' version='9.3' xmlns:user='http://www.tableausoftware.com/xml/user'>

The formatted-name attribute can be anything you want, although Desktop-generated names always seem to just be the data source type plus a random string. I don’t know that any of the other stuff matters except for the xlmns:user declaration.

Inside the root datasource node, there is a connection tag. The first part of this is a relation section, which defines the join structure. After this comes a cols tag section, with map elements inside. These map elements define the way that Tableau refers to the field names from the database in the rest of the TDS XML. After the cols section, you may see a connection-customization section. These get created when there is a TDC file in play, but in most cases the enabled attribute is set to false, which means the whole section can be ignored.

The next section is usually the largest (although still enclosed in the connection tag): metadata-records. The best part is that it can be ignored completely in most cases — Tableau Desktop and Server both do metadata checks when a source is loaded up the first time and will write those results back into the TDS, so you can delete (or in our case, never create) that section, and Tableau will do the work for you to recreate it. You might run into issues if you have some kind of TDC modifications which turn off metadata querying, but for the use cases here, there’s no point in delving into the metadata-records too deeply.

After the metadata-records element is closed, the closing tag for the connection appears. After this comes a section that defines Column Aliases; it has an interesting structure that you can read about below. Column Aliases define all of the fun metadata in a TDS — the field names that people see, calculation definitions, hierarchies, what is a dimension vs. a measure, etc.

Next, any Data Source Filters are defined using filter elements. There is no larger collection (no filters tag), but these all line up one after another and group together.

Finally, any information about an included extract will be defined in the extract element section.

After all this, at the bottom, there may be a layout tag and a semantic-values tag. I don’t know exactly what they do yet, but they don’t seem too essential. semantic-values appears to be where the default country for mapping may be assumed; presumably it may do other things as well.

JOINs

As it stands today, the Tableau Data Source defines a set of relationships to a single “base” table, with any number of JOINs that go out from there. You can think of it as a standard SELECT statement in SQL; a FROM statement followed by any number of INNER, LEFT, or RIGHT JOINs with any number of ON clauses.

Complex joins

Here’s what the XML looks like for this


<connection authentication='sspi' class='sqlserver' dbname='Superstore Star Schema' odbc-native-protocol='yes' one-time-sql='' server='demo-server'>
<relation join='inner' type='join'>
<clause type='join'>
<expression op='='>
<expression op='[DimCustomer2].[IDCustomer]' />
<expression op='[DimCustomer3].[IDCustomer]' />
</expression>
</clause>
<relation join='inner' type='join'>
<clause type='join'>
<expression op='='>
<expression op='[Orders].[IDProduct]' />
<expression op='[DimProduct].[IDProduct]' />
</expression>
</clause>
<relation join='left' type='join'>
<clause type='join'>
<expression op='='>
<expression op='[DimCustomer].[IDCustomer]' />
<expression op='[DimCustomer2].[IDCustomer]' />
</expression>
</clause>
<relation join='inner' type='join'>
<clause type='join'>
<expression op='='>
<expression op='[DimCustomer].[IDCustomer]' />
<expression op='[DimCustomer1].[IDCustomer]' />
</expression>
</clause>
<relation join='inner' type='join'>
<clause type='join'>
<expression op='='>
<expression op='[Orders].[IDCustomer]' />
<expression op='[DimCustomer].[IDCustomer]' />
</expression>
</clause>
<relation name='Orders' table='[dbo].[OrderFact]' type='table' />
<relation name='DimCustomer' table='[dbo].[DimCustomer]' type='table' />
</relation>
<relation name='DimCustomer1' table='[dbo].[DimCustomer]' type='table' />
</relation>
<relation name='DimCustomer2' table='[dbo].[DimCustomer]' type='table' />
</relation>
<relation name='DimProduct' table='[dbo].[DimProduct]' type='table' />
</relation>
<relation name='DimCustomer3' table='[dbo].[DimCustomer]' type='table' />
</relation>

How to describe this? Each additional JOIN clause ends up becoming a relation which “wraps around” the existing relation elements. What is interesting about this is that the first table, the FROM table, ends up in the middle of the whole thing along with the first table that is JOINed. When I explained this to my co-worker Matt, he said “It’s the middle-out algorithm!” So the secret is out, Tableau is actually Pied Piper. The one thing that is does require, though, is knowing ALL of the tables to be JOINed and their ON clauses  prior to trying to build out this section of the XML. Also, you’ll notice that the name attribute of the relation tag when there is an type=’table’ attribute is what is used in the expression tags; this is what allows the same table to be utilized multiple times.

If you think about a JOIN statement like:


FROM some_table a

LEFT JOIN other_table b ON b.id_field = a.id

The name attribute in the XML is equivalent to the alias in the JOIN clause.

Initial SQL

Initial SQL (with Parameters since 9.3) is stored within the connection element using the one-time-sql attribute.


<connection authentication='sspi' class='sqlserver' dbname='Superstore Star Schema' odbc-native-protocol='yes' one-time-sql='IF @@rowcount = 0
BEGIN
SELECT 1 FROM [Orders]
END' server='demo-dbs'>

Custom SQL

Custom SQL is represented like any other relation, except that the type attribute is set to text, and then the Custom SQL query is in the text portion of the tag:


<relation name="Cussed Em Sequel" type="text">
SELECT [OrderFact].[OrderID] AS [OrderID],
[OrderFact].[IDProduct] AS [IDProduct],
[OrderFact].[IDShipMode] AS [IDShipMode],
[OrderFact].[IDCustomer] AS [IDCustomer],
[OrderFact].[IDOrderPriority] AS [IDOrderPriority],
[OrderFact].[OrderDate] AS [OrderDate],
[OrderFact].[ShipDate] AS [ShipDate],
[OrderFact].[OrderQuantity] AS [OrderQuantity],
[OrderFact].[Sales] AS [Sales],
[OrderFact].[Discount] AS [Discount],
[OrderFact].[Profit] AS [Profit],
[OrderFact].[UnitPrice] AS [UnitPrice],
[OrderFact].[ShippingCost] AS [ShippingCost],
[OrderFact].[ProductBaseMargin] AS [ProductBaseMargin],
[DimCustomer].[IDCustomer] AS [IDCustomer (DimCustomer)],
[DimCustomer].[CustomerName] AS [CustomerName],
[DimCustomer].[State] AS [State],
[DimCustomer].[ZipCode] AS [ZipCode],
[DimCustomer].[Region] AS [Region],
[DimCustomer].[CustomerSegment] AS [CustomerSegment]
FROM [dbo].[OrderFact] [OrderFact]
INNER JOIN [dbo].[DimCustomer] [DimCustomer] ON ([OrderFact].[IDCustomer] = [DimCustomer].[IDCustomer])
</relation>

Column Mapping

After all the stuff in the relation section, there is a cols tag which defines a mapping between the fields in the various tables and the rest of the data source file. These are non-negotiable; if you don’t set them up the way that Tableau prefers, it will rewrite this section and any attempt to build aliases later will fail. The pattern is as follows:

  • If the field name is unique to a single table, it will be referenced exactly as that column name in the database, with square brackets around it (MS SQL Server notation). Even if you are using a PostgresSQL database, if the field is “mY aWESOme fiELD” in PostgreSQL, it will be represented as “[mY aWESOme fiELD]” in the key attribute.
  • If the field name is used in multiple tables, then the field from the FIRST table that was joined will be exactly the same as a unique field. The other fields will be named with the pattern: [column_name (Table Alias)]
  • The value attribute is on the pattern: [Table Alias].[column_name]

<cols>

<!-- This set shows how the CustomerName field is represented despite being in many tables -->
<map key='[CustomerName (DimCustomer1)]' value='[DimCustomer1].[CustomerName]' />
<map key='[CustomerName (DimCustomer2)]' value='[DimCustomer2].[CustomerName]' />
<map key='[CustomerName (DimCustomer3)]' value='[DimCustomer3].[CustomerName]' />
<map key='[CustomerName]' value='[DimCustomer].[CustomerName]' />

<!-- Discount is unique to the Orders table, only has one entry with no tablea alias reference -->
<map key='[Discount]' value='[Orders].[Discount]' />

<!-- Notice on this set that IDCustomer exists in Orders, so you get a (DimCustomer) whereas that is the first table in the first set, and it doesn't have the additional identifier -->
<map key='[IDCustomer (DimCustomer)]' value='[DimCustomer].[IDCustomer]' />
<map key='[IDCustomer (DimCustomer1)]' value='[DimCustomer1].[IDCustomer]' />
<map key='[IDCustomer (DimCustomer2)]' value='[DimCustomer2].[IDCustomer]' />
<map key='[IDCustomer (DimCustomer3)]' value='[DimCustomer3].[IDCustomer]' />
<map key='[IDCustomer]' value='[Orders].[IDCustomer]' />
<map key='[IDOrderPriority]' value='[Orders].[IDOrderPriority]' />
<map key='[IDProduct (DimProduct)]' value='[DimProduct].[IDProduct]' />
<map key='[IDProduct]' value='[Orders].[IDProduct]' />

</cols>

Column Aliases

The important metadata information about columns, such as aliases, default aggregations, whether they are measures or dimensions, or discrete vs. continuous, are all actually assigned in via column tags, which always follow an aliases tag with the the attribute enabled=’yes’. That is to say, the column tags are not members of the alias tag; instead each one is discrete, but they do follow the alias tag every time as though they were a section.


<aliases enabled='yes' />
<column caption='Best Calc Ever' datatype='integer' name='[Calculation_6974578775546734302]' role='dimension' type='ordinal'>
<calculation class='tableau' formula='IIF([salesperson_user_id]=USERNAME() AND [segment] = &amp;quot;Consumer&amp;quot;,1,0)' >
</column>
<column datatype='string' name='[city]' role='dimension' semantic-role='[City].[Name]' type='nominal' />
<column datatype='string' name='[country]' role='dimension' semantic-role='[Country].[ISO3166_2]' type='nominal' />
<column datatype='string' name='[region]' role='dimension' type='nominal' />
<column datatype='integer' name='[row_id]' role='measure' type='quantitative' />
<column datatype='string' name='[state]' role='dimension' semantic-role='[State].[Name]' type='nominal' />

You’ll notice that there is a calculation field, which we will talk about in the next section. Let’s look at the basic column tag for a field that exists in the database.

  • The name attribute must reference the key attribute of one of the map elements in the cols element.
  • The role attribute defines whether the field shows up in the Dimensions or Measure pane. It only takes ‘measure’ or ‘dimension’ as values
  • The type attribute can take three values: If a field is continuous, the type will be ‘quantitative’. If the field is a Discrete Dimension with string values, it will be ‘nominal’; if it is Discrete Dimension with numeric values, it will be ‘ordinal’.
  • The caption attribute is the “pretty name” that will be visible when the data source is loaded in Desktop or Server Web Edit.
  • The datatype attribute is the type of that field, or at least how Tableau should be treating it. These can be ‘string’, ‘integer’, ‘datetime’, ‘date’, ‘real’, ‘boolean’
  • The semantic-role attribute appears to be related to Geographic Roles. I haven’t experimented with it enough to understand the full range of values

Column Instances

 

Calculations

Calculations are defined as column aliases, with an additional calculation tag inside. They are assigned an internal name which follows the pattern “Calculation_” + a 19-digit random number. If you make any Data Source Filters, you have to use this automatically generated name as the reference to this column. The calculation is simply plain text in the formula attribute, however any single or double quote in the calculation definition must be encoded:


<column caption='Best Calc Ever' datatype='integer' name='[Calculation_6974578775546734302]' role='dimension' type='ordinal'>
<calculation class='tableau' formula='IIF([salesperson_user_id]=USERNAME() AND [segment] = &amp;quot;Consumer&amp;quot;,1,0)' />
</column>

Hierarchies

Hierarchies are defined in a drill-paths node which sits in the same level as the column tags. They appear directly after the last column tag. Within the drill-paths, each drill-path tag takes a name attribute. Then within that, a set of field tags defines the order of the hierarchy. The actual node text value will be the value from the column tag’s name attribute.


<drill-paths>
<drill-path name='Products'>
<field>[ProductCategory]</field>
<field>[ProductSubCategory]</field>
<field>[ProductName]</field>
</drill-path>
</drill-paths>

Data Source Filters

After all of that, any data source filters that exist appear as filter elements. There are two types, categorical (discrete) and quantitative (continuous).

Categorical

Categorical filters list out a set of members. For example, this keeps filter keeps two of the values in the CustomName column (again here, this is the key attribute from the map element in the cols element. Categorical filter elements have some number of internal groupfilter elements, which change depending on whether there is a single selected value, multiple, or if it is set to be an Exclude filter. Here are some examples:


<!-- Single include, all values are collapsed into a single groupfilter element -->

<filter class='categorical' column='[category]' filter-group='2'>
<groupfilter function='member' level='[category]' member='&quot;Furniture&quot;' user:ui-domain='database' user:ui-enumeration='inclusive' user:ui-marker='enumerate' />
</filter>

<!-- Include of multiple values; there is an outer groupfilter element with a function='union', with internal function='member' -->

<filter class='categorical' column='[CustomerName]' filter-group='2'>
<groupfilter function='union' user:ui-domain='database' user:ui-enumeration='inclusive' user:ui-marker='enumerate'>
<groupfilter function='member' level='[CustomerName]' member='&quot;Aaron Bergman&quot;' />
<groupfilter function='member' level='[CustomerName]' member='&quot;Aaron Hawkins&quot;' />
</groupfilter>
</filter>

<!-- Exclude of multiple values. Notice the additional interior function='union' -->

<filter class='categorical' column='[region]' filter-group='2'>
<groupfilter function='except' user:ui-domain='database' user:ui-enumeration='exclusive' user:ui-marker='enumerate'>
<groupfilter function='level-members' level='[region]' />
<groupfilter function='union'>
<groupfilter function='member' level='[region]' member='&quot;South&quot;' />
<groupfilter function='member' level='[region]' member='&quot;West&quot;' />
</groupfilter>
</groupfilter>
</filter>

One thing to notice is that the level attribute is included in the groupfilter elements with  function=member, matching the column attribute from the filter class. My understanding is this is probably different when connected to a cube data source (it might also be part of the hierarchies functionality, but that doesn’t seem to make too much sense).

Quantitative

Quantitative filters are those called “Continuous” in the Tableau Desktop UI. They includes simple range filters on numeric fields, as well as range filters on time fields. An interesting quirk is that there MUST be a column element in the alias section defined for a given field that you want to filter on, or Tableau will throw up an error.


<column name="[profit]" datatype="real" caption="Profit" role="measure" type="quantitative"/>

<filter class="quantitative" column="[profit]" filter-group="2" include-values="in-range">
<min>20</min>
</filter>

Continuous Date Filters

Continuous filters on date fields have some additional requirements:

  • The column name follows a pattern like this: [none:{field_name}:qk]
  • There must be a column-instance that matches the column name pattern as specified above
  • Tableau dates are written in the SQL Server style, wrapped in #: #2012-01-03#

<column-instance column="[order_date]" derivation="None" name="[none:order_date:qk]" pivot="key" type="quantitative"/>

<filter class="quantitative" column="[none:order_date:qk]" filter-group="2" include-values="in-range">
<min>#2012-03-01#</min>
<max>#2014-04-23#</max>
</filter>

Relative Date Filters

Relative date filters are their own filter class, with different attributes that map to the dialog box. They also require a column-instance, but one that exactly matches the same type used for Continuous Date Filters.


<!-- Last 2 quarters -->

<filter class='relative-date' column='[none:order_date:qk]' filter-group='2' first-period='-2' include-future='true' include-null='false' last-period='0' period-type='quarter' />

<!-- Quarter to Date -->

<filter class='relative-date' column='[none:order_date:qk]' filter-group='2' first-period='0' include-future='false' include-null='false' last-period='0' period-type='quarter' />

There are four functional attributes that give all the combinations:

  • first-period: This is a integer relative to today, which is represented as 0. In the upper example, -2 is two years back from today.
  • last-period: This is a integer relative to today, represented as 0. In the first example, which is a “Last 2 Quarters”, this is set to 0. In the quarter to date example, both first-period and last -period are 0.
  • period-type: This is one of the valid timeparts that Tableau accepts (all lowercase)
  • include-future: When this is ‘true’, the filter acts as a To Date filter; when it is ‘false’, the filter covers the whole of a period specified (so always whole months, quarters, etc.)

Extracts

Did you know that a Tableau Data Extract file (a .tde) has no concept of what generated it, and thus by itself cannot be updated? That is why when you connect to a random TDE, there is no option to refresh the extract, or option to revert to a live connection. When you save an extract as a .TDSX file, or publish it to Tableau Server, the TDS is saved/sent along, and that is what controls any refreshing of the extract.

This leads us to an interesting conclusion: It doesn’t matter what TDE you have in a packaged datasource; if you replace the TDS and refresh it, the TDE will be built off the TDS definition, regardless of the source of the original TDE. It could be a blank extract with zero rows and still work. There’s long been a need to publish ’empty’ extracts; this provides a mechanism to do that without any tricks inside Desktop.

The extract that is contained within any TDSX file is described in the TDS:


<extract count="-1" enabled="true" units="records">
<connection class="dataengine" dbname="Data/Datasources/extract.tde" schema="Extract" tablename="Extract" update-time="06/28/2016 11:36:15 AM">
<relation name="Extract" table="[Extract].[Extract]" type="table"/>
<calculations>
<calculation column="[Number of Records]" formula="1"/>
</calculations>
<refresh increment-key="" incremental-updates="false"/>
</connection>

</extract>

Other than the name of the TDE file, nothing ever needs to be changed here for this to work when added to any TDS file inside a TDSX (obviously there is a refresh tag which can help define incremental updates if necessary). The TDE file in a TDSX (which is just a ZIP file) is stored in the Data/Datasources/ directory in the ZIP file. The TDS file is in the base directory, but it must point to the correct file name of the TDE. But it’s not like the name there needs to be anything specific.

Filtering Extracts

Another use case, perhaps even more common, would be to generate individually filtered TDEs, one per site. Imagine you want to create one extract per customer, and each customer has a site. The site_id you know at publish time; and in your data, you have a column with the various customer_ids / site_ids. If you could set the value of a filter in the TDS at publish time, you could put real data security on programmatically, and also dramatically decrease the size of your extracts. In fact, the filters on a extract are set exactly the same was as for a data source filter:


<extract count="-1" enabled="true" units="records">
<connection class="dataengine" dbname="Data/Datasources/extract.tde" schema="Extract" tablename="Extract" update-time="06/28/2016 11:36:15 AM">
<relation name="Extract" table="[Extract].[Extract]" type="table"/>
<calculations>
<calculation column="[Number of Records]" formula="1"/>
</calculations>
<refresh increment-key="" incremental-updates="false"/>
</connection>
<filter class="categorical" column="[region]">
<groupfilter user:ui-domain="database" user:ui-enumeration="inclusive" user:ui-marker="enumerate" function="union">
<groupfilter function="member" level="[region]" member="&quot;East&quot;"/>
<groupfilter function="member" level="[region]" member="&quot;West&quot;"/>
</groupfilter>
</filter>
</extract>

However there is one wrinkle in this plan: When Desktop or Server opens up a TDE, it does check for the existence of fields with the same name as the columns that are specified in the filter tags. So you can’t have a single TDE file that you use for everything; if you want to put filters on, you need a TDE that at least includes the fields that are mentioned in the filters. Luckily, those are the ONLY fields that need to exist.

Solution: The Tableau Extract API, part of the existing Tableau SDK. You can define an extract with any set of fields programmatically using the Extract API. You don’t have to insert any data; you simply define an extract with the fields that you are going to filter on, save that file, then include it in the TDSX file you are making. When the refresh is triggered, the filters will maintain (because the fields they were on existed in the original); they are taken into account during extract generation, so you are left with a fully built extract that matches the TDS and then those fields continue to exist going forward.

 

Advertisements

2 comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s