The Tableau Extract API 2.0 is an amazingly powerful tool for building out Extracts that, for whatever reason, cannot be built or maintained using the standard Tableau Server extract refresh process. The output of the Extract API 2.0 is a Hyper file (just as the older Extract API pushed out TDE files). You can publish a Hyper file directly to a Tableau Server, but there are several drawbacks:
- Tableau Server will build out an automatic TDS file, taking a rough guess at any type of metadata categorization (Measure vs. Dimensions, Hierarchies, Geographic info, etc.)
- The only use for this data source will be creating Ad Hoc reports using Web Edit (or hoping someone in Desktop now knows that it exists). You can’t easily integrate it into an existing Workbook.
What is missing is a TDS file to pair up with the Hyper file, describing the exact metadata that you want to go along with the Extracted data. In this article, I’ll describe two workflows that result in a fully controlled TDSX file with a newly generated Hyper file.
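As a rough illustration of the pairing step, here is a minimal sketch that zips a TDS file and a Hyper file into a single TDSX. It assumes that a TDSX is an ordinary ZIP archive with the TDS at the root and the extract under `Data/Extracts/`, and that the TDS already points at that relative path. The function name `package_tdsx` and the file names are hypothetical, and this is not necessarily either of the two workflows described in the article.

```python
import zipfile
from pathlib import Path


def package_tdsx(tds_path, hyper_path, tdsx_path, extract_dir="Data/Extracts"):
    """Bundle a TDS and a Hyper file into one TDSX (a ZIP archive).

    Assumption: the TDS file already references the extract at
    '<extract_dir>/<hyper file name>' relative to the archive root.
    """
    tds_path, hyper_path = Path(tds_path), Path(hyper_path)
    with zipfile.ZipFile(tdsx_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # The metadata file sits at the root of the archive.
        archive.write(tds_path, arcname=tds_path.name)
        # The extract sits in the sub-folder the TDS is assumed to point at.
        archive.write(hyper_path, arcname=f"{extract_dir}/{hyper_path.name}")
    return Path(tdsx_path)
```

Calling `package_tdsx("orders.tds", "orders.hyper", "orders.tdsx")` would produce an archive holding both files, ready to be opened in Desktop or published.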
One of the least mentioned but most useful APIs in Tableau is the Extract API, which allows you to programmatically create an Extract file (Hyper files starting in 10.5, previously TDE files). The main use case is for data sources that require programmatic access (as opposed to using one of the native connectors in Tableau). Some situations where this would be useful:
- Data coming from a Web Service/ RESTful API with an object response
- ODBC / JDBC drivers that Tableau cannot use
- Additional programmatic modeling / statistical analysis against a whole data set
This post focuses mostly on the first use case, where you are trying to make data available from some type of Web Service / RESTful API. In particular, if you need to provide only a subset from a very flexible set of possible fields for “ad hoc” analysis, this technique is the most functional solution to the problem.
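To make the "subset of fields" idea concrete, here is a minimal sketch of the step that would sit in front of any extract-writing code: projecting an object-style response onto only the fields a user picked. The response shape, the field names, and the `select_fields` helper are all hypothetical, and no Extract API calls are shown.

```python
def select_fields(records, fields):
    """Project each record onto the requested fields, in the given order.

    Fields missing from a record become None so every row has the same width.
    """
    return [tuple(record.get(name) for name in fields) for record in records]


# Hypothetical web service response: a list of objects with varying keys.
response = [
    {"id": 1, "region": "East", "sales": 120.5, "notes": "n/a"},
    {"id": 2, "region": "West", "sales": 98.0},
]

# Only the fields chosen for this particular extract.
rows = select_fields(response, ["id", "sales"])
```

Each tuple in `rows` would then be handed to whatever code inserts rows into the extract table.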
When should I build a Flexible Extract Generator?
If all of the following are true:
- You know the structure of your web service responses
- The total number of fields is reasonably small
- The web service responses will not change frequently
- Workbooks are fully built out and will not allow web editing
- The Data Source structure can be reused across multiple reports (and possibly customers)
then the better solution for Web Service/REST API based data sources is “Live” Web Services Connections in Tableau.
If instead you want to provide a selection screen to generate an Extract that will power a Web Edit session, then it makes sense to build a Flexible Extract Generator process. This is particularly useful when the set of fields could change drastically from extract to extract, or if other processing (such as machine learning) needs to be applied based on differing parameters prior to its use by the end user (that said, if the actual output columns are consistent, the “Live” Web Services solution could still work).
The currently available Beta 1 of Tableau 2018.3 includes a long-requested feature for creating multiple table Hyper extracts — that is to say, each table you see in the connection pane will be brought in and stored as separate tables in a single Hyper extract file. Why is this so exciting? Because it’s the end of the need for Defusing Row Level Security in Tableau Data Extracts (Before They Blow Up) Part 1 (and Part 2)!
Starting in 2018.3
- The design for row level security will be the same in both live connections and extracts
- Extract files with security will be created much faster
- Best practices for entitlements tables are now feasible in Extracts
Let’s dig into the essentials and how we can make this work for effective Row Level Security.
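The entitlements pattern can be sketched in plain SQL, outside of Tableau, to show why keeping the tables separate matters: the data table stays at its natural size, and the user filter is applied through a join at query time. The table names, column names, and users below are hypothetical, and SQLite stands in for the extract engine purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, amount REAL);
CREATE TABLE entitlements (username TEXT, region TEXT);
INSERT INTO sales VALUES ('East', 100), ('West', 200), ('North', 50);
INSERT INTO entitlements VALUES
    ('alice', 'East'), ('alice', 'West'), ('bob', 'North');
""")


def rows_for_user(username):
    """Return only the sales rows the given user is entitled to see."""
    query = """
        SELECT s.region, s.amount
        FROM sales AS s
        JOIN entitlements AS e ON e.region = s.region
        WHERE e.username = ?
        ORDER BY s.region
    """
    return conn.execute(query, (username,)).fetchall()
```

Because the two tables are stored separately, the sales rows are not duplicated once per entitled user, which is what made single-table extracts with security so large.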
In this post, I’ll be describing a set of steps to follow to isolate the causes of performance issues on Tableau Server.
Here are the basic steps:
1. Test the workbook in Tableau Desktop. Does it perform well? If yes:
2. Test the workbook in Tableau Desktop on the Tableau Server machine. Does it perform the same as it did on the previous machine? If yes:
3. Publish the workbook to Tableau Server, and find a time when there is low-to-no usage on the Tableau Server. Go to the published workbook. Did it perform relatively the same as the test in Step 2 (within 1-3 seconds)? If yes:
4. Test the workbook during a time of high usage on the Tableau Server (either natural or do load testing using TabJolt).
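The steps above amount to a simple comparison of load times from one environment to the next. Here is a minimal sketch of that logic, assuming you have recorded one load time per step. The `acceptable` threshold of 10 seconds is an arbitrary assumption, the `tolerance` of 3 seconds borrows the upper end of the "within 1-3 seconds" guideline, and the stage labels and function name are my own.

```python
STAGES = [
    "Desktop on workstation",
    "Desktop on the Tableau Server machine",
    "Tableau Server, low usage",
    "Tableau Server, high usage",
]


def first_slow_stage(load_seconds, acceptable=10.0, tolerance=3.0):
    """Return the first stage where performance degrades, or None.

    load_seconds holds one measured load time per stage, in STAGES order.
    """
    # Step 1 is judged on its own: is the workbook fast enough at all?
    if load_seconds[0] > acceptable:
        return STAGES[0]
    # Later steps are judged relative to the step before them.
    for i in range(1, len(load_seconds)):
        if load_seconds[i] - load_seconds[i - 1] > tolerance:
            return STAGES[i]
    return None
```

For example, `first_slow_stage([4.0, 4.5, 5.0, 12.0])` points at the high-usage stage, which suggests looking at server load rather than at the workbook itself.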
Have you heard this one before? “Just connect to your data in Tableau and start visualizing. Then you’ll publish and share with your whole organization.” It’s a great line, because it’s true. You CAN get started with analysis on top of just about any data in Tableau. But “can” is not “should” — what is possible may not be the BEST way, particularly if you want to scale up. When dealing with massive amounts of data, a better solution is to have two data sources: (1) A pre-aggregated data set for overviews, which I’ll call the Overview data source (2) The row-level data set, which I’ll call the Granular data source. Tableau’s abilities to filter between two data sources (actions & cross-datasource filters in Tableau 10) make this an excellent strategy, and one that I have seen massively improve performance over and over.
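As a small illustration of the two-data-source idea, the sketch below derives an Overview data set from Granular rows by summing a measure per dimension value. In practice this aggregation would more likely happen in the database or the ETL layer. The column names and the `build_overview` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical row-level (Granular) data.
granular = [
    {"order_id": 1, "region": "East", "sales": 100.0},
    {"order_id": 2, "region": "East", "sales": 50.0},
    {"order_id": 3, "region": "West", "sales": 75.0},
]


def build_overview(rows, dimension, measure):
    """Pre-aggregate granular rows into one total per dimension value."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)


overview = build_overview(granular, "region", "sales")
```

The Overview data set has one row per region, so the high-level views never touch the row-level table. The Granular data source is queried only after a filter or action has narrowed it down.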
Tableau Server, particularly since the 9.0 release, has a fantastic caching mechanism. Once a view has been loaded into the cache, any subsequent view using the same data will load extremely quickly. This is why you may notice that the first view in the morning takes some time to load, but every other view is much quicker. Some Tableau customers even “warm” the cache on some of their views by scheduling an e-mail or pinging the Tableau Server for a request of a PDF early in the morning, before any of the regular viewers come in. You can even force a refresh using the “warming” technique by appending the :refresh parameter to the end of your request.
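A warming request boils down to building the right URL and fetching it on a schedule. The sketch below only builds the URL. It assumes the common `/views/<workbook>/<view>.pdf` pattern and the `:refresh=yes` form of the parameter. The server name, workbook, and view are hypothetical, and authentication (which a real request would need) is left out.

```python
from urllib.parse import quote


def warm_url(server, workbook, view, refresh=False, fmt="pdf"):
    """Build a view URL suitable for a scheduled cache-warming request."""
    url = f"{server.rstrip('/')}/views/{quote(workbook)}/{quote(view)}.{fmt}"
    if refresh:
        # Ask the server to bypass the cache and re-query the data.
        url += "?:refresh=yes"
    return url


plain = warm_url("https://tableau.example.com", "Sales", "Overview")
forced = warm_url("https://tableau.example.com", "Sales", "Overview", refresh=True)
```

A scheduled job could then request `plain` or `forced` with whatever HTTP client and credentials your environment uses.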
When you are trying to maximize performance in Tableau, particularly on a live connection, sometimes the smallest changes can make a big difference. All of your choices in Tableau Desktop eventually end up as a real live SQL query, which the database will have to interpret. The simpler the query, the easier the interpretation, and in most cases the quicker the results.
Tableau’s Dashboard Actions are amazing, and in the newer versions there is a quick little “Use as filter” button on each sheet in a Dashboard. This creates an Action in the Dashboard->Actions menu which is set to “All Fields” down at the bottom. This is incredibly convenient from a creation standpoint; however, it means that the selected values for every single dimension in the Source Sheet will be passed along as filters in the WHERE clause of the eventual SQL query. This includes categorical information which you are displaying: if you are showing Product Category, Product Sub-Category, and Product ID, all three will be sent in the eventual query.
Particularly when you are getting down to granular details, you really only need the most granular piece of information to be passed into the WHERE clause. For optimal performance, you really only want to pass in values for fields that are indexed in the database. In the previous example, presuming that a Product ID can only belong to one Category and Sub-Category, setting the Action to “Selected Fields” and choosing “Product ID” would simplify the query sent; hopefully Product ID is indexed and thus you get an incredibly quick lookup.
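To see the difference in query shape, here is a minimal sketch that builds the WHERE clause for the same selection under “All Fields” and under “Selected Fields”. Tableau generates its own SQL, so this only illustrates how many predicates the database has to evaluate. The column names and the `where_clause` helper are hypothetical.

```python
def where_clause(selection, fields=None):
    """Return (sql, params) filtering on the chosen fields of a selection.

    With fields=None every selected dimension is used, mimicking "All Fields".
    """
    names = list(selection) if fields is None else fields
    sql = " AND ".join(f"{name} = ?" for name in names)
    params = [selection[name] for name in names]
    return sql, params


# Hypothetical mark selected on the source sheet.
selection = {
    "product_category": "Furniture",
    "product_sub_category": "Chairs",
    "product_id": "FUR-CH-1001",
}

all_sql, all_params = where_clause(selection)                   # "All Fields"
sel_sql, sel_params = where_clause(selection, ["product_id"])   # "Selected Fields"
```

The “Selected Fields” version leaves the database with a single predicate on what is, hopefully, an indexed column.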