THE PROCESS OF CROSS-DATABASE JOINS IN TABLEAU
Tableau 10.0 introduced a new feature called cross-database joins. In today’s world it is no surprise that data required for analysis resides in numerous sources that need multiple files and database manipulation. Prior to Tableau 10.0, performing data preparation required compiling data into one file or a database table before connecting to Tableau. The cross-database join feature has simplified the process of bringing data together for exploration and uncovering new insight.
In some cases, Tableau is able to solve many use cases using the data blending technique. However, there are instances where data blending does not perform effectively and data from different sources are brought into a single source to “join” for the desired results. With the introduction of cross-database joins, analysts can directly perform joins across multiple data sources from within Tableau. For example, joins can be established between a table in SQL Server and another table in an Oracle database, Excel file or even a flat file.
When joins are performed on tables residing in the same database, Tableau only receives the output of the joins directly from the database and does not execute the join queries. On the other hand, when the cross database join feature is active, Tableau cannot ask the database to carry out the join since the tables are from different databases. Instead, the entire join is executed by Tableau. The following steps were taken to investigate the process carried out by Tableau:
1. The Orders table from the Superstore Excel file were replicated into a table in an SQL Server database.
- Connection to an SQL Server database was first established in Tableau; a connection was then added in the same data source by selecting the Superstore Excel file. A single data source was created with two connections.
- The Orders table was dragged from the SQL Server database in the Connect pane to the canvas, followed by the Returns table from the Excel file. Tableau automatically defined the join by linking the Order ID field from both the tables.
4. A sheet/view with the following fields was created:
a. Dimensions
i. Region (from SQL Server – Orders table) ii. Category (from SQL Server – Orders table) iii. Returned (from Excel – Returns table)
b. Measures
i. Sales (from SQL Server – Orders table)
Performance Recorder was turned on while the view was accessed. Once the execution was complete, the results of the Performance Recorder were analyzed, yielding the following observations:
1. A query is sent to the SQL Server database with the following retrieved fields:
a. Region b. Category c. Order ID d. Sales
The Region, Category, and Sales fields shown in the view/sheet are added to the query.
The Order ID field as the linking field in the join is included in the query even though it is not part of the view/sheet.
The query is non-aggregated.
2. A second query is sent to the Excel file with the following retrieved fields:
a. Order ID b. Returned
The Returned field shown in the view/sheet is added to the query.
The Order ID field as the linking field in the join is included in the query even though it is not part of the view/sheet.
The query is non-aggregated.
- Tableau receives the results of both the queries from each respective database and stores the data locally. Although the method for storing data is unknown, the assumption is that Tableau creates temporary data extract files to store the results of queries. This method would prove to be additional workload where data is retrieved and temporary tables are populated.
- Tableau applies the join on the temporary datasets (saved locally) and creates the final dataset. Tableau uses the underlying data extract engine (TDE Server) to join the datasets. The Order ID linking field present in both of the results is used to establish the join between the datasets. This join query is aggregated.
Conclusion It is important to note that the execution speed when using cross-database joins may vary depending on the source type and size of the data. Retrieving data from multiple data sources, storing them locally, and then applying the join may be a viable option when working with smaller data sets. However, when dealing with larger volumes of data, defining a proper and efficient data strategy may be a better approach.
This Tableau Tip was written by:
Sourabh Dasgupta
Sourabh Dasgupta is the Director of Technology & BI at Corporate Renaissance Group India and is also the Product Manager of CRG’s in-house software products such as FlexABM and Cost Allocator. Over the past 14 years he has been involved as a technical expert in various projects and assignments related to implementation of Activity Based Costing and Business Intelligence and reporting systems in India as well as South Africa.
He is a one of the professional trainers for Tableau software and has conducted many sessions for clients. He has been involved in designing and development of Tableau reports and dashboards, and was awarded with a Tableau Desktop 8 Qualified Associate certificate as well as a Tableau Server 8 Qualified Associate certificate. He has successfully conducted training on the Tableau Software (Desktop and Server versions) for many clients, which included basic and advanced features of the Tableau Desktop and Tableau Server.
Sourabh has a Bachelor of Science degree (Computer Science, Physics, and Mathematics) from Nagpur University, India and a Diploma from the National Institute of Information Technology (NIIT), Pune, India. He also holds a Microsoft Certified Professional certificate for Designing and Implementing Desktop Applications with Microsoft Visual basic.
If you wish to accelerate your skills and empower your learning with hands-on education, or are seeking an Expert-on-Demand, reach out to Sourabh at sdasgupta@crgroup.com
Leave a Comments
You must be logged in to post a comment.