Visualize Data with Tableau
Tableau is a commercial tool for exploring and visualizing tabular data.
In addition to making visualizations, Tableau makes it easy to select, filter, group, and describe your data in tables. This can be easier than performing the same operations with Spark from a Python, R, or SQL REPL.
If you have Tableau installed on your laptop, you can connect it to your Spark cluster on an HPC.
In addition to Tableau, you must install this ODBC driver on your laptop. It may require admin privileges.
Concepts
Configuring this workflow involves several moving pieces, so it is worth understanding how they fit together before you start.
Components
Thrift Server: Connects SQL clients (like Tableau) to the Spark cluster over the network.
Hive Metastore: Created by Spark in your working directory on the Lustre filesystem.
PostgreSQL database: Database for the metastore that supports multiple Spark sessions.
Data sources: Parquet/CSV files on the shared filesystem.
Views: Read-only views into the data sources that get registered in the database.
Workflow
Start a Spark cluster with a Thrift Server using a PostgreSQL-based Hive metastore.
Create one or more views into your data sources. This will create the metastore in the current directory by default.
Connect with a SQL client and send queries.
Compute Node Instructions
Acquire one or more compute nodes. The rest of this section assumes that you are logged into the head node of the Slurm allocation.
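For example, an interactive allocation might be requested like this (a sketch; the node count, time limit, and account name are placeholders to adjust for your project):

$ salloc --nodes=1 --time=02:00:00 --account=<your-allocation>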
Start the Spark cluster.
$ sparkctl configure --postgres-hive-metastore --thrift-server
$ sparkctl start
Export the environment variables SPARK_CONF_DIR and JAVA_HOME as shown on the console.
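For example (the values below are placeholders; copy the exact export lines that sparkctl prints):

$ export SPARK_CONF_DIR=<path-printed-by-sparkctl>
$ export JAVA_HOME=<path-printed-by-sparkctl>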
Create one or more views. These examples show how to create one view on the command line and how to create several views in a batch process. Suppose that your data sources are located at /projects/my-project/data.
Create one view. (Note the backslashes that prevent the shell from acting on the backticks.)
$ spark-sql -e \
    "CREATE VIEW my_view AS SELECT * FROM parquet.\`/projects/my-project/data/table1.parquet\`"
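To confirm that the view was registered, you can query it from the same shell (a quick check; my_view is the view created above):

$ spark-sql -e "SELECT * FROM my_view LIMIT 5"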
Create several views.
Add lines like these to a text file called views.txt:

CREATE VIEW my_view1 AS SELECT * FROM parquet.`/projects/my-project/data/table1.parquet`;
CREATE VIEW my_view2 AS SELECT * FROM parquet.`/projects/my-project/data/table2.parquet`;
CREATE VIEW my_view3 AS SELECT * FROM parquet.`/projects/my-project/data/table3.parquet`;
Execute the commands with spark-sql.

$ spark-sql -f views.txt
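If you have many Parquet files, you can also generate views.txt with a short shell loop instead of writing it by hand (a sketch; it assumes one view per file, named after the file, and that the file names are valid SQL identifiers):

$ for f in /projects/my-project/data/*.parquet; do
      name=$(basename "$f" .parquet)
      echo "CREATE VIEW ${name} AS SELECT * FROM parquet.\`${f}\`;"
  done > views.txt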
Client-side Instructions
Create an ssh tunnel from your laptop to the compute node running the Thrift Server (the head node in the Slurm allocation).
Replace the node name and username below with your own.
$ ssh -L 10000:x3000c0s25b0n0:10000 jdoe@kestrel.hpc.nrel.gov
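If you do not know the head node's hostname, you can list the nodes assigned to your job from a login node, and you can confirm the tunnel is open before launching Tableau (both commands are sketches using standard tools):

$ squeue -u $USER -o "%.18i %N"
$ nc -z localhost 10000 && echo "tunnel is up"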
Open Tableau and connect to a new data source. Choose the Spark SQL connector installed above. It may be listed in the Connect window under To a Server after selecting More...
A Spark SQL window will open. Choose these settings:
Connection = SparkThriftServer (Spark 1.1 and later)
Server = localhost
Port = 10000
Authentication = Username
Password: leave blank
Click Sign In. A Tableau workbook will open with a connection to localhost.
Select the Schema dropdown, type default, and click the plus button.
A Table section will open. Click the search button next to Enter table name. Tableau should discover the tables in your database and list them. Double-click on one.
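If Tableau cannot connect, you can first verify that the Thrift Server answers queries by running beeline, the JDBC client bundled with Spark, on the compute node (a sketch; it assumes the default port of 10000):

$ beeline -u jdbc:hive2://localhost:10000 -n $USER -e "SHOW TABLES;"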
Persistent metastore and spark-warehouse
The default locations of the metastore and spark-warehouse are in your current directory.
You will notice the directories metastore_db and spark-warehouse.
If you have a set of static tables that you use frequently, you may want to create a persistent metastore and spark-warehouse. You can set these locations with the --metastore-dir option.
$ sparkctl configure --metastore-dir <your-path>
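For example (the path is a placeholder; include the other configure flags used earlier if your setup needs them):

$ sparkctl configure --postgres-hive-metastore --thrift-server --metastore-dir /projects/my-project/spark-metastore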
You may notice that this updates the files conf/hive-site.xml and conf/spark-defaults.conf.
All tables and views that you create will be available to any cluster started this way.
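To confirm that a new session is reusing the persistent metastore, you can list the registered tables and views (a quick check):

$ spark-sql -e "SHOW TABLES"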