How to Run SQL Queries from Python Scripts

You may want to access your tables outside of Azure Databricks notebooks. Besides connecting BI tools via JDBC (see Business Intelligence Tools) and using the curl tool (see Connecting BI Tools), you can also access tables from Python scripts. One way is to connect to the cluster's Thrift server with PyHive. PyHive must be installed on the machine where you run the Python script.

  1. Install PyHive and Thrift for Python 2:

    pip2 install pyhive thrift
    
  2. In the following example, replace <token> with your Azure Databricks API token, <databricks-instance> with the domain name of your Databricks deployment, <workspace-id> with the workspace ID, and <cluster-id> with a cluster ID. To get the API token, see Generate a token. To determine the other values, see How to get Workspace, Cluster, Notebook, and Job Details. (Alternatively, you can read these values from environment variables, as shown in the sketch after this list.)
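
Rather than hard-coding the token and the other connection details in the script, you can read them from environment variables. This is only a sketch; the variable names (DATABRICKS_TOKEN, DATABRICKS_HOST, DATABRICKS_WORKSPACE_ID, DATABRICKS_CLUSTER_ID) are illustrative, not names required by any tool:

import os

# Illustrative environment variable names; adjust to your own setup.
TOKEN = os.environ['DATABRICKS_TOKEN']
WORKSPACE_URL = os.environ['DATABRICKS_HOST']
WORKSPACE_ID = os.environ['DATABRICKS_WORKSPACE_ID']
CLUSTER_ID = os.environ['DATABRICKS_CLUSTER_ID']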

Example

Here is sample code that runs the SQL query show tables and prints the result.

#!/usr/bin/python

import base64

from pyhive import hive
from thrift.transport import THttpClient

TOKEN = "<token>"
WORKSPACE_URL = "<databricks-instance>"
WORKSPACE_ID = "<workspace-id>"
CLUSTER_ID = "<cluster-id>"

# Build the HTTPS endpoint of the cluster's Thrift server.
conn_url = 'https://%s/sql/protocolv1/o/%s/%s' % (WORKSPACE_URL, WORKSPACE_ID, CLUSTER_ID)
print(conn_url)

transport = THttpClient.THttpClient(conn_url)

# Authenticate with HTTP basic auth, using the literal user name "token"
# and the API token as the password.
auth = base64.standard_b64encode('token:{0}'.format(TOKEN).encode()).decode()
transport.setCustomHeaders({'Authorization': 'Basic {0}'.format(auth)})

cursor = hive.connect(thrift_transport=transport).cursor()

# Run the query and print each row of the result.
cursor.execute('show tables')
for table in cursor.fetchall():
    print(table)
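
The same cursor can run any SQL statement the cluster supports. The following sketch assumes a table named my_table exists in the default database; it fetches a few rows, prints the column names from the cursor metadata, and closes the cursor when done:

# Hypothetical follow-up query; assumes a table named my_table exists.
cursor.execute('SELECT * FROM my_table LIMIT 10')

# Column names are exposed through the DB-API cursor metadata.
print([col[0] for col in cursor.description])

for row in cursor.fetchall():
    print(row)

cursor.close()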