Connecting your Azure Databricks Workspace to your On-Premises Network

This is a high-level guide on how to establish connectivity from your Azure Databricks Workspace to your on-premises network. It is based on the hub-and-spoke topology shown in the following diagram, where traffic is routed via a transit virtual network (VNet) to the on-premises network.

This process requires your Azure Databricks workspace to be deployed in your own virtual network (also known as VNet injection).

[Diagram: hub-and-spoke topology in which the Azure Databricks VNet is peered with a transit VNet that connects to the on-premises network through a virtual network gateway]

Note

Don’t hesitate to reach out to your Microsoft and Databricks account teams to discuss the configuration process described in this topic.

Prerequisites

Your Azure Databricks workspace must be deployed in your own virtual network.

Step 1: Set up a transit virtual network with Azure Virtual Network Gateway

On-premises connectivity requires a Virtual Network Gateway (ExpressRoute or VPN) in a transit VNet. Skip to step 2 if one already exists.

  • If you already have ExpressRoute set up between your on-premises network and Azure, ensure that a virtual network gateway is set up in a transit VNet, as described in this Microsoft Azure document.
  • If you do not have ExpressRoute set up, follow steps 1-5 in this Microsoft Azure document to create a transit VNet with a VPN-based virtual network gateway.
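
If you are creating the VPN-based gateway programmatically, the following Python sketch using the azure-mgmt-network SDK shows the general shape of the call. It is illustrative only: the subscription ID, resource group, VNet, public IP, region, and SKU are placeholders, and it assumes the transit VNet already contains a GatewaySubnet and a public IP address. The portal, Azure CLI, or an ARM template works equally well.

    # Hypothetical sketch: create a VPN virtual network gateway in the transit VNet.
    # Assumes the transit VNet already has a "GatewaySubnet" and a public IP resource.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.network import NetworkManagementClient

    subscription_id = "<subscription-id>"          # placeholder
    rg = "transit-vnet-rg"                         # placeholder resource group
    client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

    gateway_subnet = client.subnets.get(rg, "transit-vnet", "GatewaySubnet")
    public_ip = client.public_ip_addresses.get(rg, "transit-vnet-gw-pip")

    poller = client.virtual_network_gateways.begin_create_or_update(
        rg,
        "transit-vnet-gateway",
        {
            "location": "eastus2",                 # placeholder region
            "gateway_type": "Vpn",                 # "ExpressRoute" for an ExpressRoute gateway
            "vpn_type": "RouteBased",
            "sku": {"name": "VpnGw1", "tier": "VpnGw1"},
            "ip_configurations": [
                {
                    "name": "gw-ipconfig",
                    "subnet": {"id": gateway_subnet.id},
                    "public_ip_address": {"id": public_ip.id},
                }
            ],
        },
    )
    gateway = poller.result()  # gateway creation can take 30 minutes or more
    print(gateway.provisioning_state)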

Note

For assistance, contact your Microsoft account team.

Step 2: Peer the Azure Databricks virtual network with the transit virtual network

Follow the instructions in Virtual Network Peering to peer the Azure Databricks VNet to the transit VNet, selecting the following options:

  • Use Remote Gateways on the Azure Databricks VNet side.
  • Allow Gateway Transit on the Transit VNet side.

You can learn more about these options in this Microsoft Azure document.

Note

If your on-premises network connection to Azure Databricks does not work with the above settings, you can also select the Allow Forwarded Traffic option on both sides of the peering to resolve the issue.

For more information about configuring VPN gateway transit for virtual network peering, see this Microsoft Azure document.
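
The following Python sketch (azure-mgmt-network SDK) shows one way to create both sides of the peering with these options. All resource group and VNet names are placeholders; adjust them for your environment.

    # Hypothetical sketch: peer the Azure Databricks VNet with the transit VNet.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.network import NetworkManagementClient

    subscription_id = "<subscription-id>"                      # placeholder
    client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

    databricks_vnet = client.virtual_networks.get("adb-rg", "databricks-vnet")
    transit_vnet = client.virtual_networks.get("transit-rg", "transit-vnet")

    # Azure Databricks VNet side: Use Remote Gateways.
    client.virtual_network_peerings.begin_create_or_update(
        "adb-rg", "databricks-vnet", "databricks-to-transit",
        {
            "remote_virtual_network": {"id": transit_vnet.id},
            "allow_virtual_network_access": True,
            "use_remote_gateways": True,
            "allow_gateway_transit": False,
            "allow_forwarded_traffic": False,  # set True on both sides if connectivity fails (see note above)
        },
    ).result()

    # Transit VNet side: Allow Gateway Transit.
    client.virtual_network_peerings.begin_create_or_update(
        "transit-rg", "transit-vnet", "transit-to-databricks",
        {
            "remote_virtual_network": {"id": databricks_vnet.id},
            "allow_virtual_network_access": True,
            "allow_gateway_transit": True,
            "use_remote_gateways": False,
            "allow_forwarded_traffic": False,  # set True on both sides if connectivity fails (see note above)
        },
    ).result()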

Step 3: Create user-defined routes and associate them with your Azure Databricks virtual network subnets

Once the Azure Databricks VNet is peered with the transit VNet (using the virtual network gateway), Azure automatically configures all routes through the transit VNet. This can break cluster creation in the Azure Databricks workspace, because a properly configured return route from the cluster nodes to the Azure Databricks control plane may be missing. You must therefore create user-defined routes (also known as UDRs or custom routes); a code sketch of this step follows the notes at the end of this section.

  1. Create a route table using the instructions in this Microsoft Azure document.

    When you create the route table, enable BGP route propagation.

    Note

    If your on-premises network connection setup fails during testing, you might need to disable the BGP route propagation option. Disable it only as a last resort.

  2. Add user-defined routes for the following services, using the instructions in this Microsoft Azure document.

    Source    Address prefixes             Next hop type
    -------   --------------------------   -------------
    Default   Control Plane NAT IP         Internet
    Default   Webapp IP                    Internet
    Default   Metastore IP                 Internet
    Default   Artifact Blob Storage IP     Internet
    Default   Log Blob Storage IP          Internet
    Default   DBFS root Blob Storage IP    Internet

    To get the IP addresses for each of these services, follow the instructions in User-Defined Route Settings for Azure Databricks.

  3. Associate the route table with your Azure Databricks VNet public and private subnets, using the instructions in this Microsoft Azure document.

    Once the custom route table has been associated with your Azure Databricks VNet subnets, you do not need to edit the outbound security rules in the network security group. You can optionally change the outbound rule for “Internet / Any” to be more specific, but it is not required, because the user-defined routes control the actual egress.

Note

If the IP-based route fails during testing, you can create a service endpoint for Microsoft.Storage, such that all Blob Storage traffic goes through the Azure backbone. You can also take this approach instead of creating user-defined routes for Blob Storage.

Note

If you want to access other Azure PaaS data services from Azure Databricks, such as Azure Cosmos DB or Azure SQL Data Warehouse, you must also add user-defined routes to the route table for those services. Resolve each endpoint to its IP address using nslookup or an equivalent command.
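
The following Python sketch (azure-mgmt-network SDK) illustrates this step end to end: it creates a route table with BGP route propagation enabled, adds the user-defined routes listed in the table above, and associates the route table with the Azure Databricks public and private subnets. The resource names, region, and IP addresses are placeholders; use the region-specific values from User-Defined Route Settings for Azure Databricks.

    # Hypothetical sketch: route table with user-defined routes for the Azure Databricks subnets.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.network import NetworkManagementClient

    subscription_id = "<subscription-id>"              # placeholder
    rg = "adb-rg"                                      # placeholder resource group
    client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

    # Create the route table with BGP route propagation enabled (that is, not disabled).
    route_table = client.route_tables.begin_create_or_update(
        rg, "adb-udr",
        {"location": "eastus2", "disable_bgp_route_propagation": False},
    ).result()

    # Placeholder IPs -- substitute the region-specific values from
    # User-Defined Route Settings for Azure Databricks.
    udrs = {
        "to-control-plane-nat": "203.0.113.10/32",
        "to-webapp": "203.0.113.20/32",
        "to-metastore": "203.0.113.30/32",
        "to-artifact-blob": "203.0.113.40/32",
        "to-log-blob": "203.0.113.50/32",
        "to-dbfs-root-blob": "203.0.113.60/32",
    }
    for name, prefix in udrs.items():
        client.routes.begin_create_or_update(
            rg, "adb-udr", name,
            {"address_prefix": prefix, "next_hop_type": "Internet"},
        ).result()

    # Associate the route table with the Databricks public and private subnets,
    # changing only the route table reference on each subnet.
    for subnet_name in ("public-subnet", "private-subnet"):
        subnet = client.subnets.get(rg, "databricks-vnet", subnet_name)
        subnet.route_table = {"id": route_table.id}
        client.subnets.begin_create_or_update(rg, "databricks-vnet", subnet_name, subnet).result()

If you follow the service endpoint approach from the first note instead, you would omit the Blob Storage routes above and add a Microsoft.Storage service endpoint to the subnets.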

Step 4: Validate the setup

To validate the setup:

  1. Create a cluster in your Azure Databricks workspace.

    If this fails, go through the instructions in Steps 1-3 again, trying the alternate configurations mentioned in the notes.

    If you still cannot create a cluster, check to see if the route table has all of the required user-defined routes. If you used service endpoints rather than user-defined routes for Blob Storage, check that configuration as well.

    If that fails, reach out to your Microsoft and Databricks account teams for assistance.

  2. Once the cluster is created, try connecting to an on-premises VM by doing a simple ping from a notebook using %sh.

    If the ping is unsuccessful, verify the effective routes for the cluster nodes to see whether a combination of custom and default routes is being applied incorrectly, and fix the routes if needed. A sketch of one way to retrieve the effective routes follows.
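
One way to retrieve the effective routes programmatically is shown in the following sketch (azure-mgmt-network SDK). The subscription ID, resource group, and network interface name are placeholders; with VNet injection, the cluster node network interfaces are typically found in the workspace’s managed resource group.

    # Hypothetical sketch: dump the effective routes applied to a cluster node's NIC.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.network import NetworkManagementClient

    subscription_id = "<subscription-id>"                  # placeholder
    managed_rg = "databricks-rg-<workspace>"               # placeholder: workspace managed resource group
    nic_name = "<cluster-node-nic>"                        # placeholder: a cluster node's network interface

    client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)
    effective = client.network_interfaces.begin_get_effective_route_table(managed_rg, nic_name).result()

    for route in effective.value:
        print(route.source, route.address_prefix, route.next_hop_type, route.next_hop_ip_address)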

The optional configurations described in the following sections can also be helpful when troubleshooting.

Option: Route Azure Databricks traffic using a virtual appliance or firewall

You might also want to filter or vet all outgoing traffic from Azure Databricks cluster nodes using a firewall or DLP appliance (such as Azure Firewall, Palo Alto, Barracuda, and so forth). This may be required to:

  • Satisfy enterprise security policies that mandate that all outgoing traffic be “inspected” and allowed or denied as configured.
  • Get a single NAT-like public IP or CIDR for all Azure Databricks clusters, which could be configured in a whitelist for any data source.

To set up filtering this way, follow the instructions in Steps 1-4, with some additional steps. What follows is just for reference; details may vary by firewall appliance:

  1. Set up a virtual appliance or firewall within the transit VNet, using the instructions in this Microsoft Azure document.

    Alternatively, you can create the firewall in a secure or DMZ subnet within the Azure Databricks VNet, separate from the existing private and public subnets. However, if you require a firewall setup for multiple workspaces, the transit VNet solution in a hub-and-spoke topology is recommended.

  2. Create an additional route in the custom route table to 0.0.0.0/0 with Next hop type set to “Virtual Appliance” and an appropriate Next hop address (see the sketch after this list).

    The routes configured in Step 3 should remain, although routes (or service endpoints) for Blob Storage can be removed if all Blob Storage traffic needs to be routed via the firewall.

    If you use the secure or DMZ subnet approach, you may need to create an additional route table to associate with that subnet only. That route table should have a route to 0.0.0.0/0 with Next hop type set to “Internet” or “Virtual network gateway”, depending on whether traffic is destined for a public network directly or goes via the on-premises network.

  3. Configure allow and deny rules in the firewall appliance.

    If you have removed the routes for Blob Storage, the corresponding Blob Storage endpoints must be whitelisted in the firewall.

    You might also need to whitelist certain public repositories (such as the Ubuntu package repositories) to ensure that clusters are created correctly.

    For whitelisting information, see User-Defined Route Settings for Azure Databricks.
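
The additional 0.0.0.0/0 route from step 2 above might be created as in the following sketch (azure-mgmt-network SDK); the route table name and the firewall’s private IP address are placeholders.

    # Hypothetical sketch: send all remaining egress to a firewall / network virtual appliance.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.network import NetworkManagementClient

    subscription_id = "<subscription-id>"      # placeholder
    rg = "adb-rg"                              # placeholder resource group holding the custom route table
    firewall_private_ip = "10.1.1.4"           # placeholder: private IP of the firewall appliance

    client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

    client.routes.begin_create_or_update(
        rg, "adb-udr", "default-via-firewall",
        {
            "address_prefix": "0.0.0.0/0",
            "next_hop_type": "VirtualAppliance",
            "next_hop_ip_address": firewall_private_ip,
        },
    ).result()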

Option: Configure custom DNS

If you want to use your own DNS for name resolution, you can do so with Azure Databricks workspaces deployed in your own virtual network. See the Microsoft documentation on configuring custom DNS for an Azure virtual network for more information.
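
For illustration, the following sketch (azure-mgmt-network SDK) sets custom DNS servers on the Azure Databricks VNet. The DNS server IP addresses and resource names are placeholders, and the same setting can be applied from the portal or an ARM template.

    # Hypothetical sketch: configure custom DNS servers on the VNet used for VNet injection.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.network import NetworkManagementClient

    subscription_id = "<subscription-id>"      # placeholder
    rg = "adb-rg"                              # placeholder resource group

    client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

    vnet = client.virtual_networks.get(rg, "databricks-vnet")
    vnet.dhcp_options = {"dns_servers": ["10.2.0.4", "10.2.0.5"]}   # placeholder custom DNS server IPs
    client.virtual_networks.begin_create_or_update(rg, "databricks-vnet", vnet).result()
    # Existing VMs typically pick up DNS changes only after a restart or DHCP lease renewal.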