Third-Party Machine Learning Integrations

This section provides instructions and examples for installing, configuring, and running some of the most popular third-party ML tools in Azure Databricks. Azure Databricks provides these examples on a best-effort basis. Because these tools are external libraries, they can change in ways that are not easy to predict. If you need additional support for third-party tools, consult the documentation, mailing lists, forums, or other support options provided by the library vendor or maintainer.

H2O Sparkling Water

H2O is an open source project for distributed machine learning. This section describes how to integrate H2O using the Sparkling Water module.

Note

After you connect to the H2O server, the output includes the instruction "Open H2O Flow in browser http://<ipaddress>:54321 (CMD + click in Mac OSX)". Following that instruction requires SSH access to the cluster and is not supported on Azure Databricks.

Python Notebook

Note

Databricks Runtime for Machine Learning installs XGBoost, which conflicts with the XGBoost packaged in PySparkling. To use PySparkling on Databricks Runtime ML, you must first remove XGBoost using this command:

rm /databricks/jars/spark--maven-trees--ml--xgboost*
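If you prefer to check which files the command above would delete before removing anything, the following is a minimal Python sketch of the same step. The directory and file name pattern come from the command above; the helper names find_conflicting_jars and remove_jars are hypothetical and exist only for this illustration. It assumes the code runs on a node where /databricks/jars exists, and it makes no claim about when the removal takes effect on a running cluster.

```python
import glob
import os


def find_conflicting_jars(jar_dir="/databricks/jars",
                          pattern="spark--maven-trees--ml--xgboost*"):
    """Return the sorted paths in jar_dir whose names match pattern."""
    return sorted(glob.glob(os.path.join(jar_dir, pattern)))


def remove_jars(paths):
    """Delete each file in paths and return the list of removed paths."""
    removed = []
    for path in paths:
        os.remove(path)
        removed.append(path)
    return removed


# Inspect the matches first; call remove_jars(jars) only once the list looks right.
jars = find_conflicting_jars()
print(f"Found {len(jars)} matching JAR(s): {jars}")
```

Listing the matches before deleting them makes it easier to confirm that the wildcard matches only the XGBoost JARs and nothing else in the directory.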

scikit-learn

scikit-learn, a well-known Python machine learning library, is included in Databricks Runtime. See Databricks Runtime Release Notes for the scikit-learn library version included with your cluster’s runtime.
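As a complement to the release notes, you can also check the installed version directly from a notebook. The following is a minimal sketch that uses only the Python standard library (importlib.metadata, available in Python 3.8 and later); the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string, or None when the package is absent.
print("scikit-learn:", installed_version("scikit-learn"))
```

The same helper works for any other package name, which can be useful when comparing library versions across clusters.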

DataRobot

The DataRobot modeling engine is a commercial product that supports massively parallel modeling applications, building and optimizing models of many different types, and evaluating and ranking their relative performance. The engine is available in several implementations: some are cloud-based and accessed over the Internet, and others reside in customer-specific on-premises computing environments. Read more at DataRobot.

XGBoost

XGBoost is a popular machine learning library designed for training gradient-boosted decision trees; it can also train random forests. You can train XGBoost models on individual machines or in a distributed fashion. Read more in the XGBoost documentation.

XGBoost versions

There are two versions of XGBoost: a Python version, which is not distributed, and a Scala-based Spark version, which supports distributed training.

Single node training

To install the non-distributed Python version, run:

/databricks/python/bin/pip install xgboost --pre

With this Python version, you can train only single-node workloads.
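To illustrate single-node training, here is a minimal sketch that uses the XGBoost Python package's DMatrix, train, and predict calls on synthetic data. The data, the parameter values, and the helper names make_params and train_single_node are assumptions made for this example; they are not recommended settings. The sketch also assumes that NumPy is available alongside XGBoost.

```python
def make_params(max_depth=3, eta=0.1):
    """Build a parameter dict for binary classification (illustrative values)."""
    return {"objective": "binary:logistic", "max_depth": max_depth, "eta": eta}


def train_single_node(num_rows=200, num_features=5, num_rounds=10, seed=0):
    """Train a booster on synthetic data and return its training accuracy."""
    # Imported here so make_params stays usable without these packages.
    import numpy as np
    import xgboost as xgb

    # Synthetic features and labels, generated only for this illustration.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(num_rows, num_features))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Everything below runs on the local machine; nothing is distributed.
    dtrain = xgb.DMatrix(X, label=y)
    booster = xgb.train(make_params(), dtrain, num_boost_round=num_rounds)

    # Threshold the predicted probabilities at 0.5 to get class labels.
    preds = booster.predict(dtrain) > 0.5
    return float((preds == y).mean())
```

After installing the package, calling train_single_node() in a notebook cell trains the model on the node where the cell runs and returns the fraction of training rows classified correctly.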

Distributed training

To perform distributed training, you must use XGBoost’s Scala/Java packages.

Install XGBoost

Note

XGBoost is included in Databricks Runtime ML, a machine learning runtime that provides a ready-to-go environment for machine learning and data science. Instead of installing XGBoost using the instructions below, you can simply create a cluster using Databricks Runtime ML. See Overview of Databricks Runtime for Machine Learning.

Install XGBoost as a Databricks library, using xgboost-linux64 as the Spark Package name.