This section provides instructions and examples of how to install, configure, and run some of the most popular third-party ML tools in Azure Databricks. Azure Databricks provides these examples on a best-effort basis. Because they are external libraries, they may change in ways that are not easy to predict. If you need additional support for third-party tools, consult the documentation, mailing lists, forums, or other support options provided by the library vendor or maintainer.
The instruction to "Open H2O Flow in browser http://<ipaddress>:54321 (CMD + click in Mac OSX)" that appears after connecting to the H2O server requires SSH access to the cluster and is not supported on Azure Databricks.
scikit-learn, a well-known Python machine learning library, is included in Databricks Runtime. See Databricks Runtime Release Notes for the scikit-learn library version included with your cluster’s runtime.
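As a quick alternative to looking up the release notes, you can check the installed version from a notebook cell. The following is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the helper name `installed_version` is hypothetical and not part of any Databricks API.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Prints the version bundled with the attached cluster's Python environment,
# or None if scikit-learn is not installed there.
print("scikit-learn:", installed_version("scikit-learn"))
```

The same helper works for any other distribution name, for example `installed_version("xgboost")`.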
The DataRobot modeling engine is a commercial product that supports massively parallel modeling applications, building and optimizing models of many different types, and evaluating and ranking their relative performance. This modeling engine exists in a variety of implementations, some cloud-based, accessed via the Internet, and others residing in customer-specific on-premises computing environments. Read more at DataRobot.
XGBoost is a popular machine learning library for training gradient-boosted decision trees and random forests. You can train XGBoost models on a single machine or in a distributed fashion. Read more in the XGBoost documentation.
There are two versions of XGBoost: a Python version, which is not distributed, and a Scala-based Spark version, which supports distributed training.
To install the non-distributed Python version, run:
/databricks/python/bin/pip install xgboost --pre
This Python version supports only single-node training workloads.
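Once the package is installed, a short smoke test can confirm that single-node training works. This is a sketch under stated assumptions rather than an official example: the data is synthetic, the function names `xgboost_available` and `train_single_node` are hypothetical, and the parameter values are arbitrary illustrative choices. It relies only on `xgboost.DMatrix`, `xgboost.train`, and `Booster.predict` from the XGBoost Python package, and it skips training if the package is missing.

```python
import importlib.util


def xgboost_available():
    """Return True if the xgboost package can be found in this environment."""
    return importlib.util.find_spec("xgboost") is not None


def train_single_node(n_rows=200, n_features=5, seed=0):
    """Train a small booster on synthetic data; everything runs on one node."""
    import numpy as np
    import xgboost as xgb

    rng = np.random.RandomState(seed)
    X = rng.randn(n_rows, n_features)
    # Synthetic binary label: 1 when the first two features sum above zero.
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    dtrain = xgb.DMatrix(X, label=y)
    # Illustrative parameters only; tune them for real workloads.
    params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.3}
    booster = xgb.train(params, dtrain, num_boost_round=20)
    # Predicted probabilities, one per training row.
    return booster.predict(dtrain)


if xgboost_available():
    preds = train_single_node()
    print(f"trained on one node; {len(preds)} predictions")
else:
    print("xgboost is not installed in this Python environment")
```

Because this runs entirely in the driver's Python process, the data must fit in that node's memory.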
XGBoost is included in Databricks Runtime ML, a machine learning runtime that provides a ready-to-go environment for machine learning and data science. Instead of installing XGBoost using the instructions below, you can simply create a cluster using Databricks Runtime ML. See Overview of Databricks Runtime for Machine Learning.
Install XGBoost as a Databricks library, using xgboost-linux64 as the Spark Package name.