TensorFlow is an open-source framework for machine learning created by Google. It supports deep learning and general numerical computation on CPUs, GPUs, and clusters of GPUs. It is subject to the terms and conditions of the Apache License 2.0.
In the sections below, we provide guidance on installing TensorFlow on Azure Databricks and give an example of running TensorFlow programs. See Integrating Deep Learning Libraries with Apache Spark for an example of integrating a deep learning library with Spark.
This guide is not a comprehensive guide to TensorFlow; for complete documentation, refer to the TensorFlow website.
- On CPUs, use the tensorflow package.
- On GPUs, use the tensorflow-gpu package.
TensorFlow is included in Databricks Runtime ML, a machine learning runtime that provides a ready-to-go environment for machine learning and data science. Instead of manually installing TensorFlow, you can create a cluster using Databricks Runtime ML. For details, see Databricks Runtime 4.1 ML (Beta).
To test and migrate single-machine TensorFlow workflows, you can start with a driver-only cluster on Azure Databricks by setting the number of workers to zero. Though Apache Spark is not functional under this setting, it is a cost-effective way to run single-machine TensorFlow workflows. This example shows how you can run TensorFlow.
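As a minimal sketch of such a single-machine workflow, the snippet below builds and evaluates a small matrix multiplication. The values and variable names are illustrative; the session-based path matches the TensorFlow 1.x API that was current for Databricks Runtime 4.1 ML, with an eager-execution fallback for later TensorFlow versions.

```python
import tensorflow as tf

# Build a small computation: multiply a constant matrix by itself.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.matmul(x, x)

if hasattr(tf, "Session"):
    # TensorFlow 1.x: run the graph in a session.
    with tf.Session() as sess:
        result = sess.run(y)
else:
    # TensorFlow 2.x: eager execution, the value is available directly.
    result = y.numpy()

print(result)  # [[ 7. 10.] [15. 22.]]
```

Because no Spark APIs are involved, code like this runs unchanged on a driver-only cluster.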
spark-tensorflow-connector is a library within the TensorFlow ecosystem that enables conversion between Spark DataFrames and TFRecords (a popular format for storing data for TensorFlow). With spark-tensorflow-connector, you can use Spark DataFrame APIs to read TFRecords files into DataFrames and write DataFrames as TFRecords.
The spark-tensorflow-connector library is included in Databricks Runtime ML (Beta), a machine learning runtime that provides a ready-to-go environment for machine learning and data science. Instead of installing the library using the instructions below, you can simply create a cluster using Databricks Runtime ML. See Databricks Runtime 4.1 ML (Beta).
To use spark-tensorflow-connector on Azure Databricks, you’ll need to build the project JAR locally, upload it to Azure Databricks, and attach it to your cluster as a library.
Ensure you have Maven in your PATH (see the Maven installation instructions if needed).
Clone the TensorFlow ecosystem repository and cd into the spark-tensorflow-connector directory:

git clone https://github.com/tensorflow/ecosystem
cd ecosystem/spark/spark-tensorflow-connector
Follow the instructions in the README to build the project locally. For the build to succeed, you may need to modify the test configuration so that tests run serially. You can do this by adding a <configuration> tag to the test plugin's definition in the project's pom.xml:

<configuration>
  <parallel>false</parallel>
</configuration>
The build command prints the path of the spark-tensorflow-connector JAR, for example:
Installing /Users/<yourusername>/ecosystem/spark/spark-tensorflow-connector/target/spark-tensorflow-connector_2.11-1.6.0.jar to /Users/<yourusername>/.m2/repository/org/tensorflow/spark-tensorflow-connector_2.11/1.6.0/spark-tensorflow-connector_2.11-1.6.0.jar
Upload this JAR to Azure Databricks as a library and attach it to your cluster. You should now be able to run the example notebook (adapted from the spark-tensorflow-connector usage examples):