Migrating Single Node Workloads to Azure Databricks

This topic answers common questions that come up when you migrate single-node workloads to Azure Databricks.

I just created a 20 node Spark cluster and my pandas code doesn’t run any faster. What is going wrong?
Single-node libraries such as pandas do not become distributed when you run them on Azure Databricks. They execute on the driver only, so adding worker nodes does not make them faster. To use the whole cluster, rewrite your code with PySpark, or consider installing Koalas, which provides a pandas-like API on top of Spark.
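
To illustrate the kind of rewrite involved, here is a minimal sketch that computes the same per-key mean in pandas and in PySpark. The column names `key` and `value` and both function names are hypothetical, chosen only for this example, and the PySpark function assumes it receives a Spark DataFrame.

```python
import pandas as pd


def pandas_mean_by_key(pdf: pd.DataFrame) -> pd.DataFrame:
    """Single-node version: runs entirely on the driver."""
    return (
        pdf.groupby("key", as_index=False)["value"]
        .mean()
        .rename(columns={"value": "mean_value"})
    )


def spark_mean_by_key(sdf):
    """Distributed version: Spark plans the aggregation across the workers."""
    # Imported here so the pandas half of the sketch runs without Spark installed.
    from pyspark.sql import functions as F

    return sdf.groupBy("key").agg(F.avg("value").alias("mean_value"))


# Hypothetical sample data for illustration only.
sample = pd.DataFrame({"key": ["a", "a", "b"], "value": [1.0, 3.0, 5.0]})
print(pandas_mean_by_key(sample))
```

In a notebook where a `spark` session is already defined, something like `spark_mean_by_key(spark.createDataFrame(sample)).show()` should return the same two groups, although Spark does not guarantee the row order.
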
I see there is MLlib and SparkML. What is the difference?
MLlib is the RDD-based API, while SparkML is the DataFrame-based API. We recommend that you use SparkML, because all active development is focused on it. Note that people sometimes use the term MLlib more generically to refer to the distributed ML libraries for Spark as a whole.
There is an algorithm in sklearn that I love (such as DBSCAN), but SparkML doesn’t support it. What are my alternatives?
Try spark-sklearn. See the spark_sklearn documentation.
What are my deployment options for SparkML?
Why aren’t my matplotlib images displaying?
You must pass each figure to the display() function. See Matplotlib and ggplot in Python Notebooks.
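
As a rough sketch of what that looks like, the snippet below builds a small figure and hands it to `display()`. The data and title are made up for illustration, and the `try`/`except` fallback is an assumption added so the snippet also runs outside a notebook, where `display()` is not defined.

```python
import matplotlib.pyplot as plt

# Hypothetical data for illustration only.
xs = [0, 1, 2, 3]
ys = [0, 1, 4, 9]

fig, ax = plt.subplots()
ax.plot(xs, ys, marker="o")
ax.set_title("Example figure")
ax.set_xlabel("x")
ax.set_ylabel("y")

try:
    # display() is supplied by the notebook environment, not by matplotlib.
    display(fig)
except NameError:
    print("display() is not available outside a notebook")
```
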
How can I install or upgrade my pandas or <library-name> version?

There are a few options:

  • Use the Azure Databricks library UI or API. These install the library on every node in the cluster.

  • Use Library utilities.

  • Run %sh /databricks/python/bin/pip install <library-name>.

    This command installs the library on the driver only, not on the workers. The library is also removed when you restart the cluster.

Why does %sh pip install <library-name> install the Python 2 version, even if I’m running on a Python 3 cluster?
The default pip on the path belongs to Python 2. To install a library for Python 3, call the Python 3 pip explicitly with %sh /databricks/python/bin/pip.
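
One way to check which interpreter a notebook is actually running on, and to avoid depending on whichever `pip` comes first on the path, is to go through `sys.executable`. This is a generic Python technique rather than anything specific to Azure Databricks, and `pip_command` is a hypothetical helper written for this sketch.

```python
import sys


def pip_command(*args: str) -> list:
    """Build a pip invocation tied to the interpreter running this code."""
    return [sys.executable, "-m", "pip", *args]


print("Interpreter:", sys.executable)
print("Python major version:", sys.version_info.major)

# Only builds the command; pass the list to subprocess.run(...) to execute it.
print(pip_command("--version"))
```

Like the %sh approach, a command built this way would affect only the machine it runs on, which in a notebook is the driver.
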
How can I view data on DBFS with just the driver?
Add /dbfs/ to the beginning of the file path. See Local file APIs.
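
The prefix rule can be sketched as a small helper. The function name and the example paths are hypothetical, and handling the `dbfs:` scheme is an assumption about how your paths may be written.

```python
def to_driver_local_path(dbfs_path: str) -> str:
    """Return the /dbfs/... form of a DBFS path for use with local file APIs."""
    if dbfs_path.startswith("/dbfs/"):
        # Already in local form; leave it alone.
        return dbfs_path
    if dbfs_path.startswith("dbfs:"):
        # Drop the URI scheme, keeping the path that follows it.
        dbfs_path = dbfs_path[len("dbfs:"):]
    return "/dbfs/" + dbfs_path.lstrip("/")


# Hypothetical paths for illustration only.
print(to_driver_local_path("dbfs:/mnt/example/data.csv"))
print(to_driver_local_path("/mnt/example/data.csv"))
```

The returned string can then be given to ordinary single-node readers on the driver, for example `open(...)` or `pandas.read_csv(...)`.
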
How can I get data into Azure Databricks?