Azure Databricks Concepts

This topic introduces the set of fundamental concepts you need to understand in order to use Azure Databricks effectively.

Workspace

A Workspace is an environment for accessing all of your Azure Databricks assets. A Workspace organizes notebooks, libraries, dashboards, and experiments into folders and provides access to data objects and computational resources.

This section describes the objects contained in the Azure Databricks Workspace folders.

Notebook

A web-based interface to documents that contain runnable commands, visualizations, and narrative text.

Command
Code that runs in a notebook. A command operates on files and tables. Commands can be run in sequence, referring to the output of one or more previously run commands.
Visualization
A graphical rendering of table data and the output of notebook commands.
Archive
A package of notebooks that can be exported from and imported into Azure Databricks.
Dashboard
An interface that provides organized access to visualizations.
Library
A package of code available to the notebook or job running on your cluster. Databricks runtimes include many libraries and you can add your own.
Experiment
A collection of MLflow runs for training a machine learning model.
Databricks File System (DBFS)
A filesystem abstraction layer over a blob store. It contains directories, which can contain files (data files, libraries, and images), and other directories. DBFS is automatically populated with some datasets that you can use to learn Azure Databricks.

Interaction Management

This section describes the methods that Azure Databricks supports for interacting with all of your assets.

UI

The Azure Databricks UI provides an easy-to-use graphical interface to Workspace folders and their contained objects, data objects, and computational resources.

API
You can interact with Azure Databricks assets using the REST API. There are two versions: REST API 2.0 and REST API 1.2. REST API 2.0 supports most of the functionality of REST API 1.2, adds functionality of its own, and is the preferred version.
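As a sketch of how a REST API 2.0 call is assembled, the snippet below builds an authenticated GET request with Python's standard library. The workspace URL and personal access token are placeholders, and the `clusters/list` endpoint is used only as an example:

```python
import urllib.request

# Placeholder workspace URL and personal access token -- substitute your own.
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapiXXXXXXXXXXXXXXXX"

def build_request(endpoint: str) -> urllib.request.Request:
    """Build an authenticated GET request against the REST API 2.0."""
    return urllib.request.Request(
        url=f"{WORKSPACE_URL}/api/2.0/{endpoint}",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )

# For example, a request that lists the clusters in the workspace:
req = build_request("clusters/list")
# response = urllib.request.urlopen(req).read()  # requires a live workspace
```

Authentication uses a bearer token in the `Authorization` header; the request is only constructed here, since sending it requires a real workspace.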
CLI
The Databricks command-line interface (CLI) provides an easy-to-use interface to the Azure Databricks platform. The open source project is hosted on GitHub. The CLI is built on top of the REST API 2.0.

Data Management

This section describes the objects that hold the data on which you perform analytics and feed into machine learning algorithms.

Database
A collection of information that is organized so that it can be easily accessed, managed, and updated.
Table
A representation of structured data. You query tables with Spark SQL and Apache Spark APIs. A table typically consists of multiple partitions.
Partition
A portion of a table. By dividing a large table into smaller parts, queries that access only a fraction of the data can run faster because less data needs to be scanned.
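A simplified model of partition pruning, in plain Python rather than Spark: rows are grouped into partitions keyed by a column value (here a hypothetical `date` column), and a query that filters on that column reads only the matching partition instead of scanning the whole table:

```python
# Rows grouped into partitions by the "date" column (illustrative data).
partitioned_table = {
    "2019-06-01": [{"date": "2019-06-01", "sales": 100}],
    "2019-06-02": [{"date": "2019-06-02", "sales": 250}],
    "2019-06-03": [{"date": "2019-06-03", "sales": 175}],
}

def query(table, date):
    """Scan only the partition for `date`, not the whole table."""
    return table.get(date, [])

rows = query(partitioned_table, "2019-06-02")
```

A Spark query with a filter on the partition column benefits in the same way: partitions whose key cannot match the filter are skipped entirely.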
Metastore
The component that stores all the structure information of the various tables and partitions in the data warehouse, including column and column-type information, the serializers and deserializers necessary to read and write data, and the corresponding files where the data is stored.

Computation Management

This section describes concepts that you need to know to run computations in Azure Databricks.

Cluster

A set of computation resources and configurations on which you run notebooks and jobs. There are two types of clusters: interactive and job.

  • You create an interactive cluster using the UI, CLI, or REST API. You can manually terminate and restart an interactive cluster. Multiple users can share such clusters to do collaborative interactive analysis.
  • The Azure Databricks job scheduler creates a job cluster when you run a job on a new cluster and terminates the cluster when the job is complete. You cannot restart a job cluster.
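As an illustration of what a cluster configuration looks like, the dictionary below sketches a request body for the Clusters API 2.0 `clusters/create` endpoint. The runtime version and node type are example values, not recommendations; check your workspace for what is available:

```python
# Illustrative `clusters/create` request body; all values are examples.
cluster_spec = {
    "cluster_name": "interactive-analysis",
    "spark_version": "5.3.x-scala2.11",   # a Databricks Runtime version
    "node_type_id": "Standard_DS3_v2",    # an Azure VM type
    "num_workers": 2,
    "autotermination_minutes": 60,        # terminate after an hour idle
}
```

The same fields can be supplied through the UI or the CLI; an interactive cluster created this way keeps running until it is terminated manually or by the auto-termination timeout.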
Databricks runtime

The set of core components that run on the clusters managed by Azure Databricks. Azure Databricks offers several types of runtimes:

  • Databricks Runtime includes Apache Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics.
  • Databricks Runtime with Conda is an experimental Databricks runtime based on Conda. Databricks Runtime with Conda provides an updated and optimized list of default packages and a flexible Python environment for advanced users who require maximum control over packages and environments.
  • Databricks Runtime for Machine Learning is built on Databricks Runtime and provides a ready-to-go environment for machine learning and data science. It contains multiple popular libraries, including TensorFlow, Keras, PyTorch, and XGBoost.
  • Databricks Runtime for Health and Life Sciences is a version of Databricks Runtime optimized for working with genomic and biomedical data.
  • Databricks Light is the Azure Databricks packaging of the open source Apache Spark runtime. It provides a runtime option for jobs that don’t need the advanced performance, reliability, or autoscaling benefits provided by Databricks Runtime. You can select Databricks Light only when you create a cluster to run a JAR, Python, or spark-submit job; you cannot select this runtime for clusters on which you run interactive or notebook job workloads.
Job
A non-interactive mechanism for running a notebook or library either immediately or on a scheduled basis.
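As a sketch of how a scheduled job is specified, the settings below describe a job that runs a notebook nightly at 02:00 on a new job cluster, in the shape of a Jobs API `jobs/create` payload. The notebook path and cluster values are placeholders:

```python
# Illustrative `jobs/create` settings: run a notebook every night at 02:00
# on a new job cluster. All values below are placeholders.
job_spec = {
    "name": "nightly-etl",
    "new_cluster": {
        "spark_version": "5.3.x-scala2.11",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 4,
    },
    "notebook_task": {"notebook_path": "/Users/someone@example.com/etl"},
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # 02:00 daily
        "timezone_id": "UTC",
    },
}
```

Because `new_cluster` is specified rather than an existing cluster ID, the job scheduler creates a job cluster for each run and terminates it when the run completes, as described above.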
Workloads

Azure Databricks identifies two types of workloads subject to different pricing schemes: data engineering (aka automated) and data analytics (aka interactive).

  • Data engineering (aka automated) workloads run on job clusters which the Azure Databricks job scheduler creates for each workload.
  • Data analytics (aka interactive) workloads run on interactive clusters. Interactive workloads typically run commands within an Azure Databricks notebook. However, running a job on an existing cluster is also treated as an interactive workload.
Execution context
The state for a REPL environment for each supported programming language. The languages supported are Python, R, Scala, and SQL.

Model Management

This section describes concepts that you need to know to train machine learning models.

Model
A set of known dimensions that serves as the framework for training a machine to make predictions; the initial structure imposed on a function before training.
Trained Model
The outcome of the training process. A mathematical mapping from input to output.
Run
A collection of parameters, metrics, and tags related to training a machine learning model.

Authentication and Authorization

This section describes concepts that you need to know when you manage Azure Databricks users and their access to Azure Databricks assets.

User
A unique individual who has access to the system.
Group
A collection of users.
Access control list (ACL)
A list of permissions attached to the Workspace, cluster, job, table, or experiment. An ACL specifies which users or system processes are granted access to an object and which operations are allowed on it. Each entry in a typical ACL specifies a subject and an operation.
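A minimal sketch of how such a list can be represented and checked, with made-up subjects and operation names: each entry pairs a subject (a user or group) with an operation it may perform on the object the ACL is attached to.

```python
# Hypothetical ACL entries: (subject, operation) pairs.
acl = [
    ("alice@example.com", "CAN_MANAGE"),
    ("data-engineers", "CAN_RUN"),
    ("data-engineers", "CAN_READ"),
]

def is_allowed(acl, subject, operation):
    """Grant access only if (subject, operation) appears in the ACL."""
    return (subject, operation) in acl

allowed = is_allowed(acl, "data-engineers", "CAN_RUN")     # granted
denied = is_allowed(acl, "data-engineers", "CAN_MANAGE")   # not granted
```

Real ACL checks also resolve group membership, so a user inherits the permissions of every group they belong to; that step is omitted here for brevity.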