Jobs

A job is a way of running a notebook or JAR either immediately or on a scheduled basis. You can create and run jobs using the UI, the CLI, and by invoking the Jobs API. Similarly, you can monitor job run results in the UI, using the CLI, by querying the API, and through email alerts. This topic focuses on performing job tasks using the UI. For the other methods, see Databricks CLI and Jobs API.

Important

The number of jobs is limited to 1000.

View jobs

Click the Jobs icon Jobs Menu in the sidebar. The Jobs page displays, listing all defined jobs, the cluster definition, the schedule (if any), and the result of the last run.

In the Jobs list, you can filter jobs:

  • Using keywords.
  • Selecting only jobs you own or jobs you have access to. Access to this filter requires that Jobs Access Control is enabled.

You can also click any column header to sort the list of jobs (in descending or ascending order) by that column. By default, the page is sorted by job name in ascending order.

Job List

Create a job

  1. Click + Create Job. The job detail page displays.

    Job Conf

  2. Enter a name in the text field with the placeholder text Untitled.

  3. Specify the job properties:

    • The notebook or JAR to run.

      Note

      There are some significant differences between running notebook and JAR jobs. See JAR job tips for more information.

    • Job parameters. Click Edit next to Parameters to add the parameters, either as key-value pairs or as a JSON object.

    • Dependent libraries. Click Add next to Dependent Libraries. The dependent libraries are automatically attached to the cluster on launch. Follow the recommendations in Library dependencies for specifying dependencies.

    • The cluster the job will run on: you can launch a new cluster each time the job runs, or select an existing cluster.

      Note

      • There is a tradeoff between running on an existing cluster and a new cluster. We recommend running production-level jobs, and jobs that are important to complete, on a new cluster. Existing clusters work best for tasks such as updating dashboards at regular intervals.

      • New in version 2.71: If you select a terminated cluster, and the job owner has Can Restart permission, the cluster is started when the job is scheduled to run.

    • Optional spark-submit parameters. Click Configure spark-submit to open the Set Parameters dialog, where you can enter spark-submit parameters as a JSON array.

      Set Parameters

View job details

On the Jobs page, click a job name in the Name column. The job details page shows configuration parameters, active runs, and completed runs.

../_images/job-details.png

Run a job

You can run a job on a schedule or immediately.

  • To define a schedule for the job, click Edit next to Schedule.

    ../_images/job-schedule.png

    The Azure Databricks job scheduler, like the Spark batch interface, is not intended for low-latency jobs. Due to network or cloud issues, job runs may occasionally be delayed by up to several minutes. In these situations, scheduled jobs run as soon as the service becomes available.

  • To run the job immediately, in the Active runs table click Run Now.

    ../_images/run-option.png

    Tip

    Click Run Now to do a test run of your notebook or JAR when you’ve finished configuring your job. If your notebook fails, you can edit it and the job will automatically run the new version of the notebook.

Run a notebook job with different parameters

  1. In the Active runs table, click Run Now with Different Parameters.

    Run Now With Different Params

  2. Specify the parameters. The provided parameters are merged with the default parameters for the triggered run. If you delete keys, the default parameters are used.

  3. Click Run.

JAR job tips

There are some caveats you need to be aware of when you run a JAR job.

Use the shared SparkContext

Because Databricks is a managed service, some code changes may be necessary to ensure that your Apache Spark jobs run correctly. JAR job programs must use the shared SparkContext API to get the SparkContext. Because Databricks initializes the SparkContext, programs that invoke new SparkContext() will fail. To get the SparkContext, use only the shared SparkContext created by Databricks:

val goodSparkContext = SparkContext.getOrCreate()
val goodSparkSession = SparkSession.builder().getOrCreate()
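
For example, a JAR job entry point that reuses the shared context might look like the following minimal sketch (the object and variable names are illustrative, not part of the Databricks API):

import org.apache.spark.sql.SparkSession

object ExampleJob {
  def main(args: Array[String]): Unit = {
    // Reuse the session and context that Databricks has already created.
    val spark = SparkSession.builder().getOrCreate()
    val sc = spark.sparkContext

    // Do the job's work; do not stop the context when you are done.
    val rowCount = spark.range(0, 1000).count()
    println(s"Processed $rowCount rows in application ${sc.applicationId}")
  }
}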

Warning

There are several methods you must avoid when using the shared SparkContext.

  • Do not manually create a SparkContext using the constructor:

    import org.apache.spark.SparkConf
    val badSparkContext = new SparkContext(new SparkConf().setAppName("My Spark Job").setMaster("local"))
    
  • Do not stop SparkContext inside your JAR:

    val dontStopTheSparkContext = SparkContext.getOrCreate()
    dontStopTheSparkContext.stop()
    
  • Do not call System.exit(0) or sc.stop() at the end of your main program. This can cause undefined behavior.

Configure JAR job parameters

JAR jobs are parameterized with an array of strings. In the UI, you enter the parameters in the Arguments text box; they are split into an array by applying POSIX shell parsing rules. For more information, see the shlex documentation. In the API, you input the parameters as a standard JSON array; for more information, see SparkJarTask. To access these parameters, inspect the String array passed into your main function.
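
For example, entering --input /data/events --date 2018-06-01 in the Arguments text box (illustrative values) yields Array("--input", "/data/events", "--date", "2018-06-01") in your main function. A minimal sketch of reading them:

object ExampleJarJob {
  def main(args: Array[String]): Unit = {
    // args holds the job parameters after POSIX shell splitting.
    args.grouped(2).foreach {
      case Array(flag, value) => println(s"$flag -> $value")
      case Array(flag)        => println(s"$flag (no value)")
    }
  }
}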

View a job run

Databricks maintains a history of your job runs for up to 60 days. If you need to preserve job runs, we recommend that you export job run results before they expire. For more information, see Export job run results.

To view job run information, in the Jobs page, click the job name in the Name column.

../_images/job-run-list.png

In the job run page, you can view the standard error, standard output, and log4j output for the job run by clicking the Logs link in the Spark column.

In the job run page, click the run number in the Run column of the Completed in past 60 days table to see the relevant details and job output.

Export job run results

You can export notebook run results and job run logs for all job types.

Export notebook run results

You can persist job runs by exporting their results. For notebook job runs, you can export a rendered notebook that can later be imported into your Databricks workspace.

  1. In the job detail page, click a job run name in the Run column.

    ../_images/job-run.png
  2. Click Export to HTML.

    ../_images/export-notebook-run.png

Export job run logs

You can also export the logs for your job run. To automate this process, you can set up your job so that it automatically delivers logs to DBFS through the Jobs API. For more information, see the NewCluster and ClusterLogConf fields in the Job Create API call.
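
As a rough sketch of what that configuration looks like, the snippet below calls the Jobs API create endpoint with a new_cluster whose cluster_log_conf points at a DBFS destination. The job name, notebook path, node type, Spark version, and log path are placeholders, and the Scala HTTP plumbing (java.net.http, Java 11+) is just one convenient way to send the request; consult the Jobs API reference for the authoritative field names.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object CreateJobWithLogDelivery {
  def main(args: Array[String]): Unit = {
    // Workspace URL and personal access token are read from the environment.
    val host  = sys.env("DATABRICKS_HOST")   // e.g. https://<your-workspace>.azuredatabricks.net
    val token = sys.env("DATABRICKS_TOKEN")

    // cluster_log_conf inside new_cluster asks Databricks to copy driver and
    // executor logs to the given DBFS path; the other values are placeholders.
    val payload =
      """{
        |  "name": "nightly-etl",
        |  "new_cluster": {
        |    "spark_version": "<spark-version>",
        |    "node_type_id": "<node-type>",
        |    "num_workers": 2,
        |    "cluster_log_conf": { "dbfs": { "destination": "dbfs:/cluster-logs/nightly-etl" } }
        |  },
        |  "notebook_task": { "notebook_path": "/Jobs/NightlyETL" }
        |}""".stripMargin

    val request = HttpRequest.newBuilder()
      .uri(URI.create(s"$host/api/2.0/jobs/create"))
      .header("Authorization", s"Bearer $token")
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(payload))
      .build()

    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body())   // the response includes the new job_id on success
  }
}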

Edit a job

To edit a job, click the job name link in the Jobs list.

Delete a job

To delete a job, click the x in the Action column in the Jobs list.

Library dependencies

The Spark driver for Databricks has certain library dependencies that cannot be overridden. These libraries will take priority over any of your own libraries that conflict with them.

To get the full list of the driver library dependencies, run the following command inside a notebook attached to a cluster of the same Spark version (or the cluster with the driver you want to examine).

%sh ls /databricks/jars

Manage library dependencies

A good rule of thumb when dealing with library dependencies while creating JARs for jobs is to list Spark and Hadoop as provided dependencies. In Maven, add Spark and/or Hadoop as provided dependencies, as shown below.

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.5.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.2.1</version>
  <scope>provided</scope>
</dependency>

In sbt, add Spark and/or Hadoop as provided dependencies as shown below.

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
libraryDependencies += "org.apache.hadoop" %% "hadoop-core" % "1.2.1" % "provided"

Tip

Specify the Scala version for your dependencies to match the Scala version of the Spark build you are running.
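
For example, in an sbt build you can pin the Scala version once and let the %% operator append it to cross-built artifact names (the versions shown are illustrative and should match your cluster):

// Scala version of the Spark build you are targeting (illustrative).
scalaVersion := "2.10.6"

// %% resolves spark-core_2.10 to match scalaVersion above;
// use a plain % for Java-only artifacts such as hadoop-core.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
libraryDependencies += "org.apache.hadoop" %  "hadoop-core" % "1.2.1" % "provided"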

Job options

The other options that you can specify for a job include:

Alerts
Email alerts sent in case of job failure, success, or timeout. See Job alerts.
Timeout
The maximum completion time for a job. If the job does not complete in this time, Databricks sets its status to “Timed Out”.
Retries

Policy that determines when and how many times failed runs are retried.

../_images/retry-policy.png
Maximum concurrent runs
The maximum number of runs that can be run in parallel. When starting a new run, Databricks skips the run if the job has already reached its maximum number of active runs. Set this value higher than the default of 1 if you want to perform multiple runs of the same job concurrently. This is useful, for example, if you trigger your job on a frequent schedule and want to allow consecutive runs to overlap, or if you want to trigger multiple runs that differ in their input parameters.

Job alerts

You can set up email alerts for job runs. On the job detail page, click Advanced and click Edit next to Alerts. You can send alerts upon job start, job success, and job failure (including skipped jobs), providing multiple comma-separated email addresses for each alert type. You can also opt out of alerts for skipped job runs.

../_images/job-alerts.png

You can integrate these email alerts with your favorite notification tools.

Control access to jobs

Job access control enables job owners and administrators to grant fine-grained permissions on their jobs. With job access control, job owners can choose which other users or groups can view the results of the job. Owners can also choose who can manage runs of their job (that is, invoke Run Now and Cancel).

See Jobs Access Control for details.