RNASeq Pipeline

The Databricks RNASeq pipeline handles short-read alignment and quantification using STAR and ADAM.


The Azure Databricks RNASeq pipeline requires Databricks Runtime HLS, which is in Beta. Interfaces and pricing are subject to change before general availability.

We recommend running the RNASeq pipeline as an Azure Databricks job. When run interactively, you are charged per DBU as well as per giga base pair.


The pipeline is run as an Azure Databricks job. Most likely, an Azure Databricks employee will set up the initial job for you. The necessary details are:

  • The cluster configuration should use Databricks Runtime HLS.
  • The task should be the RNASeq notebook provided to you.
  • For best performance, use compute-optimized VMs with at least 60 GB of memory. We recommend Standard_F32s_v2 VMs.
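
As an illustration of the settings above, here is a minimal sketch that assembles a cluster definition as a plain dictionary. The field names `spark_version`, `node_type_id`, and `num_workers` follow the Databricks Clusters API. The helper `build_cluster_spec`, the worker count, and the `<hls-runtime-version>` placeholder are assumptions made for this example; substitute the Databricks Runtime HLS version string available in your workspace.

```python
import json


def build_cluster_spec(runtime_version, num_workers, node_type="Standard_F32s_v2"):
    """Return a cluster definition shaped like a Clusters API `new_cluster` object.

    `runtime_version` must be the Databricks Runtime HLS version key from your
    workspace; this sketch does not know or validate the real value.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return {
        "spark_version": runtime_version,  # placeholder, not a real version key
        "node_type_id": node_type,         # compute-optimized VM recommended above
        "num_workers": num_workers,        # hypothetical size, tune for your data
    }


spec = build_cluster_spec("<hls-runtime-version>", num_workers=4)
print(json.dumps(spec, indent=2))
```

The dictionary is only printed, not submitted, so you can review it before pasting it into a job definition.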


The pipeline accepts a number of parameters that control its behavior. The most important and commonly changed parameters are documented here; the rest can be found in the RNASeq notebook. All parameters can be set for all runs or per-run.

Parameter   Default   Description
manifest    n/a       The path of the manifest file describing the input.
output      n/a       The path where pipeline output should be written.
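
To show how per-run parameters might be supplied, the sketch below builds a request body for the Databricks Jobs API `run-now` endpoint (`POST /api/2.0/jobs/run-now`), which accepts a `job_id` and a `notebook_params` map. The helper `build_run_now_payload`, the job ID, and both paths are hypothetical values chosen for illustration. The sketch only constructs and prints the payload; it does not send a request.

```python
import json


def build_run_now_payload(job_id, manifest, output):
    """Build a Jobs API `run-now` request body with per-run notebook parameters."""
    for name, value in (("manifest", manifest), ("output", output)):
        if not value:
            # Both parameters have no default, so each run must supply them.
            raise ValueError(f"'{name}' is required and has no default")
    return {
        "job_id": job_id,
        "notebook_params": {
            "manifest": manifest,
            "output": output,
        },
    }


payload = build_run_now_payload(
    job_id=123,                                  # hypothetical job ID
    manifest="dbfs:/tmp/rnaseq/manifest.csv",    # hypothetical path
    output="dbfs:/tmp/rnaseq/output",            # hypothetical path
)
print(json.dumps(payload, indent=2))
```

Parameters that should apply to all runs can instead be set once on the job's notebook task, leaving `notebook_params` for values that change from run to run.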

In addition, you must configure the reference genome using environment variables. To use GRCh37, set the environment variable:


To use GRCh38 instead, set an environment variable like this:



The pipeline consists of two steps:

  1. Alignment: Map each short read to the reference genome using the STAR aligner.
  2. Quantification: Count how many reads correspond to each reference transcript.
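
The idea behind the quantification step can be sketched with a toy example. This is not how the pipeline implements it: real quantification must also deal with multi-mapped reads, strandedness, and transcript overlap. The function `count_reads_per_transcript`, the read IDs, and the transcript IDs are all made up for illustration.

```python
from collections import Counter


def count_reads_per_transcript(alignments):
    """Count aligned reads per transcript.

    `alignments` is an iterable of (read_id, transcript_id) pairs, where
    transcript_id is None for reads that did not align to any transcript.
    """
    counts = Counter()
    for _read_id, transcript_id in alignments:
        if transcript_id is not None:  # skip unaligned reads
            counts[transcript_id] += 1
    return counts


# Hypothetical output of the alignment step.
alignments = [
    ("read1", "tx_A"),
    ("read2", "tx_B"),
    ("read3", "tx_A"),
    ("read4", None),
]
counts = count_reads_per_transcript(alignments)
print(dict(counts))  # {'tx_A': 2, 'tx_B': 1}
```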

Additional usage info

The operational aspects of the RNASeq pipeline are very similar to the DNASeq pipeline. For more information about manifest format, output structure, and programmatic usage, see DNASeq Pipeline.