DNASeq Pipeline

The Azure Databricks DNASeq pipeline is a GATK best practices compliant pipeline for short read alignment, variant calling, and variant annotation.

Important

The Azure Databricks DNASeq pipeline is currently in Beta. Interfaces and pricing are subject to change before general availability.

We recommend running the DNASeq pipeline as an Azure Databricks job. When run interactively, you are charged per DBU as well as per gigabase pair.

Setup

The pipeline is run as an Azure Databricks job. In most cases, an Azure Databricks employee will set up the initial job for you. The necessary details are listed below; a scripted version of the same setup is sketched after the list.

  • The cluster configuration should use Databricks Runtime HLS.
  • The task should be the DNASeq notebook found at the bottom of this page.
  • For best performance, use compute-optimized VMs with at least 60 GB of memory. We recommend Standard_D16s_v3 VMs.
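
If you're setting up the job yourself, the sketch below shows one way to script it against the Jobs REST API (POST /api/2.0/jobs/create). The workspace URL, token, notebook path, worker count, and runtime version string are all placeholders, not values from this guide; substitute your own.

# A minimal sketch of creating the pipeline job through the Jobs REST API.
# All angle-bracketed values and the notebook path are placeholders.
import requests

workspace = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

job_spec = {
    "name": "dnaseq-pipeline",
    "new_cluster": {
        "spark_version": "<databricks-runtime-hls-version>",  # pick the HLS runtime
        "node_type_id": "Standard_D16s_v3",
        "num_workers": 8,  # size to your workload
    },
    "notebook_task": {"notebook_path": "/Shared/dnaseq-pipeline"},  # hypothetical path
}

resp = requests.post(
    f"{workspace}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job", resp.json()["job_id"])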

Parameters

The pipeline accepts a number of parameters that control its behavior. The most important and most commonly changed parameters are documented here; the rest can be found in the DNASeq notebook. Each parameter can be set once for all runs or overridden per run; an example per-run override follows the table.

Parameter                Default  Description
manifest                 n/a      The path of the manifest file describing the input.
output                   n/a      The path where pipeline output should be written.
exportVCF                false    If true, the pipeline writes results in VCF as well as Parquet.
referenceConfidenceMode  NONE     • If NONE, only variant sites are included in the output.
                                  • If GVCF, all sites are included, with adjacent reference sites banded.
                                  • If BP_RESOLUTION, all sites are included.
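
As an illustration, a per-run override might look like the notebook parameters map below, which is passed to the job when you start a run (see Running programmatically). The paths are placeholders.

# Hypothetical per-run parameter overrides, passed as notebook parameters
# when starting a run of the pipeline job.
notebook_params = {
    "manifest": "dbfs:/genomics/manifests/my-manifest.csv",  # placeholder path
    "output": "dbfs:/genomics/output_dir",                   # placeholder path
    "exportVCF": "true",                # also write VCF alongside Parquet
    "referenceConfidenceMode": "GVCF",  # band adjacent reference sites
}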

In addition, you can configure the reference genome build using environment variables. By default, the pipeline runs against GRCh37. To use GRCh38 instead, set an environment variable like this:

refGenomeId=grch38
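
In a job cluster specification, environment variables live in the cluster's spark_env_vars map. A minimal sketch (all other cluster settings elided):

# Sketch: selecting GRCh38 via the cluster's environment variables.
new_cluster = {
    # ... other cluster settings ...
    "spark_env_vars": {"refGenomeId": "grch38"},
}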

Manifest Format

The manifest is a CSV file describing where to find the input FASTQ or BAM files. An example:

file_path,sample_id,paired_end,read_group_id
*_R1_*.fastq.bgz,HG001,1,read_group
*_R2_*.fastq.bgz,HG001,2,read_group

If your input consists of unaligned BAM files, leave the paired_end field empty:

file_path,sample_id,paired_end,read_group_id
*.bam,HG001,,read_group

Tip

The file_path field in each row may be an absolute path or a path relative to the manifest. You can include globs (*) to match many files.
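
If you generate manifests programmatically, a quick check like the sketch below can catch missing columns or empty fields before you start a run. The column requirements follow the examples above; the manifest file name is a placeholder.

# Sketch: validate a manifest's header and required fields before submission.
import csv

REQUIRED = ["file_path", "sample_id", "paired_end", "read_group_id"]

with open("manifest.csv", newline="") as f:  # placeholder file name
    reader = csv.DictReader(f)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"manifest is missing columns: {missing}")
    for line_no, row in enumerate(reader, start=2):
        # paired_end may be empty (unaligned BAM input); the rest must be set
        for col in ["file_path", "sample_id", "read_group_id"]:
            if not row[col]:
                raise ValueError(f"line {line_no}: empty {col}")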

Output

The aligned reads, called variants, and annotated variants are all written out to Parquet tables inside the provided output directory. Each table is partitioned by sample ID. In addition, if you configured the pipeline to export VCFs or GVCFs, they’ll appear under the output directory as well.

output
|---alignments
    |---sampleId=HG001
        |---Parquet files
    |---sampleId=HG002
|---annotations
    |---sampleId=HG001
        |---Parquet files
|---annotations.vcf
    |---sampleId=HG001
        |---HG001.vcf
|---genotypes
    |---sampleId=HG001
        |---Parquet files
|---genotypes.vcf
    |---sampleId=HG001
        |---HG001.g.vcf

When you run the pipeline on a new sample, it’ll appear as a new partition. If you run the pipeline for a sample that already appears in the output directory, that partition will be overwritten.

Since all the information is available in Parquet, you can easily analyze it with Spark in SQL, Scala, Python, or R. For example:

# Load the data
df = spark.read.parquet("/genomics/output_dir/genotypes")
# Show all variants from chromosome 12
display(df.where("contigName == '12'").orderBy("sampleId", "start"))

Or register the output as a table so you can query it with SQL:

-- Register the table in the catalog
CREATE TABLE genotypes
USING PARQUET
LOCATION '/genomics/output_dir/genotypes'
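
Because each table is partitioned by sample ID, filtering on sampleId reads only that sample's partition. As a sketch (column names follow the example above), here is a quick per-chromosome site count for one sample:

# Count called sites per contig for a single sample; the sampleId filter
# prunes the read down to that sample's partition.
from pyspark.sql import functions as F

genotypes = spark.read.parquet("/genomics/output_dir/genotypes")
display(
    genotypes
    .where(F.col("sampleId") == "HG001")
    .groupBy("contigName")
    .agg(F.count("*").alias("num_sites"))
    .orderBy("contigName")
)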

Running programmatically

In addition to using the UI, you can start runs of the pipeline programmatically using the Databricks CLI.

Find the job ID

After setting up the pipeline job in the UI, copy the job ID; you'll pass it to the jobs run-now CLI command.

Here’s an example bash script that you can adapt for your workflow:

# Generate a manifest file
cat <<HERE >manifest.csv
file_path,sample_id,paired_end,read_group_id
dbfs:/genomics/my_new_sample/*_R1_*.fastq.bgz,my_new_sample,1,read_group
dbfs:/genomics/my_new_sample/*_R2_*.fastq.bgz,my_new_sample,2,read_group
HERE

# Upload the manifest to DBFS
DBFS_PATH=dbfs:/genomics/manifests/$(date +"%Y-%m-%dT%H-%M-%S")-manifest.csv
databricks fs cp manifest.csv $DBFS_PATH

# Kick off a new run
databricks jobs run-now --job-id <job-id> --notebook-params "{\"manifest\": \"$DBFS_PATH\"}"

In addition to starting runs from the command line, you can use this pattern to invoke the pipeline from automated systems like Jenkins.
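
If a shell isn't convenient, for example inside a Python-based scheduler, the same run can be started through the Jobs REST API (POST /api/2.0/jobs/run-now). A minimal sketch with placeholder workspace URL, token, job ID, and manifest path:

# Sketch: trigger a pipeline run via the Jobs REST API.
import requests

workspace = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

resp = requests.post(
    f"{workspace}/api/2.0/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "job_id": 123,  # the job ID you copied from the UI
        "notebook_params": {"manifest": "dbfs:/genomics/manifests/manifest.csv"},
    },
)
resp.raise_for_status()
print("Started run", resp.json()["run_id"])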