Joint Genotyping Pipeline

The Azure Databricks joint genotyping pipeline is a GATK best practices compliant pipeline for joint genotyping using GenotypeGVCFs.

Beta

The Azure Databricks joint genotyping pipeline requires Databricks Runtime HLS, which is in Beta. Interfaces and pricing are subject to change before general availability.

We recommend running the joint genotyping pipeline as an Azure Databricks job.

Walkthrough

The pipeline consists of the following steps:

  1. Ingest variants into Delta Lake.
  2. Joint-call the cohort with GenotypeGVCFs.

During variant ingest, single-sample gVCFs are processed in batches and the rows are stored in Delta Lake to provide fault tolerance, fast querying, and incremental joint genotyping. In the joint genotyping step, the gVCF rows are read back from Delta Lake, split into bins, and distributed to partitions. For each variant site, the relevant gVCF rows per sample are identified and used for regenotyping.
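
After ingest, you can sanity-check the Delta table before genotyping. The following is a minimal sketch assuming a Databricks notebook (where spark is predefined); the gvcfDeltaOutput path is a placeholder.

```python
# Minimal sketch: inspect the gVCF rows ingested into Delta Lake.
# The path is a placeholder; point it at your gvcfDeltaOutput location.
gvcf_delta_path = "dbfs:/genomics/joint-genotyping/gvcf-delta"  # hypothetical path

gvcf_df = spark.read.format("delta").load(gvcf_delta_path)
gvcf_df.printSchema()                         # column layout depends on the pipeline version
print(gvcf_df.count(), "gVCF rows ingested")
```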

Setup

The pipeline is run as an Azure Databricks job. An Azure Databricks solutions architect will typically work with you to set up the initial job. The necessary details are (see the sketch after this list):

  • The cluster configuration should use Databricks Runtime HLS.
  • The task should be the joint genotyping pipeline notebook found at the bottom of this page.
  • For best performance, use storage-optimized VMs. We recommend Standard_L32s_v2.
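
The job can also be created programmatically through the Databricks Jobs API. This is a hedged sketch, not the pipeline's own setup procedure; the workspace URL, token, runtime version, notebook path, and worker count are placeholders you must replace.

```python
# Sketch of creating the joint genotyping job through the Databricks Jobs API.
# All bracketed values are placeholders for your workspace.
import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                               # placeholder

job_spec = {
    "name": "joint-genotyping-pipeline",
    "new_cluster": {
        "spark_version": "<databricks-runtime-hls-version>",  # placeholder; must be Databricks Runtime HLS
        "node_type_id": "Standard_L32s_v2",                   # storage-optimized VMs recommended
        "num_workers": 8,                                      # size to your cohort
    },
    "notebook_task": {
        "notebook_path": "/Shared/joint-genotyping-pipeline"   # hypothetical location of the pipeline notebook
    },
}

resp = requests.post(
    f"{workspace_url}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
print(resp.json())  # returns the job_id on success
```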

Parameters

The pipeline accepts parameters that control its behavior. The most important and commonly changed parameters are documented here. To view all available parameters and their usage information, run the first cell of the pipeline notebook. New parameters are added regularly. Parameters can be set for all runs or per-run.

| Parameter | Default | Description |
| --- | --- | --- |
| manifest | n/a | The path of the manifest file describing the input. |
| output | n/a | The path where pipeline output is written. |
| gvcfDeltaOutput | n/a | The path where the region-filtered input gVCFs are ingested to Delta Lake before genotyping. |
| exportVCF | false | If true, the pipeline writes results in VCF as well as Parquet. |
| targetedRegions | n/a | Path to files containing regions to call. If omitted, calls all regions. |
| genotypeGivenAlleles | false | If true, regenotypes variant sites based on the alleles in the input gVCFs. |
| emitAllSites | false | If true, retains low-quality sites in the output. |
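
Per-run values can be supplied when triggering a run of the job. The following is a hedged sketch using the Jobs API run-now endpoint; the job ID, token, and paths are placeholders.

```python
# Sketch: trigger a run with per-run parameters via the Jobs API run-now endpoint.
# The job_id, token, and paths are placeholders.
import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                               # placeholder

run_spec = {
    "job_id": 123,  # placeholder: the job created for the pipeline notebook
    "notebook_params": {
        "manifest": "dbfs:/genomics/manifests/cohort.txt",           # hypothetical manifest path
        "output": "dbfs:/genomics/joint-genotyping/output",          # hypothetical output path
        "gvcfDeltaOutput": "dbfs:/genomics/joint-genotyping/gvcf-delta",
        "exportVCF": "true",
    },
}

resp = requests.post(
    f"{workspace_url}/api/2.0/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json=run_spec,
)
print(resp.json())  # returns the run_id on success
```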

Tip

To keep rare variants, set genotypeGivenAlleles and emitAllSites to true. This is equivalent to changing the GATK settings genotyping_mode from DISCOVERY (choose the most probable alleles) to GENOTYPE_GIVEN_ALLELES (use the alleles present in the input gVCFs), and output_mode from EMIT_VARIANTS_ONLY (produces calls only at variant sites) to EMIT_ALL_SITES (produces calls at any callable site regardless of confidence).
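
As a concrete example, these are the parameter values implied by the tip; the names come from the table above, and they can be passed as shown in the run-now sketch earlier.

```python
# Parameter values to keep rare variants, per the tip above.
rare_variant_params = {
    "genotypeGivenAlleles": "true",  # GENOTYPE_GIVEN_ALLELES behavior
    "emitAllSites": "true",          # EMIT_ALL_SITES behavior
}
```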

Output

The regenotyped variants are written to Parquet tables inside the provided output directory. If you configured the pipeline to export VCFs, they appear under the output directory as well.

output
|---genotypes
    |---Parquet files
|---genotypes.vcf
    |---VCF files
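
Once a run finishes, the Parquet output can be loaded directly with Spark. This sketch assumes a Databricks notebook (spark and display are predefined); the output path is a placeholder for the value you passed as the output parameter.

```python
# Sketch: load the regenotyped calls from the Parquet output.
output_path = "dbfs:/genomics/joint-genotyping/output"  # hypothetical; your `output` parameter

genotypes = spark.read.parquet(output_path + "/genotypes")
genotypes.printSchema()          # schema depends on the pipeline version
display(genotypes.limit(10))     # preview a few rows
```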

Reference genomes

You must configure the reference genome using environment variables. To use GRCh37, set the environment variable:

refGenomeId=grch37

To use GRCh38, change grch37 to grch38.
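
One common way to set this environment variable is through the cluster specification used by the job. This is a hedged sketch; the runtime version and worker count are placeholders.

```python
# Sketch: set refGenomeId through the job cluster's spark_env_vars.
new_cluster = {
    "spark_version": "<databricks-runtime-hls-version>",  # placeholder; must be Databricks Runtime HLS
    "node_type_id": "Standard_L32s_v2",
    "num_workers": 8,                                      # placeholder
    "spark_env_vars": {
        "refGenomeId": "grch38",  # or "grch37"
    },
}
```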

Manifest format

The manifest is a file that lists the input gVCF files, one path per row.

Tip

Each row may be an absolute path or a path relative to the manifest. You can include globs (*) to match many files.
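
For example, a manifest can be written from a notebook with dbutils (available in Databricks notebooks). The paths below are hypothetical.

```python
# Sketch: write a small manifest listing input gVCFs, one path per row.
# All paths are hypothetical examples.
manifest_contents = "\n".join([
    "dbfs:/genomics/gvcfs/sample01.g.vcf.gz",  # absolute path
    "dbfs:/genomics/gvcfs/sample02.g.vcf.gz",
    "batch2/*.g.vcf.gz",                       # glob, relative to the manifest's location
])

dbutils.fs.put("dbfs:/genomics/manifests/cohort.txt", manifest_contents, True)  # True = overwrite
```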

Additional usage info

The joint genotyping pipeline shares many operational details with the other Azure Databricks pipelines. For more detailed usage information, such as output format structure, tips for running programmatically, and steps for setting up custom reference genomes, see DNASeq Pipeline.
