SAIGE and SAIGE-GENE

SAIGE is a command-line tool written in R for performing genetic association tests on biobank-scale data. You can run the single-variant association test method (SAIGE) and the region-based association test method (SAIGE-GENE) on Databricks Runtime HLS using the Pipe Transformer.

Beta

Running SAIGE requires Databricks Runtime HLS, which is in Beta.

Create a SAIGE cluster

Use a cluster initialization script to pull HTSlib, the SAIGE package, its dependencies, and the associated input files down to every node in the Spark cluster. We recommend storing these files in cloud storage and creating a mount point from which to download them. Replace the placeholder values in the script below with your mount point and local directory.
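For example, you can create the mount point from a notebook before launching the cluster. The following is a minimal sketch that assumes the cluster can access the bucket (for example, through an instance profile); the bucket and mount point names are placeholders.

# Minimal sketch: mount the cloud storage location that holds the SAIGE input files.
# The bucket and mount point names are placeholders.
dbutils.fs.mount(
  source = "s3a://<bucket>/<saige-input-dir>",
  mount_point = "/mnt/<mount-point>"
)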

#!/usr/bin/env bash

set -ex
set -o pipefail

# Install SKAT for SAIGE-GENE
sudo apt-get -y install libboost-all-dev autoconf
Rscript -e 'install.packages("SKAT", repos="http://cran.us.r-project.org")'

# Install SAIGE
cd /opt
git clone https://github.com/weizhouUMICH/SAIGE
cd SAIGE
Rscript extdata/install_packages.R
R CMD INSTALL *.tar.gz

# Install HTSLIB
cd /opt
git clone https://github.com/samtools/htslib
cd htslib
autoheader
autoconf
./configure
make
make install

# Sync SAIGE input files
aws=/databricks/python2/bin/aws
source /databricks/init/setup_mount_point_creds.sh <mount-point>
$aws s3 sync "$MOUNT_ROOT/<saige-input-dir>" "<local-dir>"
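One way to install the script is to save it to DBFS from a notebook and reference that path as a cluster-scoped init script when you create the cluster. The following is a minimal sketch; the destination path is a placeholder.

# Minimal sketch: write the init script above to DBFS so it can be referenced
# as a cluster-scoped init script. The destination path is a placeholder.
init_script = r"""
#!/usr/bin/env bash
# (contents of the installation script shown above)
"""
dbutils.fs.put("dbfs:/databricks/init-scripts/install-saige.sh", init_script, True)  # True = overwrite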

Run association tests

To parallelize the association tests with Spark, use the Databricks Pipe Transformer with the following configuration:

  • Command: A JSON array containing a SAIGE or SAIGE-GENE script

  • Input formatter: VCF

  • Output formatter: CSV

    • Header flag: true
    • Delimiter: space

import dbgenomics as dbg
import json

# saige_script holds the shell script to pipe (see the example scripts below);
# input_vcf is the path of a VCF file used to supply the header for the piped rows.
cmd = json.dumps(["bash", "-c", saige_script])
output_df = dbg.transform("pipe", input_df, cmd=cmd, inputformatter='vcf', in_vcfheader=input_vcf,
                          outputformatter='csv', out_header='true', out_delimiter=' ')

The input DataFrame can be read directly from a VCF file or from a Delta table containing VCF rows.
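For example, the input DataFrame and the input_vcf header path could be set up as follows. This is a minimal sketch; the paths are placeholders, and the "vcf" data source name is an assumption that may differ by runtime version.

# Minimal sketch: read the piped input either directly from a block-gzipped VCF
# or from a Delta table previously written with the same VCF rows.
# Paths are placeholders; the "vcf" data source name is an assumption.
input_vcf = "dbfs:/mnt/<mount-point>/<saige-input-dir>/input.vcf.bgz"
input_df = spark.read.format("vcf").load(input_vcf)

# Alternative: read the same rows from a Delta table
# input_df = spark.read.format("delta").load("dbfs:/mnt/<mount-point>/genotypes.delta")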

Single-variant association test (SAIGE)

The following cell shows an example SAIGE script to be piped. The options are explained in the single-variant association test docs.

#!/bin/sh
set -e

# Write the piped VCF rows to a temporary file, then bgzip-compress and index it
export tmpdir=$(mktemp -d -t vcf.XXXXXX)
cat - > ${tmpdir}/input.vcf
bgzip -c ${tmpdir}/input.vcf > ${tmpdir}/input.vcf.gz
tabix -p vcf ${tmpdir}/input.vcf.gz

# Run the single-variant tests; redirect the Rscript output to stderr so that
# only the results emitted below reach stdout for the Pipe Transformer
Rscript /opt/SAIGE/extdata/step2_SPAtests.R \
    --vcfFile=${tmpdir}/input.vcf.gz \
    --vcfFileIndex=${tmpdir}/input.vcf.gz.tbi \
    --SAIGEOutputFile=${tmpdir}/output.txt \
    --sampleFile=<local-dir>/input/sampleIds.txt \
    --GMMATmodelFile=<local-dir>/step-1/output.rda \
    --varianceRatioFile=<local-dir>/step-1/output.varianceRatio.txt \
    --vcfField=GT \
    --chrom=22 \
    --minMAF=0.0001 \
    --minMAC=1 \
    --numLinesOutput=2 \
    >&2

cat ${tmpdir}/output.txt
rm -rf ${tmpdir}
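After the pipe completes, you may want to persist the per-variant results for downstream analysis. The following is a minimal sketch that writes the output DataFrame to Delta Lake; the output path is a placeholder.

# Minimal sketch: persist the single-variant association results.
# The output path is a placeholder.
output_df.write.format("delta").mode("overwrite").save("dbfs:/mnt/<mount-point>/gwas_results.delta")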

Region-based association test (SAIGE-GENE)

Before piping, partition the input DataFrame by region or gene so that each partition contains the sorted, de-duplicated set of VCF rows relevant to that region. The regions and their corresponding variants should match those in the group file referenced in the SAIGE-GENE script. Regions can be derived from any source, such as gene annotations from the SnpEff pipeline.

# Join genotype rows to their region annotations, gather all rows for a gene
# into a single partition, and sort the rows within each partition by position
input_df = genotype_df.join(region_df, ["contigName", "start", "end"]) \
  .repartition("geneName") \
  .select("contigName", "start", "end", "referenceAllele", "alternateAlleles", "genotypes") \
  .sortWithinPartitions("contigName", "start")

The following cell shows an example SAIGE-GENE script to be piped. The options are explained in the region-based association test docs.

#!/bin/sh
set -e

# Write the piped VCF rows to a temporary file, dropping duplicate rows
# (compared on their first 100 characters), then bgzip-compress and index it
export tmpdir=$(mktemp -d -t vcf.XXXXXX)
uniq -w 100 - > ${tmpdir}/input.vcf
bgzip -c ${tmpdir}/input.vcf > ${tmpdir}/input.vcf.gz
tabix -p vcf ${tmpdir}/input.vcf.gz

# The group file maps each gene or region to its variants and must match the
# regions used to partition the input DataFrame
Rscript /opt/SAIGE/extdata/step2_SPAtests.R \
    --vcfFile=${tmpdir}/input.vcf.gz \
    --vcfFileIndex=${tmpdir}/input.vcf.gz.tbi \
    --SAIGEOutputFile=${tmpdir}/output.txt \
    --groupFile=${tmpdir}/groupFile.txt \
    --sampleFile=<local-dir>/input/sampleIds.txt \
    --GMMATmodelFile=<local-dir>/step-1/output.rda \
    --varianceRatioFile=<local-dir>/step-1/output.varianceRatio.txt \
    --sparseSigmaFile=<local-dir>/step-1/output.varianceRatio.txt_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseSigma.mtx \
    --vcfField=GT \
    --chrom=22 \
    --minMAF=0 \
    --minMAC=0.5 \
    --maxMAFforGroupTest=0.5 \
    --numLinesOutput=2 \
    >&2

cat ${tmpdir}/output.txt
rm -rf ${tmpdir}

After piping, the output DataFrame may contain more than one result row per gene. Disambiguate the results by keeping, for each gene, the row computed from the largest set of marker IDs.

from pyspark.sql import Window
from pyspark.sql.functions import max as sql_max, size, split

window_spec = Window.partitionBy("Gene")

# For each gene, keep the result row computed from the largest set of marker IDs
assoc_df = output_df.withColumn("numMarkerIDs", size(split("markerIDs", ";"))) \
  .withColumn("maxNumMarkerIDs", sql_max("numMarkerIDs").over(window_spec)) \
  .filter("numMarkerIDs = maxNumMarkerIDs") \
  .drop("numMarkerIDs", "maxNumMarkerIDs") \
  .drop_duplicates(["Gene"])
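Finally, you can preview the strongest gene-level associations. The following is a minimal sketch that assumes the SAIGE-GENE output includes a Pvalue column.

# Minimal sketch: show the genes with the smallest p-values.
# The Pvalue column name is an assumption about the SAIGE-GENE output schema.
display(assoc_df.orderBy("Pvalue").limit(10))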

Example notebooks

Example null logistic mixed model fit notebook

Example SAIGE-GENE notebook