Create a Genomics Delta Lake

Beta

The Azure Databricks Spark SQL VCF reader requires Databricks Runtime HLS, which is in Beta. Sign up for access.

Genomics data is usually stored in specialized flat-file formats such as VCF or BGEN.

The notebook below shows how to ingest a VCF file into a genomics Delta Lake table using Python and Databricks Runtime HLS. R, Scala, and SQL are also supported.
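As a rough illustration of the ingest step, here is a minimal Python sketch rather than the contents of the notebook itself. It assumes the VCF reader is registered under a Spark data source name (passed here as `vcf_format`, with `"vcf"` as a guessed default that may differ in your runtime version) and that `spark` is the session a Databricks notebook provides. The function name and the paths in the usage comment are hypothetical.

```python
def vcf_to_delta(spark, vcf_path, delta_path, vcf_format="vcf", mode="overwrite"):
    """Read a VCF with the Spark SQL VCF reader and save it in Delta format.

    `vcf_format` is an assumption: use whatever data source name your
    Databricks Runtime HLS version registers for the VCF reader.
    """
    # Load the VCF into a DataFrame through the configured data source.
    df = spark.read.format(vcf_format).load(vcf_path)

    # Persist the DataFrame as a Delta table at the target location.
    df.write.format("delta").mode(mode).save(delta_path)
    return df


# Hypothetical usage inside a notebook, where `spark` is predefined:
# vcf_to_delta(spark, "/mnt/genomics/cohort.vcf.gz", "/mnt/genomics/delta/cohort")
```

Passing the session and format name as parameters keeps the sketch independent of any particular runtime version; adjust both to match your cluster.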

You can use Delta tables for queries with latencies on the order of seconds, performant range joins (similar to the single-node bioinformatics tool bedtools intersect), aggregate analyses such as calculating summary statistics, and machine learning or deep learning.
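To make the range-join idea concrete, the following single-node Python sketch shows the overlap condition such a join evaluates. It is only an illustration of the predicate, not how Spark executes the join. It assumes each interval is a `(contig, start, end)` tuple with a half-open `[start, end)` coordinate range; the function names are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when half-open intervals [a_start, a_end) and [b_start, b_end) share a position."""
    return a_start < b_end and b_start < a_end


def intersect(variants, regions):
    """Naive range join: pair each variant with every overlapping region on the same contig.

    Both inputs are lists of (contig, start, end) tuples (an assumed layout).
    """
    return [
        (v, r)
        for v in variants
        for r in regions
        if v[0] == r[0] and overlaps(v[1], v[2], r[1], r[2])
    ]


variants = [("1", 100, 101), ("1", 500, 501), ("2", 100, 101)]
regions = [("1", 50, 150), ("2", 0, 100)]
pairs = intersect(variants, regions)
```

This nested loop compares every variant with every region, which is the cost a distributed range join is meant to avoid on large tables.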

Tip

We recommend ingesting VCF files into Delta tables once your data exceeds 1,000 samples, 10 billion genotypes, or 1 terabyte.
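The thresholds in this tip can be expressed as a small check. The sketch below makes two assumptions that are not stated above: the genotype count is estimated as samples multiplied by variant records, and 1 terabyte is taken as 10^12 bytes. The function and constant names are hypothetical.

```python
SAMPLE_THRESHOLD = 1_000
GENOTYPE_THRESHOLD = 10_000_000_000   # 10 billion
SIZE_THRESHOLD_BYTES = 10**12         # assumes a decimal terabyte


def should_ingest_to_delta(n_samples, n_variants, size_bytes):
    """Return True when any of the recommended volume thresholds is exceeded."""
    # Assumption: one genotype per sample per variant record.
    n_genotypes = n_samples * n_variants
    return (
        n_samples > SAMPLE_THRESHOLD
        or n_genotypes > GENOTYPE_THRESHOLD
        or size_bytes > SIZE_THRESHOLD_BYTES
    )
```

For example, 500 samples with 30 million variant records is an estimated 15 billion genotypes, so the check passes on the genotype threshold even though the sample count is below 1,000.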

VCF to Delta Lake table notebook

This notebook is too large to display inline.