Create a Genomics Delta Lake Table


The Azure Databricks Spark SQL VCF reader requires Databricks Runtime HLS, which is in Beta. Sign up for access.

Genomics data is usually stored in specialized flat-file formats such as VCF or BGEN.

The notebook below shows how to convert a VCF file into a genomics Delta Lake table using Python and Databricks Runtime HLS. R, Scala, and SQL are also supported.
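As a rough illustration of the conversion, the sketch below reads a VCF through the Spark SQL VCF reader and saves the result in Delta format. It assumes the reader is registered under the data source name "com.databricks.vcf" on Databricks Runtime HLS. The helper name vcf_to_delta and both paths are hypothetical, and spark stands for the SparkSession that Databricks notebooks predefine.

```python
def vcf_to_delta(spark, vcf_path, delta_path, mode="overwrite"):
    """Read a VCF with the Spark SQL VCF reader and save it as a Delta Lake table.

    `spark` is an existing SparkSession (predefined in Databricks notebooks).
    Returns the DataFrame that was read, so callers can inspect its schema.
    """
    # Assumption: the VCF data source name on Databricks Runtime HLS.
    df = spark.read.format("com.databricks.vcf").load(vcf_path)

    # Write the rows out in Delta format at the target location.
    df.write.format("delta").mode(mode).save(delta_path)
    return df


# Example call inside a notebook (hypothetical paths):
# vcf_to_delta(spark, "/mnt/genomics/cohort.vcf", "/mnt/genomics/delta/cohort")
```

Once written, the table can be loaded back with spark.read.format("delta").load(delta_path) and queried like any other DataFrame.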

Delta Lake tables support queries with latencies on the order of seconds, performant range joins (similar to the single-node bioinformatics tool bedtools intersect), aggregate analyses such as calculating summary statistics, and machine learning or deep learning.
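To make the range-join idea concrete, here is a minimal pure-Python sketch of the overlap condition that an intersect-style join evaluates. It is not how Spark or bedtools implements the join: the nested loop compares every pair, which is exactly the cost a performant range join avoids. The function name, the sample intervals, and the use of 0-based half-open coordinates (as in BED files) are assumptions made for illustration.

```python
def intersect(a_intervals, b_intervals):
    """Return (a, b) pairs on the same contig whose [start, end) ranges overlap.

    Each interval is a (contig, start, end) tuple with a half-open range.
    """
    pairs = []
    for a in a_intervals:
        for b in b_intervals:
            same_contig = a[0] == b[0]
            # Two half-open ranges overlap when each starts before the other ends.
            overlaps = a[1] < b[2] and b[1] < a[2]
            if same_contig and overlaps:
                pairs.append((a, b))
    return pairs


# Hypothetical single-base variants and gene regions.
variants = [("chr1", 100, 101), ("chr1", 500, 501), ("chr2", 150, 151)]
genes = [("chr1", 50, 200), ("chr2", 100, 120)]

hits = intersect(variants, genes)
print(hits)  # [(('chr1', 100, 101), ('chr1', 50, 200))]
```

Only the first variant falls inside a gene region. The same predicate, written as a join condition over two tables, is what a distributed range join applies at scale.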


We recommend ingesting VCF files into Delta Lake tables once data volumes exceed 1,000 samples, 10 billion genotypes, or 1 terabyte.
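The thresholds above can be checked with a small helper like the hypothetical one below. It reads the recommendation as "any one threshold exceeded", estimates the genotype count as samples times variants (which assumes every sample has a call at every site), and treats a terabyte as 10^12 bytes. All three are assumptions rather than definitions from the product.

```python
SAMPLE_THRESHOLD = 1_000
GENOTYPE_THRESHOLD = 10_000_000_000  # 10 billion
BYTE_THRESHOLD = 10**12              # 1 terabyte, assuming decimal units


def should_ingest(n_samples, n_variants, size_bytes):
    """Return True if any of the recommended volume thresholds is exceeded."""
    # Assumption: one genotype per sample per variant site.
    n_genotypes = n_samples * n_variants
    return (
        n_samples > SAMPLE_THRESHOLD
        or n_genotypes > GENOTYPE_THRESHOLD
        or size_bytes > BYTE_THRESHOLD
    )


# 500 samples x 30 million variants = 15 billion genotypes, so this is True.
print(should_ingest(500, 30_000_000, 50 * 10**9))
```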