Variant Quality Control

Databricks Runtime HLS includes a variety of tools for variant quality control.

Tip

This topic uses the terms “variant” or “variant data” to refer to single nucleotide variants and short indels.

You can calculate quality control statistics on your variant data using Spark SQL functions, which can be expressed in Python, R, Scala, or SQL.

Function: hardy_weinberg
Arguments: The genotypes array. This function assumes that the variant has been converted to a biallelic representation.
Returns: A struct with two elements: the expected heterozygous frequency according to Hardy-Weinberg equilibrium and the associated p-value.

Function: call_summary_stats
Arguments: The genotypes array.
Returns: A struct containing the following summary stats:

  • callRate: The fraction of samples with a called genotype
  • nCalled: The number of samples with a called genotype
  • nUncalled: The number of samples with a missing or uncalled genotype, as represented by a ‘.’ in a VCF or -1 in a DataFrame
  • nHet: The number of heterozygous samples
  • nHomozygous: An array with the number of samples that are homozygous for each allele. The 0th element describes how many samples are hom-ref.
  • nNonRef: The number of samples that are not hom-ref
  • nAllelesCalled: An array with the number of times each allele was seen
  • alleleFrequencies: An array with the frequency for each allele

Function: dp_summary_stats
Arguments: The genotypes array.
Returns: A struct containing the min, max, mean, and sample standard deviation for genotype depth (DP in the VCF v4.2 specification) across all samples.

Function: gq_summary_stats
Arguments: The genotypes array.
Returns: A struct containing the min, max, mean, and sample standard deviation for genotype quality (GQ in the VCF v4.2 specification) across all samples.
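
To make the field definitions above concrete, here is a minimal pure-Python sketch of what the call and depth summaries compute for a single variant. It is an illustration only, not the Databricks implementation. The helper names `call_summary` and `depth_summary`, the input layout (one list of allele indices per sample), the `stdDev` key, and the handling of partially missing calls and missing depths are all assumptions made for this example.

```python
from statistics import mean, stdev

MISSING = -1  # missing allele marker, matching the -1 used in a DataFrame


def call_summary(genotypes, n_alleles=2):
    """Summarize genotype calls for one variant (hypothetical helper).

    genotypes: one list of allele indices per sample (0 = reference).
    """
    n_called = n_uncalled = n_het = n_non_ref = 0
    n_homozygous = [0] * n_alleles
    n_alleles_called = [0] * n_alleles

    for calls in genotypes:
        # Assumption: a sample with no calls or any missing allele is uncalled.
        if not calls or MISSING in calls:
            n_uncalled += 1
            continue
        n_called += 1
        for allele in calls:
            n_alleles_called[allele] += 1
        if len(set(calls)) == 1:
            n_homozygous[calls[0]] += 1  # index 0 counts hom-ref samples
        else:
            n_het += 1
        if any(allele != 0 for allele in calls):
            n_non_ref += 1

    n_samples = n_called + n_uncalled
    total_alleles = sum(n_alleles_called)
    return {
        "callRate": n_called / n_samples if n_samples else 0.0,
        "nCalled": n_called,
        "nUncalled": n_uncalled,
        "nHet": n_het,
        "nHomozygous": n_homozygous,
        "nNonRef": n_non_ref,
        "nAllelesCalled": n_alleles_called,
        "alleleFrequencies": [
            count / total_alleles if total_alleles else 0.0
            for count in n_alleles_called
        ],
    }


def depth_summary(depths):
    """Min, max, mean, and sample standard deviation (hypothetical helper)."""
    # Assumption: samples without a depth value (None) are skipped.
    values = [d for d in depths if d is not None]
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "stdDev": stdev(values),  # n - 1 denominator; needs at least 2 values
    }


# Four samples at a biallelic site: hom-ref, het, hom-alt, and uncalled.
stats = call_summary([[0, 0], [0, 1], [1, 1], [MISSING, MISSING]])
depth = depth_summary([10, 20, 30, None])
```

For these inputs, three of the four samples are called, so `stats["callRate"]` is 0.75 and both allele frequencies are 0.5. The same min, max, mean, and sample standard deviation pattern applies to genotype quality.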