Overlap joins, which report the shared intervals between two overlapping features, are a common operation in genomic analysis. A canonical example is identifying the aligned genomic reads which overlap with a set of target sites. In a single node setting, this is often accomplished using the bedtools intersect command line tool.
Databricks Runtime HLS automatically optimizes for overlap joins in Spark SQL.
To use the overlap join optimization, import the
overlaps function and pass in four DataFrame
columns corresponding to the start and end of the pair of features. We define coordinates according
to half-open intervals, in which the start is included and the end is excluded.
from hls.expressions import overlaps reads.join(targets, overlaps(reads.reads_start, reads.reads_end, targets.targets_start, targets.targets_end))
import org.apache.spark.sql.hls.dsl.expressions.overlaps reads.join(targets, overlaps(reads("reads_start"), reads("reads_end"), targets("targets_start"), targets("targets_end")))
The optimization for overlap join is based on creating bins in the feature range. The default bin size is set to 5000 base pairs by default. If you are working on an application with significantly longer targets, we suggest increasing your bin size as follows.