Hail is a library built on Spark for analyzing large genomic datasets. Hail 0.2 is integrated into the Databricks Runtime HLS to simplify and scale your genomic analyses.
Hail 0.2 and its integration with Azure Databricks are both in Beta. Interfaces inside Hail are likely to change, as are properties of the Azure Databricks environment, such as which Python packages are available by default. Pricing is also subject to change before general availability.
To create a cluster with Hail installed:
In the Custom Spark Version field, paste in the version key for the Databricks Runtime HLS.
Set the following environment variable:
This environment variable causes the cluster to launch with Hail 0.2, its dependencies, and Python 3.6 installed.
For the most part, Hail 0.2 code in Azure Databricks works as described in the Hail documentation. However, a few modifications are necessary for the Azure Databricks environment.
When initializing Hail, you must pass in the pre-created SparkContext and mark the initialization as idempotent:

    import hail as hl
    hl.init(sc, idempotent=True)
Hail uses the Bokeh library to create plots. The show function built into Bokeh does not work in Azure Databricks. To display a Bokeh plot generated by Hail, you can run a command like:
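    from bokeh.embed import file_html
    from bokeh.resources import CDN

    # Build the plot with Hail (mt is an existing Hail MatrixTable).
    plot = hl.plot.histogram(mt.DP, range=(0, 30), bins=30, title='DP Histogram', legend='DP')
    # Render the plot to a standalone HTML page that loads Bokeh from its CDN.
    html = file_html(plot, CDN, "Chart")
    # Display the HTML in the notebook cell.
    displayHTML(html)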
See Bokeh in Python Notebooks for more information.
- When Hail support is enabled, your cluster uses Python 3.6, so notebooks written against different versions of Python may not work.
- When Hail support is enabled, fewer Python libraries are installed by default. You can still use the Libraries feature to install new libraries.
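Both limitations can be checked from a notebook cell before running an existing notebook on a Hail cluster. The following is a minimal sketch using only the Python standard library. The helper names `python_matches` and `missing_packages` and the `required` package list are illustrative assumptions, not part of Hail or Azure Databricks.

```python
import importlib.util
import sys

# Python version the Hail-enabled cluster is documented to use.
EXPECTED_PYTHON = (3, 6)


def python_matches(expected=EXPECTED_PYTHON, actual=None):
    """Return True if the major.minor version in `actual` equals `expected`.

    `actual` defaults to the running interpreter's version.
    """
    actual = sys.version_info if actual is None else actual
    return tuple(actual[:2]) == tuple(expected)


def missing_packages(names):
    """Return the top-level package names in `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list: replace with the packages your notebook actually imports.
required = ["hail", "bokeh", "pandas"]

print("Python", ".".join(map(str, sys.version_info[:3])),
      "matches expected:", python_matches())
print("Missing packages:", missing_packages(required))
```

Any package reported as missing can be installed with the Libraries feature.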
After you’ve set up a Hail cluster, try out the Hail overview notebook.