Cluster tools for running Dask on Databricks multi-node clusters.
To launch a Dask cluster on Databricks you need to create an init script with the following contents and configure your multi-node cluster to use it.
#!/bin/bash
# Install Dask + Dask Databricks
/databricks/python/bin/pip install --upgrade dask[complete] dask-databricks
# Start Dask cluster components
dask databricks run
Then from your Databricks Notebook you can quickly connect a Dask Client
to the scheduler running on the Spark Driver Node.
import dask_databricks
client = dask_databricks.get_client()
Now you can submit work from your notebook to the multi-node Dask cluster.
def inc(x):
return x + 1
x = client.submit(inc, 10)
x.result()
You can access the Dask dashboard via the Databricks driver-node proxy. The link can be found in Client
or DatabricksCluster
repr or via client.dashboard_link
.
>>> print(client.dashboard_link)
https://dbc-dp-xxxx.cloud.databricks.com/driver-proxy/o/xxxx/xx-xxx-xxxx/8087/status
Releases of this project are automated using GitHub Actions and the pypa/gh-action-pypi-publish
action.
To create a new release push a tag to the upstream repo in the format x.x.x
. The package will be built and pushed to PyPI automatically and then later picked up by conda-forge.
# Make sure you have an upstream remote
git remote add upstream git@github.com:dask-contrib/dask-databricks.git
# Create a tag and push it upstream
git tag x.x.x && git push upstream main --tags