Dependency Management for Python Applications on Databricks
Databricks offers a powerful platform for distributed data processing. However, managing dependencies for Python jobs running on Databricks can be tricky. This post explores the challenges and a practical solution used successfully in real-world projects.
The Databricks Runtime
Databricks clusters come pre-installed with a vast collection of libraries in specific versions,
packaged as Databricks Runtimes (DBRs). For example, the current DBR 14.3 contains
Python 3.10.12, Spark 3.5.0, and pandas 1.5.3, among many other libraries.
While convenient for interactive clusters as well as for jobs that only need requirements contained in the DBR, these fixed versions clash with dependency-management best practices built around tools like pip-compile in many real-life applications.
Here’s a typical workflow for dependency management in standard Python development:
- Maintain clear requirements: We define abstract package dependencies in a `requirements.in` (or `pyproject.toml` or `setup.py`) file. This file specifies the libraries we need without pinning them to specific versions.
- Compile concrete versions: A tool like `pip-compile` reads `requirements.in` and generates a `requirements.txt` file with specific versions for both direct and indirect dependencies, ensuring compatibility and reproducibility between development, testing, and production environments.
- Automate updates: We leverage automation tools like CI/CD pipelines to regularly update `requirements.txt` with the latest compatible versions. This helps us stay up-to-date with bug fixes and security patches.
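The steps above can be sketched as follows. The package names in `requirements.in` are illustrative, and `pip-compile` requires the pip-tools package plus network access, so this sketch only runs it when it is available:

```shell
# Step 1: abstract dependencies -- names only, no pinned versions
# (pandas and pyarrow are illustrative examples)
cat > requirements.in <<'EOF'
pandas
pyarrow
EOF

# Step 2: compile to a fully pinned requirements.txt.
# pip-compile ships with pip-tools; skip it gracefully if absent.
if command -v pip-compile >/dev/null 2>&1; then
    pip-compile --output-file requirements.txt requirements.in
fi

# Step 3 (automation) would re-run pip-compile on a schedule, e.g. in CI.
```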
This workflow is often recommended for Python applications such as production batch jobs, but it is not well suited for package development.
Now, for jobs running on Databricks clusters with a pre-installed DBR, we ideally want to stay close to the provided DBR libraries for compatibility and faster cluster startup times. However, we often need the flexibility to add additional libraries or upgrade specific libraries to get access to bug fixes or new features.
To make this possible, we need to adjust the standard approach for dependency management: keep
most libraries at their DBR versions, but allow different versions of selected libraries.
Also, we usually do not want to include everything from the DBR in a `requirements.txt` file,
as this can dramatically slow down e.g. CI pipeline steps that do not run on a Databricks cluster.
Constrained Dependencies
`pip-compile`'s constraint files offer an elegant solution. These files allow you to specify
version constraints for libraries without directly installing them. Essentially, they tell
`pip-compile`: "We don't need these libraries per se, but if any of them is needed to resolve the
requirements from `requirements.in`, please use this version."
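A minimal illustration of these semantics (file contents are made up for the example):

```shell
# A constraints file pins versions without requiring the packages
cat > constraints.txt <<'EOF'
pandas==1.5.3
numpy==1.23.5
EOF

# requirements.in only asks for pandas
cat > requirements.in <<'EOF'
pandas
EOF

# Running `pip-compile -c constraints.txt requirements.in` would pin
# pandas to 1.5.3 (and numpy to 1.23.5 only if pandas pulls it in);
# numpy is never installed just because it appears in constraints.txt.
```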
Here’s a modified workflow for Databricks:
- Download DBR requirements: Databricks provides a `requirements.txt` file for each DBR, linked on the DBR's page. For example, the file for DBR 14.3 is available at https://docs.databricks.com/en/_extras/documents/requirements-14.3.txt.
- Convert to a constraint file: Modify the DBR requirements by removing the libraries whose versions you want to manage independently. For example, if we want to use DBR 14.3 but need an older version of tensorflow, we can execute the following to pin all versions except tensorflow's to the DBR's versions:

  ```shell
  curl https://docs.databricks.com/en/_extras/documents/requirements-14.3.txt | grep -v ^tensorflow > constraints-dbr-14-3.txt
  ```

- Compile with constraints: Use `pip-compile` to generate your `requirements.txt` file, referencing the constraint file to ensure compatibility with the broader DBR ecosystem:

  ```shell
  pip-compile -c constraints-dbr-14-3.txt requirements.in
  ```

- Automate updates: Integrate automation tools to regularly update `constraints-dbr-14-3.txt`, because Databricks sometimes includes bug-fix updates in their DBRs. Then update `requirements.txt` based on the new constraint file.
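The conversion step can be tried offline with an abbreviated stand-in for the downloaded DBR requirements file (the versions below are only illustrative):

```shell
# Stand-in for the downloaded DBR requirements file
cat > requirements-dbr.txt <<'EOF'
numpy==1.23.5
pandas==1.5.3
tensorflow==2.14.1
EOF

# Drop tensorflow so pip-compile is free to pick another version;
# everything else stays pinned to the DBR versions.
grep -v '^tensorflow' requirements-dbr.txt > constraints-dbr.txt
```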
You may also need to remove indirect dependencies of the libraries you are changing from the
constraints file. `pip-compile` will report such version incompatibilities during compilation.
Usually, you reach a resolvable setup after a few rounds of removing libraries from the
constraints file.
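For example, if `pip-compile` reports a conflict with one of tensorflow's pinned transitive dependencies (tensorboard serves as a hypothetical example here, as do the version numbers), the filter can simply be widened:

```shell
# Stand-in DBR requirements including a transitive dependency of tensorflow
cat > requirements-dbr.txt <<'EOF'
pandas==1.5.3
tensorflow==2.14.1
tensorboard==2.14.1
EOF

# Remove tensorflow and its conflicting transitive dependency
# from the constraints in one pass
grep -vE '^(tensorflow|tensorboard)==' requirements-dbr.txt > constraints-dbr.txt
```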
This approach allows you to leverage the stability of pre-installed DBR libraries while maintaining control over specific dependencies for optimal flexibility and maintainability in your Databricks Python jobs.