Databricks offers a powerful platform for distributed data processing. However, managing dependencies for Python jobs running on Databricks can be challenging. This post explores the issues and presents a practical solution that has been successfully implemented in real-world projects.
The Databricks Runtime
Databricks clusters come pre-installed with a vast collection of libraries in specific versions, packaged as Databricks Runtimes (DBRs). For example, the latest DBR 14.3 includes Python 3.10.12, Spark 3.5.0, and pandas 1.5.3, among many other libraries.
While these pre-installed libraries are convenient for interactive clusters and for jobs that only require the DBR's default packages, they can conflict with best practices for dependency management using tools like `pip-compile` in many real-world applications.
Here’s a typical workflow for dependency management in standard Python development:
- Maintain clear requirements: Define abstract package dependencies in a `requirements.in` (or `pyproject.toml` or `setup.py`) file. This file specifies the libraries needed without pinning them to specific versions.
- Compile concrete versions: Use a tool like `pip-compile` to read `requirements.in` and generate a `requirements.txt` file with specific versions for both direct and indirect dependencies, ensuring compatibility and reproducibility across development, testing, and production environments.
- Automate updates: Leverage automation tools like CI/CD pipelines to regularly update `requirements.txt` with the latest compatible versions. This helps you stay up to date with bug fixes and security patches.
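As a minimal sketch of the first two steps (the package names here are hypothetical, not a recommendation):

```bash
# Create the abstract dependency list with no version pins.
cat > requirements.in <<'EOF'
requests
pandas
EOF

# pip-compile ships with pip-tools.
pip install pip-tools

# Resolve direct and transitive dependencies and pin them in requirements.txt.
pip-compile requirements.in
```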
This workflow is often recommended for Python applications such as production batch jobs, but it is not well-suited for package development.
For jobs running on Databricks clusters with a pre-installed DBR, it is ideal to stay close to the provided DBR libraries for compatibility and faster cluster startup times. However, there is often a need to add additional libraries or upgrade specific libraries to access bug fixes or new features.
To achieve this, we need to adjust the standard approach for dependency management to keep most libraries at their DBR versions while allowing different versions of specified libraries. Additionally, we usually do not want to include everything from the DBR in a `requirements.txt` file, as this can significantly slow down CI pipeline steps that are not running on a Databricks cluster.
Constrained Dependencies
`pip-compile`'s constraint files offer an elegant solution. These files allow you to specify version constraints for libraries without directly installing them. Essentially, they tell `pip-compile`: "We don't need these libraries per se, but if we need any of them to resolve the requirements from `requirements.in`, please use this version."
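To make this concrete, here is a minimal sketch; the package names and pins are illustrative, not taken from an actual DBR:

```bash
# A constraints file uses the same pinned syntax as requirements.txt,
# but its entries are only consulted during resolution, never installed outright.
cat > constraints.txt <<'EOF'
pandas==1.5.3
numpy==1.23.5
EOF

# If resolving requirements.in pulls in pandas or numpy, pip-compile pins them
# to the constrained versions; constrained packages that are not needed are ignored.
pip-compile -c constraints.txt requirements.in
```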
Here’s a modified workflow for Databricks:
- Download DBR requirements: Databricks provides a `requirements.txt` file for each DBR, linked from the DBR's release notes page. For example, the file for DBR 14.3 is at https://docs.databricks.com/en/_extras/documents/requirements-14.3.txt.
- Convert to a constraint file: Modify the DBR requirements by removing the libraries whose versions you want to manage independently. For example, if you want to use DBR 14.3 but need an older version of TensorFlow, you can execute the following command to pin all versions except TensorFlow to the DBR's versions:

  ```bash
  curl https://docs.databricks.com/en/_extras/documents/requirements-14.3.txt | grep -v '^tensorflow' > constraints-dbr-14-3.txt
  ```

- Compile with constraints: Use `pip-compile` to generate your `requirements.txt` file, referencing the constraint file to ensure compatibility with the broader DBR ecosystem:

  ```bash
  pip-compile -c constraints-dbr-14-3.txt requirements.in
  ```

- Automate updates: Integrate automation tools to regularly update `constraints-dbr-14-3.txt`, because Databricks sometimes includes bug-fix updates in their DBRs. Then update the `requirements.txt` based on the new constraint file (see the sketch after this list).
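Putting the steps together, a scheduled CI job could refresh both files in one go. Here is a minimal sketch; the script name and the TensorFlow filter are assumptions carried over from the example above:

```bash
#!/usr/bin/env bash
# refresh-constraints.sh -- regenerate the constraint file and recompile requirements.
set -euo pipefail

DBR_REQUIREMENTS_URL="https://docs.databricks.com/en/_extras/documents/requirements-14.3.txt"
CONSTRAINTS_FILE="constraints-dbr-14-3.txt"

# Re-download the DBR requirements and drop the libraries we manage ourselves.
curl -fsSL "$DBR_REQUIREMENTS_URL" | grep -v '^tensorflow' > "$CONSTRAINTS_FILE"

# Recompile the concrete requirements against the refreshed constraints.
pip-compile -c "$CONSTRAINTS_FILE" --output-file requirements.txt requirements.in
```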
You may also need to remove indirect dependencies of the libraries you are changing from the constraints file. `pip-compile` will report such version incompatibilities during compilation. Usually, you can reach a resolvable setup after a few rounds of removing libraries from the constraints file.
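For example, you can extend the `grep` filter to drop several entries at once; the package names below are illustrative, and the ones you actually need to remove will come from `pip-compile`'s error output:

```bash
# Remove tensorflow and (illustrative) related packages from the constraints.
curl -fsSL https://docs.databricks.com/en/_extras/documents/requirements-14.3.txt \
  | grep -vE '^(tensorflow|keras|tensorboard)' > constraints-dbr-14-3.txt
```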
This approach allows you to leverage the stability of pre-installed DBR libraries while maintaining control over specific dependencies, keeping your Databricks Python jobs both flexible and maintainable.