Databricks offers a powerful platform for distributed data processing. However, managing dependencies for Python jobs running on Databricks can be challenging. This post explores the issues and presents a practical solution that has been successfully implemented in real-world projects.
The Databricks Runtime
Databricks clusters come pre-installed with a vast collection of libraries in specific versions, packaged as Databricks Runtimes (DBRs). For example, the latest DBR 14.3 includes Python 3.10.12, Spark 3.5.0, and pandas 1.5.3, among many other libraries.
While these pre-installed libraries are convenient for interactive clusters and for jobs that only require the DBR's default packages, they can conflict with best practices for dependency management using tools like `pip-compile` in many real-world applications.
Here’s a typical workflow for dependency management in standard Python development:
- Maintain clear requirements: Define abstract package dependencies in a `requirements.in` (or `pyproject.toml` or `setup.py`) file. This file specifies the libraries needed without pinning them to specific versions.
- Compile concrete versions: Use a tool like `pip-compile` to read `requirements.in` and generate a `requirements.txt` file with specific versions for both direct and indirect dependencies, ensuring compatibility and reproducibility across development, testing, and production environments.
- Automate updates: Leverage automation tools like CI/CD pipelines to regularly update `requirements.txt` with the latest compatible versions. This helps you stay up to date with bug fixes and security patches.
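As a minimal sketch of the first two steps (the package names here are hypothetical, not a recommendation):

```bash
# Create the abstract dependency list with no version pins.
cat > requirements.in <<'EOF'
requests
pandas
EOF

# pip-compile ships with pip-tools.
pip install pip-tools

# Resolve direct and transitive dependencies and pin them in requirements.txt.
pip-compile requirements.in
```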
This workflow is often recommended for Python applications such as production batch jobs, but it is not well-suited for package development.
For jobs running on Databricks clusters with a pre-installed DBR, it is ideal to stay close to the provided DBR libraries for compatibility and faster cluster startup times. However, there is often a need to add additional libraries or upgrade specific libraries to access bug fixes or new features.
To achieve this, we need to adjust the standard approach for dependency management to keep most libraries at their DBR versions while allowing different versions of specified libraries. Additionally, we usually do not want to include everything from the DBR in a `requirements.txt` file, as this can significantly slow down CI pipeline steps that are not running on a Databricks cluster.
Constrained Dependencies
`pip-compile`'s constraint files offer an elegant solution. These files allow you to specify version constraints for libraries without directly installing them. Essentially, they tell `pip-compile`: "We don't need these libraries per se, but if we need any of them to resolve the requirements from `requirements.in`, please use this version."
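To make this concrete, here is a minimal sketch; the package names and pins are illustrative, not taken from an actual DBR:

```bash
# A constraints file uses the same pinned syntax as requirements.txt,
# but its entries are only consulted during resolution, never installed outright.
cat > constraints.txt <<'EOF'
pandas==1.5.3
numpy==1.23.5
EOF

# If resolving requirements.in pulls in pandas or numpy, pip-compile pins them
# to the constrained versions; constrained packages that are not needed are ignored.
pip-compile -c constraints.txt requirements.in
```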
Here’s a modified workflow for Databricks:
- Download DBR requirements: Databricks provides a `requirements.txt` file for each DBR, linked from the DBR's release notes page. For example, the file for DBR 14.3 is at https://docs.databricks.com/en/_extras/documents/requirements-14.3.txt.
- Convert to a constraint file: Modify the DBR requirements by removing the libraries whose versions you want to manage independently. For example, if you want to use DBR 14.3 but need an older version of TensorFlow, you can execute the following command to pin all versions except TensorFlow to the DBR's versions:

  ```bash
  curl https://docs.databricks.com/en/_extras/documents/requirements-14.3.txt | grep -v '^tensorflow' > constraints-dbr-14-3.txt
  ```

- Compile with constraints: Use `pip-compile` to generate your `requirements.txt` file, referencing the constraint file to ensure compatibility with the broader DBR ecosystem:

  ```bash
  pip-compile -c constraints-dbr-14-3.txt requirements.in
  ```

- Automate updates: Integrate automation tools to regularly update `constraints-dbr-14-3.txt`, because Databricks sometimes includes bug-fix updates in their DBRs. Then update the `requirements.txt` based on the new constraint file (see the sketch after this list).
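Putting the steps together, a scheduled CI job could refresh both files in one go. Here is a minimal sketch; the script name and the TensorFlow filter are assumptions carried over from the example above:

```bash
#!/usr/bin/env bash
# refresh-constraints.sh -- regenerate the constraint file and recompile requirements.
set -euo pipefail

DBR_REQUIREMENTS_URL="https://docs.databricks.com/en/_extras/documents/requirements-14.3.txt"
CONSTRAINTS_FILE="constraints-dbr-14-3.txt"

# Re-download the DBR requirements and drop the libraries we manage ourselves.
curl -fsSL "$DBR_REQUIREMENTS_URL" | grep -v '^tensorflow' > "$CONSTRAINTS_FILE"

# Recompile the concrete requirements against the refreshed constraints.
pip-compile -c "$CONSTRAINTS_FILE" --output-file requirements.txt requirements.in
```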
You may also need to remove indirect dependencies of the libraries you are changing from the constraints file. `pip-compile` will report such version incompatibilities during compilation. Usually, you can reach a resolvable setup after a few rounds of removing libraries from the constraints file.
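For example, you can extend the `grep` filter to drop several entries at once; the package names below are illustrative, and the ones you actually need to remove will come from `pip-compile`'s error output:

```bash
# Remove tensorflow and (illustrative) related packages from the constraints.
curl -fsSL https://docs.databricks.com/en/_extras/documents/requirements-14.3.txt \
  | grep -vE '^(tensorflow|keras|tensorboard)' > constraints-dbr-14-3.txt
```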
This approach allows you to leverage the stability of pre-installed DBR libraries while maintaining control over specific dependencies, keeping your Databricks Python jobs both flexible and maintainable.