databricksruntime/python:12.2-LTS: installed packages do not match the ones documented in the databricks runtime #150

Open
marcindulak opened this issue Nov 16, 2023 · 7 comments


@marcindulak

marcindulak commented Nov 16, 2023

I'm comparing the packages installed on databricksruntime/python:12.2-LTS with the list of packages at https://docs.databricks.com/en/release-notes/runtime/12.2lts.html extracted as runtime.txt

IMAGE=databricksruntime/python:12.2-LTS
# No -it here: allocating a TTY adds carriage returns to the redirected output
docker run --rm --name databricks "$IMAGE" bash -c "/databricks/python3/bin/python -m pip freeze" > image.txt
meld runtime.txt image.txt

I'm not providing the full diff, but the screenshot shows that some package versions do not match and that the container image contains fewer packages than the official runtime documentation lists.
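For anyone who prefers a scripted comparison over a GUI diff, here is a minimal Python sketch. It assumes both runtime.txt and image.txt contain pip-freeze style `name==version` lines (the format of runtime.txt is not stated above, so that is an assumption). The function names `parse_freeze` and `compare` are made up for illustration.

```python
import re
from typing import Dict, List, Tuple


def parse_freeze(text: str) -> Dict[str, str]:
    """Parse `name==version` lines into {normalized_name: version}."""
    packages: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()  # also drops stray carriage returns
        # Skip blanks, comments, pip options, and anything that is not a plain pin.
        if not line or line.startswith(("#", "-")) or "==" not in line:
            continue
        name, version = line.split("==", 1)
        # Normalize names so that `Foo_Bar`, `foo-bar` and `foo.bar` compare equal.
        key = re.sub(r"[-_.]+", "-", name.strip()).lower()
        packages[key] = version.strip()
    return packages


def compare(
    runtime: Dict[str, str], image: Dict[str, str]
) -> Tuple[List[Tuple[str, str, str]], List[str]]:
    """Return (version mismatches, packages listed for the runtime but absent from the image)."""
    mismatched = sorted(
        (name, runtime[name], image[name])
        for name in runtime.keys() & image.keys()
        if runtime[name] != image[name]
    )
    missing = sorted(runtime.keys() - image.keys())
    return mismatched, missing


# Assumed usage, with the two files from the commands above:
#   from pathlib import Path
#   mismatched, missing = compare(
#       parse_freeze(Path("runtime.txt").read_text()),
#       parse_freeze(Path("image.txt").read_text()),
#   )
```

Each mismatch is reported as `(name, runtime_version, image_version)`, which keeps the two kinds of difference discussed in this issue (different versions versus missing packages) separate.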

Screenshot from 2023-11-16 19-44-17

@xinzhao-db
Contributor

This is by design. DCS is not a DBR replica; it provides a way to customize the container environment. Hence we only install the packages required for running DBR, and users can install other packages in their own customization layer on top if they want.

@marcindulak
Author

> This is by design. DCS is not a DBR replica; it provides a way to customize the container environment. Hence we only install the packages required for running DBR, and users can install other packages in their own customization layer on top if they want.

This approach causes issues like #148 or #131, as the packages installed on DCS appear untested.

Users of DCS need to:
1: Collect a list of packages from DBR documentation, e.g. https://docs.databricks.com/en/release-notes/runtime/12.2lts.html
2: Try to overwrite the packages in DCS accordingly
3: Test the functionality of DCS

If 1 to 3 were in the design scope, that would reduce the amount of work placed on the customers of Databricks.
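As a rough illustration of step 3, a basic import smoke test could look like the sketch below. It is not how DCS images are tested today; the function name `smoke_test_imports` and the idea of passing it a list of module names are assumptions made for this example.

```python
import importlib
from typing import Dict, Iterable


def smoke_test_imports(module_names: Iterable[str]) -> Dict[str, str]:
    """Try to import each module and return {module: error} for the ones that fail."""
    failures: Dict[str, str] = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:
            # ImportError is the common case, but a broken dependency
            # can raise other exceptions at import time as well.
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures
```

Run inside the image with the interpreter at /databricks/python3/bin/python, a check like `smoke_test_imports(["jinja2"])` would be expected to surface an import failure of the kind reported in #148 before customers hit it.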

@xinzhao-db
Contributor

xinzhao-db commented Feb 8, 2024

  1. "Custom containers built on databricksruntime/standard are still missing required packages?" (#131) is not an issue IMO because:
  • We only support DCS on DBR LTS versions, and DBR 13.2 is not an LTS version
  • We mention in README.md not to use :latest
  • :13.3-LTS should work
  "databricksruntime/python:12.2-LTS: jinja2 ImportError" (#148) is an issue caused by package versions that differ from DBR; we will fix them.

  2. Customers are using DCS differently:

  • Some customers want to add additional packages
  • Some customers want to overwrite with a different version but they don't know if it will cause a conflict
  • Some customers want to remove all the packages they don't need but they don't know if the removal will cause issues

To best serve all of the above needs, we only provide a minimal set of packages in DCS, so that in most cases customers won't have concerns as long as they don't touch the pre-installed packages in DCS.

DCS gives customers the flexibility to customize the environment, and we are responsible for providing a starting point (the example images) for that customization. However, customers are always responsible for testing their customization, because even adding a new package can trigger an upgrade of another package and cause conflicts. (In your case, if you just add a package at the same version as in DBR, then yes, you don't need to test that part.)

@marcindulak
Author

>   2. Customers are using DCS differently:
>   • Some customers want to add additional packages
>   • Some customers want to overwrite with a different version but they don't know if it will cause a conflict
>   • Some customers want to remove all the packages they don't need but they don't know if the removal will cause issues

I would also like to consider the following case:

  • Some customers want the package versions in the DCS image to correspond to the DBR runtime

> To best serve all of the above needs, we only provide a minimal set of packages in DCS, so that in most cases customers won't have concerns as long as they don't touch the pre-installed packages in DCS.

The screenshot attached to the initial issue description shows otherwise: customers must reinstall some packages in a DCS image in order for the packages to match the versions in DBR.

> However, customers are always responsible for testing their customization, because even adding a new package can trigger an upgrade of another package and cause conflicts.

I would like to avoid those risks. I would expect the packages in the DCS base image to correspond exactly to the versions in DBR, so customers don't need to attempt overriding them to match the versions present in DBR, as this may introduce or leave in place other, incompatible package versions as dependencies.

Some suggestions for improving the quality of the images produced in this repo:

  1. If the mismatch between the DCS and DBR package versions is to continue, introduce basic post-build functionality tests of the packages included in the DCS image, e.g. using GitHub Actions.
  2. Make available a machine-readable list (e.g. requirements.txt for Python) of the packages present in DBR, in addition to the human-readable tables at e.g. https://docs.databricks.com/en/release-notes/runtime/12.2lts.html, so customers can fetch that list programmatically and override the packages installed in DCS images. On the other hand, if such a list were available, maybe the DCS images could use it as well?
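Until such a list exists, the human-readable tables have to be converted by hand or by script. A minimal sketch of that conversion follows. It assumes the table has been copied out as pipe-delimited rows in which library and version cells alternate; that layout, the function name `table_rows_to_requirements`, and the package names in the comments are all assumptions for illustration, not the actual format of the release notes page.

```python
from typing import List


def table_rows_to_requirements(rows: List[str]) -> List[str]:
    """Turn rows like '| pkg-a | 1.0 | pkg-b | 2.0 |' into 'name==version' pins."""
    pins: List[str] = []
    for row in rows:
        cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
        # Cells are assumed to alternate: library, version, library, version, ...
        for name, version in zip(cells[0::2], cells[1::2]):
            if not name or not version:
                continue  # padding cells in a short last row
            if name.lower() == "library":
                continue  # header row
            if set(name) <= set("-: "):
                continue  # separator row such as |---|---|
            pins.append(f"{name}=={version}")
    return sorted(pins, key=str.lower)
```

The resulting lines can be written to a requirements or constraints file. The output still needs testing against the image, since the documented list may not be installable as-is.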

@xinzhao-db
Contributor

I think I didn't express myself clearly enough. I thought you mentioned two issues:

  1. Package version mismatch compared to the corresponding version of DBR. It's indeed an issue and we will fix it.
  2. Missing packages compared to the corresponding version of DBR. As I mentioned above, this is by design. Are you suggesting we change this as well?

@marcindulak
Author

marcindulak commented Feb 9, 2024

> I think I didn't express myself clearly enough. I thought you mentioned two issues:
>
>   1. Package version mismatch compared to the corresponding version of DBR. It's indeed an issue and we will fix it.

I'll also link an older issue here about a version mismatch between DCS and DBR on LTS, #87, to be closed after a general fix is implemented.

>   2. Missing packages compared to the corresponding version of DBR. As I mentioned above, this is by design. Are you suggesting we change this as well?

If the design is open for improvements, then yes, I think point 2. above would be an optimal starting point for the DCS images. I understand it is a lot of work, and it may even be infeasible unless all package versions match. It would mean using pip freeze to capture the packages from all sys.path locations on DBR and applying that list to the DCS images.
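To make the "capture all sys.path locations" idea concrete, here is a minimal sketch using the standard library's `importlib.metadata` rather than pip itself (a swapped-in technique, chosen because it can be run from inside a notebook or job). The function name `freeze_all_paths` is hypothetical, and whether this sees every package that DBR ships is an assumption I have not verified.

```python
import sys
from importlib import metadata
from typing import Dict, List, Optional


def freeze_all_paths(paths: Optional[List[str]] = None) -> List[str]:
    """List 'name==version' for every distribution found on the given paths (default: sys.path)."""
    search_paths = sys.path if paths is None else paths
    seen: Dict[str, str] = {}
    for dist in metadata.distributions(path=search_paths):
        name = dist.metadata.get("Name")
        if not name:
            continue  # skip distributions with broken or missing metadata
        # Keep the first hit per name: earlier search path entries are found first.
        seen.setdefault(name.lower(), f"{name}=={dist.version}")
    return sorted(seen.values(), key=str.lower)
```

Running `print("\n".join(freeze_all_paths()))` on a DBR cluster would give a candidate list to apply to a DCS image. The list would still need the kind of testing discussed above.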

@ericfeunekes

> 2. Missing packages compared to the corresponding version of DBR. As I mentioned above, this is by design. Are you suggesting we change this as well?

I think I see what you mean and I agree to some extent that having the flexibility is great. However, I think it would be very useful to have the option for an image that matches DBR. You have a minimal image, then one that installs python. Why not also have one that is a match to the applicable DBR?

The benefit would be that it'd be easy for customers to also grab the requirements file, make whatever adjustments they need, and create their own version from the base python image.

I also find it confusing that the repository is named "databricksruntime" but it doesn't actually match the DBR. I know it's too late to change that. Just confusing.
