
Security exceptions when working with Databricks Unity Catalog #2459

Open
2 tasks done
aamir-rj opened this issue Oct 9, 2024 · 14 comments
Labels
bug Something isn't working

Comments

@aamir-rj

aamir-rj commented Oct 9, 2024

What happens?

When working on clusters which are in shared access mode with Unity Catalog, Splink throws Py4J security exceptions.

To Reproduce

[screenshot]

OS:

databricks

Splink version:

pip install splink==2.1.14

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@aamir-rj aamir-rj added the bug Something isn't working label Oct 9, 2024
@fscholes

Incidentally enough, just yesterday I was talking to Databricks about this, and it's because Splink uses custom jars that aren't supported on shared-mode/serverless clusters. They told me it is on the roadmap.

The current way around it, other than redeveloping Splink to remove the custom jars, is to use a single-user cluster on Databricks.

@RobinL
Member

RobinL commented Oct 10, 2024

@aamir-rj I'm afraid we don't have access to Databricks, so we can't really help with these sorts of errors. There are various discussions of similar issues that may (or may not) help:

#2295

Thanks @fscholes! Incidentally, if you ever have the chance to mention it to Databricks, it'd be great if they could simply add a couple of (fairly simple) functions into Databricks itself so the custom jar was no longer needed. It causes people a lot of hassle due to these security issues. The full list of custom UDFs is here:
https://github.com/moj-analytical-services/splink_scalaudfs

But probably jaro-winkler would be sufficient for the vast majority of users!
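
For context, this is roughly the kind of call Splink's Spark backend makes to register those jar-based UDFs - a sketch only, with an illustrative class name (the real ones live in the splink_scalaudfs repo linked above) - and it is exactly the sort of JVM call that shared-access-mode clusters block:

from pyspark.sql.types import DoubleType

# Register a scala UDF from the custom jar (class name is illustrative).
# On a single-user cluster this works; on a shared / Unity Catalog cluster the
# underlying JVM call is typically blocked, surfacing as a py4j security exception.
spark.udf.registerJavaFunction(
    "jaro_winkler",
    "uk.gov.moj.dash.linkage.JaroWinklerSimilarity",
    DoubleType(),
)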

@pallyndr

Are you aware Databricks has an ARC module, based on Splink?
https://www.databricks.com/blog/linking-unlinkables-simple-automated-scalable-data-linking-databricks-arc

@RobinL
Member

RobinL commented Oct 10, 2024

Thanks. Yes. As far as I know, it doesn't get around this jar problem (but I would love to be corrected on that!)

@fscholes

I'm aware of ARC, but I don't think it's been actively developed for quite some time now

@aamir-rj
Author

aamir-rj commented Oct 13, 2024 via email

@ADBond
Contributor

ADBond commented Oct 14, 2024

Another possible way around this, for anyone using splink >= 3.9.10, is an option to opt out of registering the custom jars. In Splink 4 this looks like:

...
spark_api = SparkAPI(spark_session=spark, register_udfs_automatically=False)
linker = Linker(df, settings, db_api=spark_api)
...

See issue #1744 and option added in #1774.
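
For a fuller picture, here is a minimal sketch of how that option can sit alongside the rest of a Splink 4 setup, assuming df is a Spark DataFrame; the column names, comparisons and blocking rule below are purely illustrative, chosen to avoid the jar-backed string functions:

from splink import Linker, SparkAPI, SettingsCreator, block_on
import splink.comparison_library as cl

# Skip registering the custom scala UDF jar, which shared-mode clusters reject
spark_api = SparkAPI(spark_session=spark, register_udfs_automatically=False)

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        # stick to comparisons backed by built-in Spark SQL functions,
        # since the jaro-winkler UDF from the custom jar isn't registered
        cl.ExactMatch("first_name"),
        cl.LevenshteinAtThresholds("surname", 2),
    ],
    blocking_rules_to_generate_predictions=[block_on("dob")],
)

linker = Linker(df, settings, db_api=spark_api)
df_predictions = linker.inference.predict()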

@aamir-rj
Author

aamir-rj commented Oct 16, 2024 via email

@ADBond
Contributor

ADBond commented Oct 16, 2024

> I used the below option but always get this error: name 'alt' is not defined. Tried installing alt but still the same issue. [screenshot]

From the image it looks like you are mixing Splink 2 (splink.Splink) and Splink 4 (splink.SparkAPI) code. The option register_udfs_automatically is only available in Splink 4 and later versions of Splink 3 - there is no equivalent in Splink 2.
If you are able to upgrade then I would recommend moving to Splink 4, as then all of the documentation will be applicable.

If you are not able to upgrade then you will need to stick to the single user cluster workaround.

@aamir-rj
Author

aamir-rj commented Oct 16, 2024 via email

@ADBond
Contributor

ADBond commented Oct 16, 2024

What I mean is that the code you appear to be running is not valid Splink 4 code - there is no longer a Splink object to import, nor a get_scored_comparisons() method. You will need to adjust your code to something like:

from splink import Linker, SparkAPI

...
spark_api = SparkAPI(spark_session=spark, register_udfs_automatically=False)
linker = Linker(df, settings, db_api=spark_api)

df_dedupe_result = linker.inference.predict()

As to your error, it appears that you do not have the required dependency altair installed, which is why you get the error name 'alt' is not defined. If you install this dependency the error should go away.

@aamir-rj
Author

[screenshot]

@ADBond
Contributor

ADBond commented Oct 16, 2024

@aamir-rj do you have altair installed? It seems from your screenshot that it is not, but it's hard to tell as not all of the cell output is visible. You can check by seeing what happens if you try to run import altair.

It should install automatically with splink as it is a required dependency - but if not for some reason you can run pip install altair==5.0.1 to get it.
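
A quick way to check in a notebook cell (nothing Splink-specific, just confirming the dependency is importable and which version you have):

# confirm altair is importable and see which version is installed
import altair as alt
print(alt.__version__)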

@aymonwuolanne
Contributor

@aamir-rj and anyone else who stumbles across this error: I've had the same problem with importing altair. The solution was to restart the Python kernel after installing splink, like this:

%pip install splink 

dbutils.library.restartPython()
