Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Classpath stability for tests #261

Open
seancorfield opened this issue Nov 14, 2022 · 25 comments
Open

Classpath stability for tests #261

seancorfield opened this issue Nov 14, 2022 · 25 comments
Labels
improvement Not a bug but an improvement to overall user experience

Comments

@seancorfield
Copy link
Contributor

Describe the bug
The classpath that poly test calculates is not stable/reproducible across machines, leading to unpredictable behavior of tests across multiple machines.

To Reproduce
I added debug output to the test orchestrator to print all-paths so that I could see the order being different on multiple machines. The Jetty libraries have changed their group/artifact over time but retained some of the same class names which means that a non-deterministic ordering of the classpath means that you can have tests pass on one machine but fail on another due to that ordering cause a different artifact to be found "first" on the classpath and an incompatible class used.

Expected behavior
tools.deps.alpha uses a specific sorting algorithm to ensure a reproducible classpath across all environments -- the ideal would be Polylith producing that exact same order (this is tricky because poly test has to construct a projection of the :test alias across all the bricks in order to construct the "basis"). Not so ideal but acceptable would at least be for the order of the classpath produced by poly test to be deterministic. That means not using sets or hash maps to store elements of the classpath.

@tengstrand
Copy link
Collaborator

This should be solved by #263 @seancorfield I think.

@seancorfield
Copy link
Contributor Author

No. This is not fixed. Please reopen it.

@tengstrand tengstrand added the improvement Not a bug but an improvement to overall user experience label Mar 13, 2023
@tengstrand
Copy link
Collaborator

I now sort the libs (line 17) in the issue-261 branch:

(defn resolve-deps [{:keys [name] :as project} settings is-verbose color-mode]
(try
(sort (into #{} (mapcat #(-> % second :paths))
(deps/resolve-deps project settings is-verbose)))

I'm not sure if that solves your problem @seancorfield. I couldn't figure out how to let the poly tool produce the exact same order as tools.deps (even though you tried to explain it).

@seancorfield
Copy link
Contributor Author

That may make things worse -- but it does at least make things predictable (even if it ends up being predictably wrong).

I want to keep this open until I can spend some time looking into this problem. It's not going to affect most people, with smaller projects, but this sort of thing can have a really bad impact on projects of the scale we have at work.

@ieugen
Copy link
Contributor

ieugen commented May 5, 2023

I think I've also encountered this bug.

In my case I get this error while running clojure -M:poly test

Caused by: java.sql.SQLException: No suitable driver
 at java.sql.DriverManager.getDriver (DriverManager.java:299)
    com.zaxxer.hikari.util.DriverDataSource.<init> (DriverDataSource.java:106)
    com.zaxxer.hikari.pool.PoolBase.initializeDataSource (PoolBase.java:326)
    com.zaxxer.hikari.pool.PoolBase.<init> (PoolBase.java:112)
    com.zaxxer.hikari.pool.HikariPool.<init> (HikariPool.java:93)
    com.zaxxer.hikari.HikariDataSource.getConnection (HikariDataSource.java:112)
    next.jdbc.connection$make_connection.invokeStatic (connection.clj:452)
    next.jdbc.connection$make_connection.invoke (connection.clj:436)
    next.jdbc.connection$eval22475$fn__22476.invoke (connection.clj:481)
    next.jdbc.protocols$eval22205$fn__22206$G__22196__22213.invoke (protocols.clj:25)
    next.jdbc$get_connection.invokeStatic (jdbc.clj:168)
    next.jdbc$get_connection.invoke (jdbc.clj:147)
    dre.semmed_ds.interface_test$fn__93117.invokeStatic (interface_test.clj:17)

I do have com.h2database/h2 {:mvn/version "2.1.214"} added to component class path, :test classpath, dev project classpath and :test classpath and also project test classpath .

I still get the above error.

Inside dev repl I can create the component that creates a hikari datasource using next jdbc component .
I can also run the code for the test and it works.

@tengstrand
Copy link
Collaborator

Is it possible to reproduce the problem in a smaller public repo and share it with me @ieugen?

@ieugen
Copy link
Contributor

ieugen commented May 8, 2023

@tengstrand : I added a repro in this PR ieugen/poly-rcf#1 .

I run clojure -M:poly test .
I have component added to 2 projects.
First project tests are ran ok.
Second project fails with missing class.

Big questions are:

  • Why are tests ran 2 times for the same component?
  • Why is the classpath different for second run?

@wkok
Copy link

wkok commented Mar 2, 2024

Any update on this issue?

We're encountering the same problem on some machines when running poly tests and at first we thought it's related to a specific dependency's problem as we've logged here:
https://discuss.xtdb.com/t/arrow-vector-conflicting-dependency/380/2

but then I discovered this issue which might explain better what we're seeing.

@tengstrand
Copy link
Collaborator

I have some ideas on how to solve this, so will have a look.

@seancorfield
Copy link
Contributor Author

seancorfield commented Mar 2, 2024

The reason we haven't spent any time on this at work (since we were the most affected and the first to be affected) is that we switched to the external test runner and, combined with a small change to our dependencies (which I no longer remember, sorry), the problem went away.

I am concerned that any changes to the classpath order at this point could break someone's setup, i.e., making it worse for some people, even if it makes it better for others.

Part of the problem here is that poly test uses the projects deps.edn but then augments the :test alias with parts of the :test alias from each of the bricks' deps.edn -- this is inherently different to how tools.deps does things and so there's no "global" view of dependencies. Artificially ordering the classpath is almost certainly the wrong approach (effectively what the suggested change of March 30th, 2023 proposed).

Polylith's decision to have :test aliases in bricks with each brick's test-only dependencies is part of the cause here but there's no reliable way to combine those into the projects dependencies (and then use tools.deps just once on the result) and guarantee that you'll a sane behavior, unfortunately.

I've continued to think about this problem on and off over the past year and I still don't have a good answer. The closest I've come is to make a fork of tools.deps that supports transitive alias usage across :local/root dependencies, since that's really the only way to guarantee the behavior that poly test is going to need (and that's non-trivial -- I've taken a few runs at it, unsuccessfully).

@tengstrand
Copy link
Collaborator

I have copied (and slightly changed) the poly-rcf example to examples/poly-rcf in the polylith repo, that also includes all the instructions on how to test the code and what changes that have been made. I also wanted to see if the code worked as a monolith, so I created a monolith directory that contains the same code and libraries.

I still don't understand why the default test runner doesn't behave exactly the same on all machines. I get that it doesn't behave the same as tools.deps, but it should at least behave the same every time. So if it works on one machine, it should also work on another machine? Maybe you have some ideas?

@tengstrand
Copy link
Collaborator

The main problem was that the tests didn't run in sufficient isolation, but when I switched to the external test runner it worked! I have summarized what I did here.

@seancorfield
Copy link
Contributor Author

Good to know that the external test runner is a potential solution here.

One extra thing it has allowed us to do (and a blog post is coming on this at some point): merge our poly test into our build.clj process. Previously (0.2.18), we used to spawn a new Java process in our build that ran poly test as a command -- we'd originally done this to isolate our tests from our build deps -- then with the external test runner, we ended up with three processes running: build, poly test, and the tests themselves. With 0.2.19, we've added the poly dependency to our :build alias and the poly API allows us to kick off poly test (essentially) in the same process as build, and then our tests run in a completely isolated process.

The new poly API also allows us to compute changed projects in build instead of spawning a Java process to run the poly command so that has speeded things up a bit too.

tengstrand added a commit that referenced this issue Mar 6, 2024
issue #261
* Added examples/poly-rcf.
* 0.2.20-SNAPSHOT #6.
@tengstrand
Copy link
Collaborator

tengstrand commented Mar 7, 2024

I was thinking that it could be time to get rid of these isolation problems. My idea is to copy the external test runner code into the polylith repo (and make necessary modifications). What do you think @seancorfield? Then we could either make this the default test runner (in that case we need to discuss this with the community). Or we could set it in the global test config when a workspace is created. When we implement a test runner for ClojureScript (when we have support for cljs) we could make it isolated too, and everything would be consistent. @furkan3ayraktar @imrekoszo

@seancorfield
Copy link
Contributor Author

I'm torn. I think encouraging multiple test runner implementations is a good thing. I don't think favoring one over another is ideal.

For most projects, Polylith's in-process test runner is just fine. It was fine for us for quite a while, even at the scale (complexity) we reached.

Some people are going to want to use Kaocha -- and maybe they'd want the option of an in-process vs external test running (the latter is not available as I understand it?).

There are also all sorts of controls folks might want to apply to their test running process: should it include src as well as test (I recently created an issue about that in my external test runner), should tests be filtered by metadata somehow, e.g,. the ability to mark tests as "integration" (vs unit) so you could ask poly test to only run certain types of tests.

Basically, there are all sorts of "knobs and dials" that test runners provide that -- currently -- poly test has no way to tap into. Look at the options for Cognitect's test runner and Kaocha. Somehow, we should provide a way for those sorts of configuration to be easily available to any test runner.

I'd prefer to stay with the current overall machinery but improve the test runner "API" and make it easier for poly test to pass additional (command-line) options through to the underlying test runner. It would be nice if Polylith's documentation around testing talked about how to use the two available test runners, which might encourage others to write their own.

@tengstrand
Copy link
Collaborator

Good points!

@imrekoszo
Copy link
Contributor

imrekoszo commented Mar 7, 2024

I agree with Sean's last comment here, here are my thoughts:

Kaocha -- and maybe they'd want the option of an in-process vs external test running (the latter is not available as I understand it?)

Yep, I have been wanting to add support for this to polylith-kaocha but I haven't got to it so far.

there are all sorts of "knobs and dials" that test runners provide

This is very true. There is a good blog post from back in 2018 about the various responsibilities around running tests, and how Kaocha chose to address a subset of those.

I had made a related comment on another issue here in 2021 and I still agree with the main points in that:

  • poly should be responsible for
    • calculating project classpaths
    • determining which projects/bricks are affected by changes since the stable point
  • test runners should be responsible for
    • discovering/running/reporting etc. tests given a classpath

A default, built-in test runner helps a lot with getting started with a workspace (and also by serving as an example for 3rd party runners). Whether that's an in-process or external (or there are built-in options for both) doesn't really matter IMO.

BTW, following up on Sean's suggestion about perhaps forking tools.deps for classpath calculation, there's a lib for classloader management - perhaps it can be helpful: https://github.com/lambdaisland/classpath

Edit: now that I look at the api of that lib, it might not be that useful after all

@tengstrand
Copy link
Collaborator

Good input @imrekoszo.

@tengstrand
Copy link
Collaborator

I managed to get the poly-rcf example repo to work (see the comments in the readme). Have you had time to look at you problems, and take similar actions that I did @ieugen @wkok ?

@wkok
Copy link

wkok commented Mar 7, 2024

I managed to get the poly-rcf example repo to work (see the comments in the readme). Have you had time to look at you problems, and take similar actions that I did @ieugen @wkok ?

Hey @tengstrand we have since swapped out dependencies in our project (due to other reasons) and are not seeing this problem anymore with our current set of dependencies when running poly tests.

@ieugen
Copy link
Contributor

ieugen commented Mar 7, 2024

Have you had time to look at you problems, and take similar actions that I did

Not yet. I am busy with some other work. I might not be able to do that in the near future.
I will post here when I do.

@tengstrand
Copy link
Collaborator

I created issue 448. Please take a look and see if it looks reasonable @seancorfield @ieugen.

@seancorfield
Copy link
Contributor Author

I've responded there but I'm not sure why you've created that issue since it doesn't appear to solve any issue raised here (or anywhere else)?

@tengstrand
Copy link
Collaborator

It's mentioned by @ieugen here:
"Polylith does not pass source paths to tests so to make tests unde src/ work, you will have to add them to your :extra-paths in :test alias. Hopefully this issue is going to be fixed and RCF will work out of the box, just by providing the JVM option.".

@tengstrand
Copy link
Collaborator

I added the link to issue 448 because it was mentioned in the workspace that reproduced this issue. But yes, you are right Sean that it wasn't directly related to this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Not a bug but an improvement to overall user experience
Projects
None yet
Development

No branches or pull requests

6 participants