
Refactor worker selection to facilitate adding multiple worker runtimes #6387

Merged — 10 commits merged into master on Jan 25, 2021

Conversation

@aoldershaw (Contributor) commented Jan 2, 2021

What does this PR accomplish?

Refactor

Make it easier to support multiple runtimes in the future by removing the responsibility of selecting a worker from the worker.Client. This also makes it possible to schedule multiple containers on the same worker, which would be useful for Services.

Changes proposed by this PR:

As was shown in #5223, the most natural way to support new runtimes is by constructing a new implementation of worker.Client (defining how to RunTaskStep, RunGetStep, etc.), rather than trying to find a more fine-grained common interface between worker runtimes up-front.

It's worth noting that a worker.Worker interface already exists and seems to address some concerns of multiple runtimes. However, to me, worker.Worker doesn't really have a clear identity as an interface: many of its methods just delegate to other components (DB, VolumeClient) and aren't runtime specific (though VolumeClient possibly will be). A few methods (FindContainerByHandle, FindOrCreateContainer, Fetch?) are runtime specific and should probably be moved to worker.Client eventually, but that's out of scope for this PR.

The prior responsibilities of worker.Client included selecting a worker.Worker from the worker.Pool, and then running some (Garden specific) logic, including calling some methods on the worker.Worker to create containers. e.g. the flow for a get step is:

exec.GetStep.Run(...)
    -> worker.Client.RunGetStep()
        -> worker.Pool.FindOrChooseWorkerForContainer()
            <- worker.Worker
        -> worker.Worker.Fetch(...)
            -> Garden specific stuff

Now, worker.Client just represents a single worker, and worker.Pool hands out worker.Clients. Each step first selects a worker.Client via the pool, and then runs the appropriate step on it. The new flow is:

exec.GetStep.Run(...)
    -> worker.Pool.SelectWorker()
        <- worker.Client
    -> worker.Client.RunGetStep()
        -> worker.Worker.Fetch(...)
            -> Garden specific stuff
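The new flow above can be sketched in Go. This is an illustrative simplification, not Concourse's actual API: the interface and type names follow the PR's description, but the signatures (and the stub implementations) are invented for the example.

```go
package main

import "fmt"

// Client represents a single worker and knows how to run each step type.
// A hypothetical, pared-down version of the worker.Client described above.
type Client interface {
	Name() string
	RunGetStep() error
}

// Pool selects a Client for a step to run on, mirroring worker.Pool.SelectWorker.
type Pool interface {
	SelectWorker() (Client, error)
}

// gardenClient stands in for the Garden-backed worker.Client implementation;
// a k8s (or nomad, or whatever) runtime would be another Client implementation.
type gardenClient struct{ name string }

func (c gardenClient) Name() string      { return c.name }
func (c gardenClient) RunGetStep() error { return nil }

// staticPool is a trivial Pool that always selects its first worker.
type staticPool struct{ workers []Client }

func (p staticPool) SelectWorker() (Client, error) {
	if len(p.workers) == 0 {
		return nil, fmt.Errorf("no workers available")
	}
	return p.workers[0], nil
}

// selectAndRun mirrors the new flow: select a worker first, then run the
// step against that single worker's Client.
func selectAndRun(p Pool) (string, error) {
	client, err := p.SelectWorker()
	if err != nil {
		return "", err
	}
	return client.Name(), client.RunGetStep()
}

func main() {
	pool := staticPool{workers: []Client{gardenClient{name: "worker-1"}}}
	name, err := selectAndRun(pool)
	fmt.Println(name, err)
}
```

Because the step only ever sees the Client interface, swapping the runtime means swapping which Client implementation the Pool hands back.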

This can also enable running services for task steps, e.g.

exec.TaskStep.Run(...)
    -> worker.Pool.SelectWorker()
        <- worker.Client
    -> worker.Client.RunService()
    -> worker.Client.RunTaskStep() // same worker

Additionally, this adds an ArtifactSourcer interface for converting runtime.Artifacts to worker.ArtifactSources. It doesn't have a super strong identity IMO, but it means we don't need to hold duplicated info on the worker.ContainerSpec and hackily modify the spec when running steps.
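A minimal sketch of the ArtifactSourcer idea: convert runtime artifacts into stream-able sources up front, instead of stashing artifacts on the ContainerSpec and mutating the spec while running a step. The type names are modeled on the PR description, but the fields and signatures here are assumptions made for illustration.

```go
package main

import "fmt"

// Artifact is a stand-in for runtime.Artifact.
type Artifact struct{ ID string }

// ArtifactSource is a stand-in for worker.ArtifactSource.
type ArtifactSource interface {
	VolumeHandle() string
}

// ArtifactSourcer converts runtime artifacts to sources before the
// ContainerSpec is built, so the spec never needs to be modified later.
type ArtifactSourcer interface {
	SourceArtifact(a Artifact) (ArtifactSource, error)
}

type volumeSource struct{ handle string }

func (v volumeSource) VolumeHandle() string { return v.handle }

// poolSourcer stands in for an implementation that would look up the
// artifact's backing volume via the worker pool.
type poolSourcer struct{}

func (poolSourcer) SourceArtifact(a Artifact) (ArtifactSource, error) {
	if a.ID == "" {
		return nil, fmt.Errorf("artifact has no ID")
	}
	return volumeSource{handle: "vol-" + a.ID}, nil
}

func main() {
	var sourcer ArtifactSourcer = poolSourcer{}
	src, err := sourcer.SourceArtifact(Artifact{ID: "abc"})
	if err != nil {
		panic(err)
	}
	fmt.Println(src.VolumeHandle())
}
```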

Reviewer Checklist

  • Code reviewed
  • Tests reviewed
  • Documentation reviewed
  • Release notes reviewed
  • PR acceptance performed
  • New config flags added? Ensure that they are added to the BOSH and Helm
    packaging; otherwise, they are ignored for the integration tests (for
    example, if they are Garden configs that are not displayed in the
    --help text).

@aoldershaw added the refactor, misc, and release/no-impact (an issue that never affected released versions, i.e. a regression caught prior to shipping) labels on Jan 2, 2021
@aoldershaw changed the title from "Select worker explicitly during step execution" to "Refactor worker selection" on Jan 3, 2021
@aoldershaw changed the title from "Refactor worker selection" to "Refactor worker selection to facilitate adding multiple worker runtimes" on Jan 3, 2021
(and volume creation), rather than the worker.Client.

the eventual goal is to convert worker.Client into a client for a single
worker, rather than all the workers. the current flow for e.g. a `get`
step is:

exec.GetStep.Run(...)
    -> worker.Client.RunGetStep()
        -> worker.Pool.FindOrChooseWorkerForContainer()
            <- worker.Worker
        -> worker.Worker.Fetch(...)
            -> Garden specific stuff

the problem with this is that `worker.Worker` is Garden specific.
`worker.Client` is the ideal entrypoint to supporting multiple runtimes
(e.g. Ciro's K8s runtime PoC modified `worker.Client` to be k8s specific
instead), but since a `worker.Client` is the interface to interact with
ALL workers, that meant breaking Garden-based workers. if instead, the
flow went like:

exec.GetStep.Run(...)
    -> worker.Pool.FindOrChooseWorkerClientForContainer()
        <- worker.Client
    -> worker.Client.RunGetStep()
        -> worker.Worker.Fetch(...)
            -> Garden specific stuff

...then it's easy to swap out the `worker.Client` with a k8s (or nomad,
or whatever) implementation.

this also makes it possible to run multiple things on the same worker
explicitly. e.g. for my services RFC, it'd be nice to say:

exec.TaskStep.Run(...)
    -> worker.Pool.FindOrChooseWorkerClientForContainer()
        <- worker.Client
    -> worker.Client.RunService()
    -> worker.Client.RunTaskStep() // same worker

Signed-off-by: Aidan Oldershaw <[email protected]>
since worker.Client will be scoped to a single worker, and finding
volumes requires a pool of workers, streaming artifacts shouldn't be
tied to worker.Client.

rather than putting it on worker.Pool, create a new component that
depends on worker.Pool. this is because the API doesn't need artifact
streaming (via the artifact interface, at least), and we'd otherwise
need to provide a dummy compression implementation like before, which
doesn't make much sense - better to decouple the responsibilities
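The decoupling described in this commit could look roughly like the sketch below: the streamer composes the pool and compression rather than living on either of them, so the API's worker.Pool never needs a dummy compression. All names and signatures here are illustrative assumptions, not Concourse's real types.

```go
package main

import "fmt"

// Compression stands in for the compression dependency the streamer needs
// but the API's worker.Pool does not.
type Compression interface{ Encoding() string }

type gzipCompression struct{}

func (gzipCompression) Encoding() string { return "gzip" }

// VolumeFinder is the narrow slice of worker.Pool that streaming needs.
type VolumeFinder interface {
	LocateVolume(handle string) (string, error)
}

type fakePool struct{}

func (fakePool) LocateVolume(handle string) (string, error) {
	return "worker-1", nil // pretend every volume lives on worker-1
}

// ArtifactStreamer depends on the pool instead of being a method on it.
type ArtifactStreamer struct {
	volumes     VolumeFinder
	compression Compression
}

// Describe reports how a volume would be streamed; a real implementation
// would open streams instead of returning a description.
func (s ArtifactStreamer) Describe(handle string) (string, error) {
	worker, err := s.volumes.LocateVolume(handle)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("stream %s from %s using %s", handle, worker, s.compression.Encoding()), nil
}

func main() {
	s := ArtifactStreamer{volumes: fakePool{}, compression: gzipCompression{}}
	desc, _ := s.Describe("vol-1")
	fmt.Println(desc)
}
```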

Signed-off-by: Aidan Oldershaw <[email protected]>
the desired result is to remove worker.Client's need to have a reference
to the worker.Pool (since worker.Pool will be constructing
worker.Clients in later commits). while we could pass a parent reference
to the worker.Client upon construction, that isn't so obvious -
requiring a tight interface in the method signatures is more clear IMO

Signed-off-by: Aidan Oldershaw <[email protected]>
use lagerctx instead

Signed-off-by: Aidan Oldershaw <[email protected]>
...by removing a method that needn't be on it. FindOrChooseWorker sounds
like it'd be used elsewhere, but it's just used internally to choose a
random worker for a volume - so, I gave it a more honest name and made
it private

Signed-off-by: Aidan Oldershaw <[email protected]>
...rather than in the worker.Client. there was some awkwardness in the
ContainerSpec. task/put inputs and (at times) images would be passed in
as runtime.Artifacts for the sole purpose of converting them to Sources
(which are also stored on the ContainerSpec). there was a TODO in the
worker.Client to this effect:

// don't modify spec inside here, Specs don't change after you write them

I considered just adding this responsibility to the worker.Pool (since
it already knows where to find volumes), but that feels like giving too
much responsibility to the Pool (and would require adding back the dummy
`compression` to the API's worker.Pool)

Signed-off-by: Aidan Oldershaw <[email protected]>
rather than in the worker.Client. the worker.Pool now gives a
worker.Client from the SelectWorker method, which makes it possible to
support different runtimes in parallel in the future (though, the
worker.Provider will need to handle this and return the appropriate
worker.Client instead of a worker.Worker).

selecting a worker for task steps has been moved to the TaskDelegate
since there's some funky logic around limit-active-tasks. I also
noticed/fixed a bug in this logic wherein the limit-active-tasks DB lock
would not be released under some failure cases.
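The class of bug described here (a lock released only on the happy path) is worth a small illustration. This is not Concourse's actual limit-active-tasks code; the type and function names are invented, but the pattern shows why deferring the release covers every exit path.

```go
package main

import "fmt"

// taskLock stands in for the limit-active-tasks DB lock.
type taskLock struct{ released bool }

func (l *taskLock) Release() { l.released = true }

// runTaskWithLimit acquires the lock, then runs the task; fail simulates a
// failure partway through. With the release deferred, the lock is freed on
// success *and* on every failure path, instead of only when the task runs
// to completion.
func runTaskWithLimit(fail bool) (*taskLock, error) {
	l := &taskLock{}
	defer l.Release()
	if fail {
		// Without the defer, an early return like this would leak the lock.
		return l, fmt.Errorf("failed to create container")
	}
	return l, nil
}

func main() {
	l, err := runTaskWithLimit(true)
	fmt.Printf("err=%v released=%v\n", err, l.released)
}
```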

side note, but this code really feels like DI/mock hell - it'd be nice
to clean this up/reduce the indirection, but I'll leave that for another
PR.

Signed-off-by: Aidan Oldershaw <[email protected]>
@taylorsilva (Member) left a comment:
I'm so excited about this! Thanks for all the extra work you did :)
There's a lot to go through so I hope I didn't miss anything. The Artifact Streamer and Sourcer seem like good extractions for the volume fetching issue. I did notice they didn't have any tests though. Think you could add some for them?

(resolved review thread on atc/engine/task_delegate.go)
@aoldershaw (Contributor, Author) commented:

@taylorsilva sorry, took me a while to find the time to address your comments! I've added some tests for ArtifactStreamer and ArtifactSourcer. You may notice they look quite a bit different from the existing tests in atc/worker. tbh, I really don't like how atc/worker (and its existing tests) are currently structured - there's the appearance of loose coupling because everything is behind an interface, but in reality, everything is tightly coupled to the current set of abstractions, so making changes becomes a pain - especially because the tests are so mock-heavy.

I kinda want to replace the test suites with slightly higher level tests and remove the convoluted web of dependencies within the package (since many of these interfaces aren't really interfaces - we just use interfaces so we can make testing slightly easier). Really, the garden worker implementation (i.e. most of atc/worker) has only 3 dependencies - Garden, Baggageclaim, and the DB - if we can use in-memory fakes of these (not just dumb counterfeiter stubs), I think the code would be a lot simpler and the tests would be more telling of whether the code actually works.
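The "in-memory fakes, not just dumb counterfeiter stubs" idea could look like the sketch below: the fake actually maintains state and enforces invariants, so tests exercise real behavior instead of pre-programmed return values. The VolumeDB interface here is invented for illustration and is not Concourse's actual DB interface.

```go
package main

import "fmt"

// VolumeDB is a hypothetical slice of the DB dependency a worker needs.
type VolumeDB interface {
	CreateVolume(handle string) error
	FindVolume(handle string) bool
}

// memDB is an in-memory fake: it behaves like a store, including rejecting
// duplicates, rather than returning canned responses per call.
type memDB struct{ volumes map[string]bool }

func newMemDB() *memDB { return &memDB{volumes: map[string]bool{}} }

func (db *memDB) CreateVolume(handle string) error {
	if db.volumes[handle] {
		return fmt.Errorf("volume %s already exists", handle)
	}
	db.volumes[handle] = true
	return nil
}

func (db *memDB) FindVolume(handle string) bool { return db.volumes[handle] }

func main() {
	var db VolumeDB = newMemDB()
	_ = db.CreateVolume("v1")
	fmt.Println(db.FindVolume("v1"), db.FindVolume("v2"))
}
```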

....but anyway, that's for another PR, this one's already big enough!

@taylorsilva (Member) left a comment:

one nit about the one test's structure, but I'm gonna approve this anyways. Up to you if you wanna change it or not. Thanks for adding the test coverage :)

(resolved review thread on atc/worker/artifact_source_test.go)
@aoldershaw aoldershaw merged commit 1230c64 into master Jan 25, 2021
@aoldershaw aoldershaw deleted the worker-client-pool branch January 25, 2021 21:54