stateful deployments: find feasible node for sticky host volumes #24558
base: dynamic-host-volumes
Conversation
scheduler/feasible.go (Outdated):

```go
// belonging to the job in question that are sticky, have the
// volume IDs that match what the node offers and should not
// end up in this check?
allocs, err := h.ctx.ProposedAllocs(n.ID)
```
The only other place we use `ctx.ProposedAllocs` for feasibility checking is in the distinct hosts checker, because in that case we need to know about all the allocations on the node, including those that might already exist. Do we really want to look at the old allocs here?
Right. I'm just wondering what's the best way to get the full allocation data from here? `hasVolumes` has access to the state store, but it's unaware of allocation ID or namespace. I could just get all the allocs from the state store and filter by node/terminal status, etc.? I thought perhaps looking at `ProposedAllocs` would be simpler.
True, these allocs are full `structs.Allocation` values and include the proposed allocs that aren't in the state store yet, so I guess we do want to use this list as our starting point. But I'm pretty sure this list includes allocs that don't belong to the job, which might not have volume requests at all. Maybe that's ok as long as we're checking for the empty slice?
Having a test case that covers unrelated allocs or allocs that need different volumes would go a long way towards building our confidence that this is the right approach.
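To make the filtering discussed above concrete, here's a minimal sketch of narrowing a proposed-allocation list down to the volume IDs that matter. All type and field names are illustrative stand-ins, not Nomad's actual structs:

```go
package main

import "fmt"

// alloc is a hypothetical, heavily simplified stand-in for structs.Allocation.
type alloc struct {
	jobID     string
	terminal  bool
	volumeIDs []string // host volume IDs this allocation has claimed
}

// claimedVolumeIDs keeps only the host volume IDs claimed by live
// allocations of one job. Allocations from other jobs, or ones with no
// volume claims, contribute nothing, so an empty result is a usable
// "no previous claim" signal.
func claimedVolumeIDs(allocs []*alloc, jobID string) []string {
	var ids []string
	for _, a := range allocs {
		if a.jobID != jobID || a.terminal {
			continue
		}
		ids = append(ids, a.volumeIDs...)
	}
	return ids
}

func main() {
	allocs := []*alloc{
		{jobID: "web", volumeIDs: []string{"vol-1"}},
		{jobID: "batch", volumeIDs: []string{"vol-9"}},               // unrelated job
		{jobID: "web", terminal: true, volumeIDs: []string{"vol-0"}}, // stopped
	}
	fmt.Println(claimedVolumeIDs(allocs, "web")) // prints [vol-1]
}
```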
4b3f06c introduces a new approach, based on allocation IDs.
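An allocation-ID-driven check could look roughly like the sketch below. Names and types are hypothetical; the real logic lives in `HostVolumeChecker.hasVolumes` in scheduler/feasible.go:

```go
package main

import "fmt"

// stickyNode is a hypothetical, trimmed-down stand-in for structs.Node.
type stickyNode struct {
	hostVolumes map[string]string // volume name -> volume ID the node offers
}

// stickyFeasible reports whether a node can take the request. A sticky
// request with previously claimed IDs is only feasible on a node offering
// one of those IDs; with no claim yet, any node offering the volume works.
func stickyFeasible(n *stickyNode, source string, sticky bool, claimed []string) bool {
	id, ok := n.hostVolumes[source]
	if !ok {
		return false
	}
	if !sticky || len(claimed) == 0 {
		return true
	}
	for _, c := range claimed {
		if c == id {
			return true
		}
	}
	return false
}

func main() {
	n := &stickyNode{hostVolumes: map[string]string{"data": "vol-1"}}
	fmt.Println(stickyFeasible(n, "data", true, []string{"vol-1"})) // true: claimed here
	fmt.Println(stickyFeasible(n, "data", true, []string{"vol-2"})) // false: claimed elsewhere
}
```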
```go
name   string
node   *structs.Node
expect bool
}{
```
We should include a case for a sticky volume request that hasn't been previously claimed.
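A condensed, hypothetical version of such a table-driven case list, including the "never previously claimed" case, might look like this (the predicate is a simplified stand-in for the checker under test, not Nomad's real code):

```go
package main

import "fmt"

// caseNode is an illustrative stand-in for structs.Node in the test table.
type caseNode struct {
	volumes map[string]string // volume name -> ID
}

// caseFeasible is a simplified stand-in predicate for the checker under test.
func caseFeasible(n *caseNode, source string, claimed []string) bool {
	id, ok := n.volumes[source]
	if !ok {
		return false
	}
	if len(claimed) == 0 {
		return true // sticky but never claimed: first placement is unconstrained
	}
	for _, c := range claimed {
		if c == id {
			return true
		}
	}
	return false
}

func main() {
	cases := []struct {
		name    string
		node    *caseNode
		claimed []string
		expect  bool
	}{
		{"sticky, never claimed", &caseNode{map[string]string{"data": "vol-1"}}, nil, true},
		{"sticky, claimed on this node", &caseNode{map[string]string{"data": "vol-1"}}, []string{"vol-1"}, true},
		{"sticky, claimed elsewhere", &caseNode{map[string]string{"data": "vol-2"}}, []string{"vol-1"}, false},
	}
	for _, tc := range cases {
		if got := caseFeasible(tc.node, "data", tc.claimed); got != tc.expect {
			fmt.Printf("FAIL %s: got %v, want %v\n", tc.name, got, tc.expect)
		}
	}
}
```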
Co-authored-by: Tim Gross <[email protected]>
@tgross I addressed your comments. Here's what we do now:
I added some tests and I think this is going the right direction. What do you think?
I think this is starting to come together @pkazmierczak. If you can, it'd be worth taking a moment to build it and do some bench testing with real deployments, alloc failures, etc. to make sure we're not missing cases where the volumes should be getting handed off.
```diff
 } else {
-	h.volumeReqs = append(h.volumeReqs, req)
+	h.volumeReqs = append(h.volumeReqs, &allocVolumeRequest{hostVolumeIDs: allocHostVolumeIDs, volumeReq: req})
```
If we're assigning the same `[]string` to every request, why not have it on the `HostVolumeChecker` instead of on individual requests like this? It's only copying around the trio of pointers for a slice one or two extra times, but it grabs my attention as something I'm maybe missing?
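The suggestion above could be sketched roughly like this: keep the claimed-ID slice on the checker itself, assigned once per task group, instead of copying it into every volume request. All names here are illustrative, not Nomad's actual API:

```go
package main

import "fmt"

// checkerVolumeReq is a hypothetical, simplified volume request.
type checkerVolumeReq struct {
	source string
	sticky bool
}

// hostVolumeChecker holds the shared claimed-ID slice once, rather than
// duplicating it on each request.
type hostVolumeChecker struct {
	volumeReqs []*checkerVolumeReq
	claimedIDs []string // shared by all requests in this task group
}

// setVolumes resets the checker for a task group: requests are collected as
// before, and the claimed IDs are assigned exactly once.
func (h *hostVolumeChecker) setVolumes(reqs []*checkerVolumeReq, claimedIDs []string) {
	h.volumeReqs = append(h.volumeReqs[:0], reqs...)
	h.claimedIDs = claimedIDs
}

func main() {
	h := &hostVolumeChecker{}
	h.setVolumes([]*checkerVolumeReq{{source: "data", sticky: true}}, []string{"vol-1"})
	fmt.Println(len(h.volumeReqs), h.claimedIDs) // prints: 1 [vol-1]
}
```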
```diff
@@ -349,7 +351,7 @@ func (s *SystemStack) Select(tg *structs.TaskGroup, options *SelectOptions) *Ran
 	s.taskGroupDrivers.SetDrivers(tgConstr.drivers)
 	s.taskGroupConstraint.SetConstraints(tgConstr.constraints)
 	s.taskGroupDevices.SetTaskGroup(tg)
-	s.taskGroupHostVolumes.SetVolumes(options.AllocName, s.jobNamespace, tg.Volumes)
+	s.taskGroupHostVolumes.SetVolumes(options.AllocName, s.jobNamespace, tg.Volumes, options.AllocationHostVolumeIDs)
```
This raises a question: is "sticky" even meaningful for system and sysbatch jobs? It is, because of CSI volumes; nevermind 😁
This changeset implements node feasibility checks for sticky host volumes.