
[💡FEATURE REQUEST]: Adaptive scaling of the workers #97

Open
stefanos82 opened this issue Jan 17, 2019 · 23 comments · May be fixed by roadrunner-server/pool#12

@stefanos82

With PHP-FPM we have three options: static, dynamic, and ondemand.

Can we accomplish such a thing with rr? I don't think it makes sense to waste resources when a website is more or less idle; it should be able to limit its workers to the lowest level possible, for obvious reasons.

Thoughts and/or suggestions?

@Alex-Bond
Contributor

@stefanos82 hi! The idea of RR is that we keep as much code as possible in memory and don't have to bootstrap the system on each request.
PHP-FPM works differently: it spawns workers that do nothing until you make a request, and it destroys the worker afterwards.
From my perspective, it doesn't make sense to create a dynamic number of workers, because it would defeat the main purpose of the system.

@wolfy-j
Contributor

wolfy-j commented Jan 18, 2019

This is possible and has been planned from the beginning (that's why the worker pool is named StaticPool and is accessed through an interface in Server).

The only tricky part of this feature is properly defining the scaling logic for pushing/pulling workers to/from the allocation channel (a toy illustration follows below). We are very actively discussing this feature internally, because if it's done wrong, the effect on the application can be very harmful.
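
As a toy illustration of what pushing/pulling workers to and from the allocation channel means (invented for this comment, not RR's actual code):

package main

import "fmt"

type worker struct{ id int }

func main() {
    // The allocation channel holds idle workers.
    free := make(chan *worker, 8)

    // Scale up: push a new worker into the channel.
    free <- &worker{id: 1}

    // Serve a request: pull a worker, do the job, return it.
    w := <-free
    fmt.Println("handled by worker", w.id)
    free <- w

    // Scale down: pull a worker and retire it instead of returning it.
    <-free
}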

@wolfy-j wolfy-j added the C-enhancement Category: enhancement. Meaning improvements of current module, transport, etc.. label Jan 18, 2019
@stefanos82
Author

Hmmm, I see.

Well, since we care about performance, maybe the number of workers should be derived from the number of CPU cores? That's also an option, I would say.

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Print the number of logical CPUs usable by the current process.
    fmt.Println(runtime.NumCPU())
}

@wolfy-j
Contributor

wolfy-j commented Jan 18, 2019

I can only set it as the default value for the pool.numWorkers option.

@stefanos82
Author

OK, what if I would like to increase or decrease the number of workers on the fly? Is there any hotkey for this option?

@wolfy-j
Contributor

wolfy-j commented Jan 18, 2019

Currently this API is not exposed, but it is possible to reconfigure the pool with a different configuration: https://github.com/spiral/roadrunner/blob/master/server.go#L113
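
For illustration only, a toy sketch of what such a runtime reconfiguration might look like; the types below are stand-ins invented for this example (the real ones live in the linked server.go and differ in detail):

package main

import "log"

// Stand-in types, invented for illustration.
type PoolConfig struct{ NumWorkers int64 }
type ServerConfig struct{ Pool PoolConfig }
type Server struct{ cfg ServerConfig }

// Reconfigure swaps the pool configuration; the real method re-creates
// the worker pool without stopping the HTTP listener.
func (s *Server) Reconfigure(cfg ServerConfig) error {
    s.cfg = cfg
    return nil
}

func main() {
    srv := &Server{cfg: ServerConfig{Pool: PoolConfig{NumWorkers: 4}}}

    // Scale the pool up at runtime instead of restarting the server.
    next := srv.cfg
    next.Pool.NumWorkers = 16
    if err := srv.Reconfigure(next); err != nil {
        log.Fatal(err)
    }
}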

@stefanos82
Author

You mean to use Reconfigure... when exactly? I'm referring to runtime.

If, for instance, I run rr and realize after a while that my traffic needs more workers.

Am I forced to stop it, increase the number of workers in .rr.yaml, and then restart it?

@wolfy-j
Contributor

wolfy-j commented Jan 18, 2019

Currently yes. Reconfigure is what is used by http:reset, which you can call at runtime (without stopping the server). I guess I can add a flag to alter the number of workers for this function.
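
For reference, that reset is invoked from the CLI while the server keeps running; the command below is taken from the comment above (assuming the v1 CLI):

# reload the PHP workers at runtime, without stopping the server
rr http:reset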

@stefanos82
Author

That would be more than awesome.

@wolfy-j
Contributor

wolfy-j commented Jan 25, 2019

After a couple of intense internal discussions (thank you @ValeryPiashchynski, Andrew M, Alexei N, and @vvval) we have come up with a plan to add a basic balancing mechanism based on two derivative metrics: allocation time and processing time. Though more metrics can be added in the future, these two should cover a lot of possible use cases, such as many fast queries, a small number of large queries, and so on.

If anyone has anything to share regarding adaptive scaling algorithms, we are glad to listen.

@stefanos82
Author

It would be very helpful if you could expand on this, much like a case study: what led you to choose these two derivative metrics, and so forth.

I could investigate it and see whether there is a better alternative that could be applied.

@wolfy-j
Contributor

wolfy-j commented Jan 25, 2019

Well, because they are both derivative :) Each metric depends on CPU load, number of connections, processing time, etc.

  • allocation time = how long you have to wait to get a free worker (window average).
  • processing time = how long you have to wait to get your job done (window average).

In theory, even one of these metrics should carry enough information to scale the system up and down.

I will try to explain a couple of scenarios (green = processing time, orange = allocation time):

[chart: processing time (green) and allocation time (orange) over time]

  1. processing time is high but allocation time is low
    the system is accepting heavy requests at a low/medium rate; no need to scale

  2. processing time is high, allocation time is growing
    the system is accepting more heavy requests (in number) than previously; it's a good time to scale up

  3. processing time is high, allocation time is high
    the system is accepting heavy requests at a high rate. The system can only scale here if CPU/memory is available; otherwise, the system is saturated.

-- please do not consider this whole chart as the timeline for the app, it's an example --

  4. processing time is low, allocation time is high
    the system is accepting a lot of small requests at a high rate; we can scale up if CPU/memory is available.

  5. low processing time, low allocation time
    you are running a hello world application. :)

This is not final; we are still having discussions and are open to suggestions, this being our first shot (before the implementation). I'm ready to accept that there is a fatal flaw in this logic; however, these metrics are easy to retrieve and process, so they look promising for a first version of the adaptive scaling mechanism. A toy decision function based on them is sketched below.

Clearly, both metrics should be used in combination with CPU and memory stats, min/max boundaries, and proper hysteresis logic. We also have to consider the cost of worker creation.
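
To make this concrete, here is a toy Go sketch of such a decision function, driven by a windowed allocation-time average; all names and thresholds are invented for illustration, and a real version would fold in CPU/memory stats and hysteresis as described above:

package main

import "fmt"

// Illustrative bounds and thresholds; real values would come from configuration.
const (
    minWorkers = 2
    maxWorkers = 32
    highAlloc  = 100.0 // ms: requests are queueing for free workers
    lowAlloc   = 10.0  // ms: workers are mostly idle
)

// window keeps a sliding-window average of a metric, in milliseconds.
type window struct {
    samples []float64
    size    int
}

func (w *window) add(v float64) {
    w.samples = append(w.samples, v)
    if len(w.samples) > w.size {
        w.samples = w.samples[1:]
    }
}

func (w *window) avg() float64 {
    if len(w.samples) == 0 {
        return 0
    }
    sum := 0.0
    for _, v := range w.samples {
        sum += v
    }
    return sum / float64(len(w.samples))
}

// decide maps the scenarios above onto a desired worker count.
// Allocation time drives scaling; high processing time alone
// (scenario 1) does not trigger it.
func decide(workers int, allocAvg float64, resourcesFree bool) int {
    switch {
    case allocAvg > highAlloc && resourcesFree && workers < maxWorkers:
        return workers + 1 // scenarios 2 and 4: scale up
    case allocAvg < lowAlloc && workers > minWorkers:
        return workers - 1 // scenario 5: scale down
    default:
        return workers // scenario 1, scenario 3 (saturated), or in between
    }
}

func main() {
    alloc := &window{size: 50}
    for _, ms := range []float64{120, 150, 130} {
        alloc.add(ms)
    }
    fmt.Println(decide(4, alloc.avg(), true)) // prints 5: allocation time is high
}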

@wolfy-j wolfy-j added performance and removed C-enhancement Category: enhancement. Meaning improvements of current module, transport, etc.. labels Jan 25, 2019
@wolfy-j wolfy-j changed the title PHP workers much like PHP-FPM? Adaptive workers scaling Jan 25, 2019
@wolfy-j wolfy-j changed the title Adaptive workers scaling Adaptive scaling of the workers Jan 25, 2019
@wolfy-j
Contributor

wolfy-j commented Jan 25, 2019

I believe we can also calculate the 2nd derivative to build better prediction logic, but this is not the type of rabbit hole I would like to jump into... yet.

@wolfy-j wolfy-j changed the title Adaptive scaling of the workers [research] Adaptive scaling of the workers Jan 25, 2019
@stefanos82
Author

Very informative. Now I have a clearer view of the whole thing, thank you.

@wolfy-j wolfy-j changed the title [research] Adaptive scaling of the workers Adaptive scaling of the workers Jan 27, 2019
@stefanos82
Author

stefanos82 commented Jan 29, 2019

This article could be used as a source of inspiration: Building a Worker Pool in Golang

It may not demonstrate exactly what I have suggested, but the concept of dynamism is demonstrated in it.

Please bear in mind that there is a high possibility that I'm wrong about the article's content and that I most probably have misunderstood its concept.

Nevertheless, it's a very informative article that makes you appreciate the use of channels.

@rustatian rustatian added B-performance Bug: performance issues and removed performance labels Feb 15, 2020
@rustatian rustatian added this to the unplanned milestone Feb 19, 2020
@stefanos82
Author

@wolfy-j It's not visible for me, I'm afraid.

Can you take a screenshot and paste it here please?

@rustatian rustatian modified the milestones: 2.0.0, next Jan 12, 2021
@rustatian rustatian changed the title Adaptive scaling of the workers [RR2] Adaptive scaling of the workers Jan 12, 2021
@rustatian rustatian self-assigned this Feb 16, 2021
@rustatian rustatian added help-needed-needs-mcve Call for participation: This issue needs a Minimal Complete and Verifiable Example Y-Low Priority: Low labels Feb 16, 2021
@rustatian rustatian removed this from the next milestone Mar 28, 2021
@rustatian
Member

rustatian commented Jun 6, 2021

Idea:

  1. Measure the time in the ServeHTTP function. That will be the time when a request arrived.
  2. Measure the exit time, when the worker is released.

The difference between those timers will tell us how long the request waited for actual execution (we may also subtract the execution time, during which the worker was busy executing the request).
In the configuration, the user will be able to set a threshold value, as well as the max number of workers, a cooldown timeout, and a step.
For example:

  1. Request arrived -> start the timer.
  2. WorkerWatcher released a worker for the request.
  3. Actual execution in the Pool <-- we should subtract this time, because we need the RR processing time, not the PHP execution time.
  4. Worker returned to the WorkerWatcher -> stop the timer.

Results (example): 3 seconds waiting for the worker, 500ms of actual work in the PHP worker: 3 - 0.5 = 2.5s (data science here); threshold: 1s --> decision: allocate a worker (or 2-3-5 according to the step, but no more than max). We also need to handle negative time for the case when, after scaling, waiting for the worker becomes shorter than the actual execution.

Next request: 2 seconds waiting for the worker (the time reduced, for example), 500ms execution --> decision: allocate a worker.
Next request: 1 second. Smaller than the threshold specified in the configuration: skip.
Next request: 500ms. Smaller than the threshold, and the cooldown timeout has expired --> decision: deallocate a worker (by the step). A sketch of this logic follows below.
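
For illustration, a minimal Go sketch of that arithmetic (all names are invented; this is not RR's implementation): wait = total - execution, compared against the threshold, honouring the step, the max workers cap, and the cooldown before deallocation:

package main

import (
    "fmt"
    "time"
)

// Illustrative knobs mirroring the options described above.
type scaleCfg struct {
    threshold  time.Duration // e.g. 1s of waiting triggers a scale-up
    step       int           // workers to (de)allocate per decision
    maxWorkers int
    cooldown   time.Duration // waits must stay below threshold this long to scale down
}

// decide: total is the full ServeHTTP timer, execution is the PHP worker time,
// lowFor is how long waits have already stayed below the threshold.
func decide(cfg scaleCfg, workers int, total, execution, lowFor time.Duration) (int, string) {
    wait := total - execution
    if wait < 0 {
        wait = 0 // after scaling, waiting can be shorter than execution
    }
    switch {
    case wait > cfg.threshold && workers < cfg.maxWorkers:
        n := workers + cfg.step
        if n > cfg.maxWorkers {
            n = cfg.maxWorkers
        }
        return n, "allocate"
    case wait <= cfg.threshold && lowFor >= cfg.cooldown && workers > cfg.step:
        return workers - cfg.step, "deallocate"
    default:
        return workers, "skip"
    }
}

func main() {
    cfg := scaleCfg{threshold: time.Second, step: 1, maxWorkers: 16, cooldown: 30 * time.Second}
    // 3s total, 500ms execution -> 2.5s wait > 1s threshold -> allocate.
    fmt.Println(decide(cfg, 4, 3*time.Second, 500*time.Millisecond, 0))
}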

@rustatian rustatian moved this to Backlog in General Jan 15, 2022
@rustatian rustatian changed the title [RR2] Adaptive scaling of the workers [💡FEATURE REQUEST]: Adaptive scaling of the workers Feb 17, 2022
@rustatian rustatian added this to the v2023.1 milestone Jan 13, 2023
@rustatian rustatian added Y-Medium Priority: Medium and removed Y-Low Priority: Low labels Jan 13, 2023
@rustatian rustatian removed this from the v2023.1.0 milestone Mar 12, 2023
@Kaspiman

Hello! How is work progressing on this feature? I see that the "v2023.1.0" label has been removed. Is the feature still planned?

@rustatian
Member

Not planned ATM. I'm still not sure RR should have this feature in the modern era of K8s and other orchestration tools that can scale pods on demand. On its side, RR provides the metrics needed to make scaling decisions (like queue size) for the orchestration tools.

@github-project-automation github-project-automation bot moved this from Backlog to Unreleased in General Jul 12, 2023
@rustatian rustatian moved this from Unreleased to Done in General Jul 13, 2023
@rustatian rustatian reopened this Nov 24, 2024
@rustatian
Member

Reopening: in the next release (v2024.3), RR will have a DynamicPool configuration in addition to the static pool. Via this configuration, you may specify a max dynamic workers count, an idle timeout (if the no-free-workers condition isn't triggered during this timeout, all dynamically allocated workers will be gracefully deallocated), plus an allocation step (spawn rate).

How it works:
Currently, the pool.allocate_timeout option is responsible for the worker-waiting timeout. After this timeout, the request is dropped with a NoFreeWorkers error. With the dynamic pool, RR instead allocates additional workers: some of them go to the pool (up to max_workers) and one handles the request. Any feedback on how this feature should work is highly appreciated.
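
For reference, a hypothetical .rr.yaml snippet matching that description; the dynamic-pool key names are guesses from the text above, not confirmed against the v2024.3 docs:

http:
  pool:
    num_workers: 4            # static baseline
    allocate_timeout: 60s
    dynamic_allocator:        # hypothetical section name
      max_workers: 25         # cap for dynamically spawned workers
      spawn_rate: 10          # allocation step: workers added at once
      idle_timeout: 10s       # deallocate dynamic workers after this idle period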

@rustatian rustatian moved this to 🏗 In progress in Jira 😄 Nov 24, 2024
@rustatian rustatian added this to the v2024.3 milestone Nov 24, 2024
@Kaspiman

Wow, what a gift!

@rustatian rustatian linked a pull request Nov 27, 2024 that will close this issue