Autoscaling your FastAPI apps has significant benefits like improved performance, optimized resource usage, and cost-effectiveness. It lets you efficiently manage traffic spikes and varying loads, ensuring your application remains responsive at all times.
The fastapi-serve library has built-in support for autoscaling based on CPU usage. You can configure the CPU threshold for scaling up and down, as well as the minimum and maximum number of replicas, by specifying them in a jcloud.yml file and passing it to the deploy command via the --config flag.
# jcloud.yml
instance: C3
autoscale:
  min: 1
  max: 2
  metric: cpu
  target: 40
The above configuration will scale the app up to a maximum of 2 replicas when CPU usage exceeds 40%, and scale it down to 1 replica when the CPU usage falls below 40%.
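To make the scaling rule concrete, here is a simplified Python sketch of the standard CPU-target computation (the form used by, for instance, the Kubernetes HPA). It is illustrative only; JCloud's actual autoscaler internals may differ.

# Simplified CPU-target scaling rule, clamped to the configured min/max.
# Illustrative only — JCloud's actual autoscaler may behave differently.
import math

def desired_replicas(current: int, cpu_percent: float,
                     target: float = 40,
                     min_replicas: int = 1,
                     max_replicas: int = 2) -> int:
    desired = math.ceil(current * cpu_percent / target)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(1, cpu_percent=80))  # -> 2 (scale up)
print(desired_replicas(2, cpu_percent=15))  # -> 1 (scale down)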
Let's look at an example of how to auto-scale a FastAPI app based on CPU usage.
This directory contains the following files:
.
├── main.py       # The FastAPI app
├── jcloud.yml    # JCloud deployment config with the autoscaling config
└── README.md     # This README file
# main.py
import os
import time

from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()


class Response(BaseModel):
    cpu_time: float
    result: int
    # The pod hostname tells us which replica served the request.
    hostname: str = Field(default_factory=lambda: os.environ.get("HOSTNAME", "unknown"))


def _heavy_compute(count: int) -> int:
    # CPU-bound busy work: sum the integers 0..count-1 in a plain Python loop.
    total = 0
    for i in range(count):
        total += i
    return total


@app.get("/load/{count}", response_model=Response)
def load_test(count: int):
    # `count` comes from the path; larger values burn more CPU.
    t1 = time.time()
    result = _heavy_compute(count)
    cpu_time = time.time() - t1
    print(f"CPU time: {cpu_time}")
    return Response(cpu_time=cpu_time, result=result)
In the above example, we have a /load endpoint that performs a CPU-intensive task. We will use this endpoint to simulate a CPU-intensive workload.
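Before deploying, it's worth a quick local sanity check. Below is a minimal sketch using FastAPI's TestClient; the test file itself is our addition and not part of the example directory.

# test_local.py — local sanity check for the /load endpoint (our addition).
# Assumes main.py is importable from the current directory.
from fastapi.testclient import TestClient

from main import app

client = TestClient(app)

response = client.get("/load/1000000")
assert response.status_code == 200
body = response.json()
assert body["result"] == 499999500000  # sum of 0..999,999
print(f"cpu_time: {body['cpu_time']:.4f}s on host {body['hostname']}")

With that passing, let's deploy the app: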
fastapi-serve deploy jcloud main:app
╭──────────────────────────┬──────────────────────────────────────────────────────╮
│ App ID                   │ fastapi-2a94b25a5f                                   │
├──────────────────────────┼──────────────────────────────────────────────────────┤
│ Phase                    │ Serving                                              │
├──────────────────────────┼──────────────────────────────────────────────────────┤
│ Endpoint                 │ https://fastapi-2a94b25a5f.wolf.jina.ai              │
├──────────────────────────┼──────────────────────────────────────────────────────┤
│ App logs                 │ https://cloud.jina.ai/                               │
├──────────────────────────┼──────────────────────────────────────────────────────┤
│ Base credits (per hour)  │ 10.104 (Read about pricing here)                     │
├──────────────────────────┼──────────────────────────────────────────────────────┤
│ Swagger UI               │ https://fastapi-2a94b25a5f.wolf.jina.ai/docs         │
├──────────────────────────┼──────────────────────────────────────────────────────┤
│ OpenAPI JSON             │ https://fastapi-2a94b25a5f.wolf.jina.ai/openapi.json │
╰──────────────────────────┴──────────────────────────────────────────────────────╯
Let's send a few requests to the /load endpoint to simulate a not-so-intense workload.
curl -sX GET https://fastapi-2a94b25a5f.wolf.jina.ai/load/1000000 | jq
{
"cpu_time": 0.4925811290740967,
"result": 499999500000,
"hostname": "gateway-00001-deployment-85589655bb-pn7b4"
}
This finishes in about half a second. Let's send one request with an intense workload.
curl -sX GET https://fastapi-2a94b25a5f.wolf.jina.ai/load/10000000000 | jq
While the request is being processed, you can see the CPU usage in the CPU graph. It will go above 40%, and the app will be scaled up to 2 replicas. Meanwhile, let's open another terminal and send a few more requests to the /load endpoint in a loop.
for i in {1..1000}; do curl -sX GET https://fastapi-2a94b25a5f.wolf.jina.ai/load/1000000 | jq; sleep 0.5; done
Eventually, you will see that requests are being served by 2 replicas (indicated by the hostname field in the response).
{
"cpu_time": 0.11650848388671875,
"result": 499999500000,
"hostname": "gateway-00001-deployment-85589655bb-pn7b4"
}
{
"cpu_time": 0.1402430534362793,
"result": 499999500000,
"hostname": "gateway-00001-deployment-85589655bb-gr6sc"
}
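If you prefer Python to a shell loop, here is an illustrative load generator that also tallies which replica served each response. The endpoint URL is the one from the deployment above; the worker and request counts are arbitrary choices.

# load_gen.py — illustrative concurrent load generator (our addition).
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://fastapi-2a94b25a5f.wolf.jina.ai/load/1000000"

def fetch(_: int) -> str:
    # Return the hostname of the replica that served this request.
    return requests.get(URL, timeout=120).json()["hostname"]

with ThreadPoolExecutor(max_workers=8) as pool:
    per_replica = Counter(pool.map(fetch, range(100)))

for host, n in per_replica.most_common():
    print(f"{host}: {n} responses")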
Note: You might see a message saying "The upstream server is timing out" during long-running requests. This can be configured with the timeout field in the jcloud.yml file. By default, requests time out after 120 seconds.
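On the client side, make sure your own timeout is at least as generous as the upstream one, or you will give up before the server does. A minimal sketch with the requests library; the 180-second value is an arbitrary example.

# Assumes the upstream timeout has been raised above the default 120 s.
import requests

resp = requests.get(
    "https://fastapi-2a94b25a5f.wolf.jina.ai/load/10000000000",
    timeout=180,  # client-side timeout in seconds (arbitrary example value)
)
print(resp.json())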
To view the CPU usage, go to Jina AI Cloud, click on the fastapi-2a94b25a5f app, and then open the Charts tab. You can see the CPU usage in the CPU graph.
As we've seen in this example, CPU-based autoscaling can be a game changer for FastAPI applications. It helps you manage resources efficiently, handle traffic spikes, and keep your application responsive under heavy workloads. fastapi-serve makes it straightforward to leverage autoscaling, helping you build highly scalable, efficient, and resilient FastAPI applications with ease. Embrace the power of autoscaling with fastapi-serve today!