💹 Auto-Scaling Endpoints Based on CPU Usage

Autoscaling your FastAPI apps has significant benefits like improved performance, optimized resource usage, and cost-effectiveness. It lets you efficiently manage traffic spikes and varying loads, ensuring your application remains responsive at all times.

The fastapi-serve library has built-in support for auto-scaling based on CPU usage. You can set the CPU target and the minimum and maximum number of replicas in a jcloud.yml file and pass it to the deployment with the --config flag.

# jcloud.yml
instance: C3
autoscale:
  min: 1
  max: 2
  metric: cpu
  target: 40

The above configuration will scale the app up to a maximum of 2 replicas when CPU usage exceeds 40%, and scale it down to 1 replica when the CPU usage falls below 40%.
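
Assuming jcloud.yml sits next to your app, you can pass it to the deployment explicitly with the --config flag mentioned above (a sketch; the deploy command itself appears later in this guide):

fastapi-serve deploy jcloud main:app --config jcloud.yml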

Let's look at an example of how to auto-scale a FastAPI app based on CPU usage.

📈 Deploy a FastAPI app with auto-scaling based on CPU usage

This directory contains the following files:

.
├── main.py             # The FastAPI app
├── jcloud.yml          # JCloud deployment config with the autoscaling config
└── README.md           # This README file

# main.py
import os
import time

from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()

class Response(BaseModel):
    cpu_time: float
    result: int
    # HOSTNAME identifies the replica (pod) that served the request.
    hostname: str = Field(default_factory=lambda: os.environ.get("HOSTNAME", "unknown"))

def _heavy_compute(count):
    # Deliberately CPU-bound: sum the integers 0..count-1 in pure Python.
    total = 0
    for i in range(count):
        total += i
    return total

@app.get("/load/{count}", response_model=Response)
def load_test(count: int):
    # Measure the wall-clock time spent in the CPU-bound loop.
    _t1 = time.time()
    _sum = _heavy_compute(count)
    _t2 = time.time()
    _cpu_time = _t2 - _t1
    print(f"CPU time: {_cpu_time}")
    return Response(cpu_time=_cpu_time, result=_sum)

In the above example, the /load/{count} endpoint performs a CPU-intensive task. We will use it to simulate a CPU-heavy workload.
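
Before deploying, you can sanity-check the endpoint locally. A minimal sketch, assuming uvicorn is installed and you run it from this directory:

# in one terminal
uvicorn main:app --port 8080
# in another terminal
curl -s http://localhost:8080/load/1000000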

🚀 Deploying to Jina AI Cloud

fastapi-serve deploy jcloud main:app
╭─────────────────────────┬─────────────────────────────────────────────────────────────────────────────╮
│ App ID                  │                             fastapi-2a94b25a5f                              │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│ Phase                   │                                   Serving                                   │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│ Endpoint                │                   https://fastapi-2a94b25a5f.wolf.jina.ai                   │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│ App logs                │                           https://cloud.jina.ai/                            │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│ Base credits (per hour) │                      10.104 (Read about pricing here)                       │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│ Swagger UI              │                https://fastapi-2a94b25a5f.wolf.jina.ai/docs                 │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│ OpenAPI JSON            │            https://fastapi-2a94b25a5f.wolf.jina.ai/openapi.json             │
╰─────────────────────────┴─────────────────────────────────────────────────────────────────────────────╯

💻 Testing

Let's send a few requests to the /load endpoint to simulate a not-so-intense workload.

curl -sX GET https://fastapi-2a94b25a5f.wolf.jina.ai/load/1000000 | jq
{
  "cpu_time": 0.4925811290740967,
  "result": 499999500000,
  "hostname": "gateway-00001-deployment-85589655bb-pn7b4"
}

This finishes in roughly half a second (the cpu_time field above is in seconds). Let's send one request with an intense workload.

curl -sX GET https://fastapi-2a94b25a5f.wolf.jina.ai/load/10000000000 | jq

While the request is being processed, you can watch the CPU usage rise in the CPU graph. Once it goes above 40%, the app will be scaled up to 2 replicas. Meanwhile, let's open another terminal and send a few more requests to the /load endpoint in a loop.

for i in {1..1000}; do curl -sX GET https://fastapi-2a94b25a5f.wolf.jina.ai/load/1000000 | jq; sleep 0.5; done

Eventually, you will see that requests are being served by 2 replicas, as indicated by the hostname field in the responses.

{
  "cpu_time": 0.11650848388671875,
  "result": 499999500000,
  "hostname": "gateway-00001-deployment-85589655bb-pn7b4"
}
{
  "cpu_time": 0.1402430534362793,
  "result": 499999500000,
  "hostname": "gateway-00001-deployment-85589655bb-gr6sc"
}
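
If you prefer Python over a shell loop, here is a minimal sketch that fires requests concurrently and tallies which replica served each one. The script name and helper function are hypothetical, and it assumes the requests package is installed:

# tally_replicas.py (hypothetical helper, not part of this directory)
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://fastapi-2a94b25a5f.wolf.jina.ai/load/1000000"

def fetch_hostname(_):
    # Each response carries the hostname of the replica that served it.
    return requests.get(URL, timeout=120).json()["hostname"]

with ThreadPoolExecutor(max_workers=8) as pool:
    counts = Counter(pool.map(fetch_hostname, range(100)))

# Two distinct hostnames means both replicas are taking traffic.
print(counts)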

Note: You might see a message saying "The upstream server is timing out" during long-running requests. This can be configured with the timeout field in the jcloud.yml file. By default, requests will time out after 120 seconds.
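
A minimal sketch of raising that limit; the exact placement of the timeout field is an assumption here, so check the fastapi-serve docs for the precise schema:

# jcloud.yml
instance: C3
timeout: 600  # assumed to be in seconds; the default is 120
autoscale:
  min: 1
  max: 2
  metric: cpu
  target: 40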

📊 Observe the CPU usage

To view the CPU usage, go to Jina AI Cloud, click on the fastapi-2a94b25a5f app, and open the Charts tab. You can see the CPU usage in the CPU graph.

Example CPU Usage

🎯 Wrapping Up

As we've seen in this example, CPU-based autoscaling can be a game changer for FastAPI applications: it helps you manage resources efficiently, handle traffic spikes, and keep your application responsive under heavy workloads. fastapi-serve makes it straightforward to leverage autoscaling, so you can build scalable, efficient, and resilient FastAPI applications with ease. Embrace the power of autoscaling with fastapi-serve today!