This experiment allows you to find the optimal core and worker-thread configuration that delivers the best throughput while guaranteeing response time for latency-sensitive workloads.
The experiment helps the user find the optimal number of memcached worker threads (typically set with the `-t` flag for memcached) by running memcached as the latency-sensitive workload with various worker-thread configurations.
Download and execute the script located at `vagrant/provision_experiment_environment.sh`.
The script will download all the necessary binaries and install Snap, Kubernetes, etcd, Docker, and the best-effort workloads.
This parameter specifies the desired maximum capacity for memcached, expressed in queries per second. Peak load is the only required parameter; its value depends on the resources available. Consider the following formula as a rule of thumb:
peak_load = maximum number of threads dedicated to memcached * 100k
The maximum number of threads dedicated to memcached defaults to the number of physical cores available on the machine; it can be overridden with the `-max-threads` flag.
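As a concrete illustration of this rule of thumb, the sketch below uses illustrative values (an 8-core machine, matching the hardware described later in this document) to compute the value that would be passed to `-experiment_peak_load`:

```python
# Illustrative rule-of-thumb calculation (not part of the experiment binary):
# peak_load = maximum number of threads dedicated to memcached * 100k
max_threads = 8                    # default: number of physical cores; override with -max-threads
peak_load = max_threads * 100_000  # queries per second
print(peak_load)                   # 800000 -> -experiment_peak_load=800000
```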
- Run the experiment:
sudo optimal-core-allocation -experiment_peak_load=800000 > uuid.txt
- To display the results of the experiment, copy the experiment ID (just run `cat uuid.txt`) and pass it as a parameter to the Python functions in the Jupyter notebook:
from generic import optimal_core_allocation
optimal_core_allocation("ca24aa5d-4a88-7258-6f00-5f651b0d6515", slo=500) # 500us as latency SLO
Note that you need to specify your latency SLO in microseconds.
- (Optional) If you want to see more detailed results regarding latency and throughput:
from swan import Experiment, OptimalCoreAllocation
exp = Experiment("ca24aa5d-4a88-7258-6f00-5f651b0d6515")
core = OptimalCoreAllocation(exp, slo=500)
core.latency()
core.qps()
core.cpu() # available if the experiment was run with the USE Snap collector
This example shows how to run the experiment using a configuration file.
All experiment flags can be provided using a configuration file. The command-line flag `-foo_bar` corresponds to the `FOO_BAR` option in the configuration file.
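A minimal sketch of this naming convention is shown below; the helper function is hypothetical (it is not part of the experiment code) and only demonstrates how a flag name maps to its configuration-file option:

```python
# Hypothetical helper, only to illustrate the flag -> config-option naming rule:
# the command-line flag -foo_bar corresponds to the FOO_BAR option in the file.
def flag_to_config_option(flag: str) -> str:
    return flag.lstrip("-").upper()

print(flag_to_config_option("-foo_bar"))               # FOO_BAR
print(flag_to_config_option("-experiment_peak_load"))  # EXPERIMENT_PEAK_LOAD
```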
- Generate default configuration:
sudo optimal-core-allocation -config-dump >example-configuration.ini
- Modify configuration to meet your requirements:
$EDITOR example-configuration.ini
- Run experiment with your configuration:
sudo optimal-core-allocation -config example-configuration.ini
- Run the experiment overriding configuration file values with flags (it is often easier to change a few settings using command-line flags than to modify the configuration file):
sudo optimal-core-allocation -config example-configuration.ini -experiment_peak_load=800000 \
-cassandra_address=cassandra1 \
-experiment_mutilate_master_address=lg1 -experiment_mutilate_agent_addresses=lg2,lg3 \
-remote_ssh_login=username -remote_ssh_key_path=/home/username/.ssh/id_rsa
where:
- `cassandra1` is the address of the Cassandra database,
- `lg*` are the names of the hosts dedicated to running the load generator cluster,
- `remote_ssh_*` options point to the credentials (username and private key) used to deploy and run the load generator cluster.
- You can run the experiment with memcached threads pinned to a specified number of hardware threads (`-use-core-pinning` flag).
- You can run the experiment with a memcached patch that allows pinning each worker thread to a single CPU (`-memcached_threads_affinity` flag).
- You can run the experiment on an automatically provisioned Kubernetes cluster (`-kubernetes*` flags).
- 1 node for running memcached - 8 cores, single socket Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz with 32GB RAM,
- 9 nodes for load generator cluster (1 master node and 8 agents),
- Linux distribution: CentOS 7 with 4.10 Linux kernel,
The tables below show how memcached capacity changes as the number of worker threads grows.
Each cell displays the memcached tail latency (99th percentile latency). Colors indicate violation (or lack thereof) of the SLO (a small sketch of this classification follows the legend):
- green - no violation,
- yellow - tail latency between 101% and 150% of SLO,
- red - tail latency above 150% of SLO,
- gray - memcached was incapable of handling the requested QPS.
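The sketch below illustrates this legend as a small function; it is only an illustration (the function name and argument handling are assumptions, not the notebook's API):

```python
# Sketch of the color legend described above (illustrative, not the notebook's code).
def classify_cell(tail_latency_us, slo_us, handled=True):
    """Map a 99th-percentile latency to the legend colors for a given SLO."""
    if not handled:
        return "gray"    # memcached could not sustain the requested QPS
    if tail_latency_us <= slo_us:
        return "green"   # no violation
    if tail_latency_us <= 1.5 * slo_us:
        return "yellow"  # tail latency between 101% and 150% of SLO
    return "red"         # tail latency above 150% of SLO

print(classify_cell(450, 500))  # green
print(classify_cell(600, 500))  # yellow
print(classify_cell(900, 500))  # red
```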
There are two dimensions:
- load (x axis) - fraction of peak load expressed in QPS (Queries Per Second)
- worker threads (y axis) - number of memcached worker threads (check memcached -t option for details).
The experiment has been run with 10 load points, peak load set to 1.5 million QPS, and SLO set to 500 us.
You can observe that each additional memcached worker thread, up to 10 on the 16-CPU machine, adds throughput capacity. Taking the SLO into consideration, this configuration can handle 900,000 QPS while meeting the tail latency requirement (equal to or less than 500 us); you will be able to utilize 11 threads on the node. Increasing the number of threads further does not improve performance and, surprisingly (because of hyper-threading), can cause latency degradation.
After increasing the accepted tail latency to 3 ms, we can choose from a broader range of configurations. Fewer threads are necessary to handle the requests while the relaxed target SLO is still met.
In this case the optimal configuration utilizes 7 threads and is capable of handling more than 1,000,000 QPS with tail latency below 3 ms.
The interpretation above shows that there is a trade-off between throughput and latency. With these results available, you can easily decide how many resources you need to dedicate to memcached in order to meet capacity and SLO requirements.
In the cases described above:
- sticking to the strict latency requirement (SLO of 500 us), 11 threads are needed to achieve 900,000 QPS,
- relaxing the latency requirement (SLO of 3 ms), it is enough to dedicate just 7 threads to achieve 1,300,000 QPS.
Examples of other experiment configurations that may help validate various environments:
- Limiting the number of memcached worker threads can prevent Linux scheduler balancing problems.
- Pinning each memcached worker thread to a separate CPU, to make sure that workers are never migrated away.
- Running the experiment on a Kubernetes cluster using Kubernetes isolation mechanisms.
The results can help answer the following questions:
- Can running the service in containers on a Kubernetes cluster cause performance degradation?
- Is the performance improvement from thread pinning worth the complexity?