- For incoming connections, ensure any stateful packetfilter on the system
has plenty of headroom regarding its table sizes:
- iptables conntrack_max/pf states
- today's systems have enough memory
- for iptables, make sure conntrack_max and the hashtable size match
- conntrack_max = hashsize * 8 (if conntrack_max is set manually; see net/netfilter/nf_conntrack_core.c)
- e.g. 1048576 = 131072 * 8
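- a minimal Linux sketch of those two knobs (file name and values are only examples, keep the ratio above):
    # /etc/sysctl.d/90-conntrack.conf -- example sizing, not a recommendation
    net.netfilter.nf_conntrack_max = 1048576
    # the hash table size is a module parameter, not a sysctl:
    #   echo 131072 > /sys/module/nf_conntrack/parameters/hashsize
    # or persistently via /etc/modprobe.d/:
    #   options nf_conntrack hashsize=131072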
- If there are spikes in the amount of incoming connections,
have larger accept queues per listener:
- kern.ipc.soacceptqueue=32768 (FreeBSD)
- net.core.somaxconn=32768 (Linux)
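- raising the sysctls above only lifts the kernel-side cap; the listener itself must also request a large backlog. A rough HAproxy sketch (names, paths and numbers are placeholders):
    frontend fe_https
        bind :443 ssl crt /etc/haproxy/site.pem
        backlog 32768
        maxconn 100000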
- For outgoing connections, ensure there are plenty of free local TCP
ports available:
- net.inet.ip.portrange.first=1024 (FreeBSD)
- net.inet.ip.portrange.last=65535 (FreeBSD)
- net.ipv4.ip_local_port_range=1024 65535 (Linux)
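- rough math: the range above leaves 65535 - 1024 + 1 = 64512 local ports per (source IP, destination IP, destination port) tuple; with closed connections lingering ~60 s in TIME_WAIT, that caps sustained new outgoing connections to a single backend address at roughly 64512 / 60 ≈ 1075 per second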
- Handshake time directly affects application response latency and CPU cycle consumption
- Use TLS 1.3
- three-step handshake instead of four
  TLS 1.2 (full handshake):                TLS 1.3 (full handshake):
  Client                 Server            Client                 Server
  ------------------------------           ------------------------------
  0) ---> TCP SYN --->                     0) ---> TCP SYN --->
  0) <--- TCP SYN ACK <---                 0) <--- TCP SYN ACK <---
  0) ---> TCP ACK --->                     0) ---> TCP ACK --->
  1) ---> Client Hello --->                1) ---> Client Hello,
  2) <--- Server Hello,                            Key Share --->
          Certificate,                     2) <--- Server Hello,
          Server Hello Done <---                   Key Share,
  3) ---> Cipher Key Exchange,                     Certificate,
          Change Cipher Spec,                      Certificate Verify,
          Finished --->                            Finished <---
  4) <--- Change Cipher Spec,              3) ---> Finished,
          Finished <---                            HTTP Request --->
  5) ---> HTTP Request --->                4) <--- HTTP Response <---
  6) <--- HTTP Response <---
- Use TLS session resumption
- default setting in newer HAproxy
- Nginx: ssl_session_cache, ssl_session_timeout
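- a minimal Nginx sketch of those two directives (sizes and timeout are examples; per the Nginx docs, 1 MB of shared cache holds roughly 4000 sessions):
    ssl_session_cache   shared:SSL:50m;
    ssl_session_timeout 1d;
    ssl_session_tickets on;    # stateless resumption via session tickets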
- If TLS session resumption is useful in reducing latency, consider
TCP Fast Open (RFC 7413)
- reduces TCP handshake to two steps on reconnect
- server side support: Linux, FreeBSD, HAproxy, Nginx
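- a minimal Linux/HAproxy sketch (the bind line and certificate path are placeholders):
    # Linux: 1 = client, 2 = server, 3 = both
    sysctl -w net.ipv4.tcp_fastopen=3
    # HAproxy: enable TFO on the listening socket
    bind :443 tfo ssl crt /etc/haproxy/site.pem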
- Use 2048 bit RSA key pair instead of 4096 bit
- if security is of lesser importance and latency of higher importance (unlike online payments, for instance)
- Announce Server Preferred Ciphers
- omit slow ciphers such as 3DES
- have a look at openssl speed
- consider using different compiler optimisation (-O3, default setting for OpenSSL)
- consider using AES-NI instructions in your CPU
- Intel E7-4830 v4 has them (check with openssl version -a)
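- a quick way to compare key sizes and ciphers on the actual hardware (the -evp form exercises the hardware-accelerated code path where available):
    openssl speed rsa2048 rsa4096
    openssl speed -evp aes-128-gcm
    openssl speed -evp aes-256-gcm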
- Bind SSL-proxy software to CPU cores
- have fewer processor cross-calls
- have fewer NUMA remote memory accesses
- quad socket Intel E7-4830 v4 is a NUMA system
- memory latencies:
- socket local: 135 ns
- remote: 194-202 ns (source)
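- the node layout and distances on a given box can be checked with standard tools:
    numactl --hardware    # node count, per-node CPUs, node distance matrix
    lscpu                 # NUMA node to CPU mapping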
- HAproxy:
- use option nbproc 52 or nbproc 104, depending on whether Hyperthreading is used, leaving 4 cores (8 threads) for other processes:
    cpu-map 1 4
    cpu-map 2 5
    cpu-map 3 6
    [...]
    cpu-map 50 55
- maybe have four HAproxy master processes, one on each socket, with their child processes on the CPU cores of that socket
- balance static-rr might save some CPU cycles
- test whether changes actually yield different performance (best testing is always in production: start removing nodes, watch metrics in Grafana or similar, stop once you reach obvious limits, analyze, improve)
- At system level:
- prometheus node_exporter or collectd
- CPU usage and saturation
- sys, intr, user, nice, idle, iowait (Munin style)
- iptables conntrack/pf states
- network established/time_wait/fin_wait connections
- number of open files (correlates with network connections)
- network bandwidth
- network packets (for even workload, correlates with network bandwidth)
- context switches/interrupts (correlates with network packets)
- storage latency for logfiles
- maybe have a CPU usage graph showing distribution among cores
- maybe processor cross-calls (mpstat "xcal")
- if your workload has load spikes, consider sampling metrics at shorter intervals (1s), calculate maximum, 99th percentile, median, and present that data to prometheus/collectd (prior aggregation; rough sketch below)
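- a rough shell sketch of that pre-aggregation idea, sampling the conntrack count once per second for a minute (metric and interval are just examples):
    for i in $(seq 60); do
        cat /proc/sys/net/netfilter/nf_conntrack_count
        sleep 1
    done | sort -n | awk '{ v[NR] = $1 } END {
        printf "median %d  p99 %d  max %d\n", v[int(NR*0.50)+1], v[int(NR*0.99)], v[NR] }'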
At application level:
- Curl your application from within your network, from your SSL-gateways,
from the outside world:
- time_connect
- time_pretransfer
- time_total
- see curl-stats.c:
    % ./curl-stats
    Total download seconds: 2.025787
    Total pretransfer seconds: 1.453561
    Name lookup seconds: 0.005250
    Connect time seconds: 0.788709
- use this with an IP address, circumventing DNS latency, and watch the connect-time metric over time
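- the same timings are also available directly from curl's -w option; a one-liner sketch (the IP is a placeholder, -k is needed because the certificate will not match a bare IP):
    curl -s -k -o /dev/null \
         -w 'connect: %{time_connect}s  pretransfer: %{time_pretransfer}s  total: %{time_total}s\n' \
         https://203.0.113.10/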
Have some nice and sleek looking graphical frontend that people enjoy using before the trouble starts.
Grafana is slow, but can do it.
In general, we need three types of monitoring:
- long term statistics
- "How did it look like one year ago?"
- "Do we need to add more machines in a month?"
- current statistics
- "Did my application deploy go well or wrong?"
- This needs to be usable for developers
- immediate metrics
- "What is going on right now?"
- top, vmstat, iostat, mpstat, perf, dtrace, tail -F logfile, etc.
- read Brendan Gregg's blog
HAproxy logs:
- logging to local file:
- expect large files
- don't keep old files
- skip compression during log rotation
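- a minimal logrotate sketch along those lines (path and retention are examples):
    /var/log/haproxy.log {
        daily
        rotate 2        # do not keep old files around
        nocompress      # skip compression
        missingok
        copytruncate    # avoid having to signal the logging daemon
    }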
- options:
- for less data: option dontlog-normal
- for even less data: http-request set-log-level silent if ...
- custom log-format:
- %Th - connection handshake time (SSL)
- "100% of handshakes above 40ms are from the US."
- %Ti - idle time before the HTTP request
- if handshake time is zero but idle time is high, we are reusing connections
- if this is written to logfiles, have some monitoring plugin sampling from this logfile (tail) and calculate maximum, 99th percentile, median
- it would be interesting to see how many transactions we have and how many required a fresh SSL handshake
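- a sketch of such a log-format line using those fields (check that your HAproxy version supports %Th/%Ti; adapt the rest to taste):
    log-format "%ci:%cp [%tr] %ft %b/%s %Th/%Ti/%TR/%Tw/%Tc/%Tr/%Ta %ST %B %{+Q}r"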
- Syslog: rsyslog can do random log sampling, might miss rare interesting events
- central logging:
- HAproxy → /var/log/haproxy.log → Filebeat → Elasticsearch → Kibana
- HAproxy → Syslog (UDP) → Logstash → Elasticsearch → Kibana
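For the UDP syslog variant, a minimal configuration sketch (facility and addresses are placeholders):
    # haproxy.cfg: send logs to the local syslog daemon on facility local0
    global
        log 127.0.0.1:514 local0
    # rsyslog: forward facility local0 to the central collector via UDP
    local0.*    @192.0.2.20:514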