Home
ServiceQ is a fault-tolerant HTTP load balancer. All incoming HTTP(S) requests are routed via a probabilistic algorithm to the cluster of upstream nodes and are buffered in case of total cluster shutdown (e.g., network failures or API errors). The buffering assures clients that requests will be executed irrespective of the present state of the system.
The graph below shows the routing probability (P) of a down node (D) in an 8-node cluster with respect to the number of requests (r). Notice how quickly the probability drops once incoming requests on D start to fail. Depending on the request rate, it takes only a few seconds (sometimes even milliseconds) to move all requests away from D, thus ensuring more requests are routed to healthier nodes. Note that even when requests keep failing on D (however few), ServiceQ retries them on other nodes until they succeed.
If a request fails on all the nodes, ServiceQ buffers it and then periodically retries it on the cluster until it succeeds. If the buffer is full, all incoming requests are rejected. Below is the state transition diagram of the system with respect to an incoming request.
R: (Incoming Request)
G: (ServiceQ Router), Q: (ServiceQ Deferred Queue)
PCS: (Partial Cluster Shutdown), TCS: (Total Cluster Shutdown)
All configuration is handled by the file sq.properties. The permanent location of this file is /usr/local/serviceq/config/sq.properties. There are 4 mandatory properties (LISTENER_PORT, PROTO, ENDPOINTS, CONCURRENCY_PEAK); the rest are optional or can be left at their defaults. The default LISTENER_PORT is 5252 and the default PROTO is http. Every property goes on a separate line and can be commented out with a # prefix.
LISTENER_PORT=5252
# Same for http and https
PROTO=http
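The file format described above (one KEY=VALUE per line, # for comments) can be read with a few lines of code. The following is a minimal sketch for tooling around ServiceQ, not ServiceQ's own parser; the function names `parse_properties` and `missing_mandatory` are hypothetical.

```python
# Mandatory property names as listed in the text above.
MANDATORY = ("LISTENER_PORT", "PROTO", "ENDPOINTS", "CONCURRENCY_PEAK")


def parse_properties(text):
    """Parse KEY=VALUE lines, skipping blank lines and #-prefixed comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line; ignored in this sketch
        props[key.strip()] = value.strip()
    return props


def missing_mandatory(props):
    """Return the mandatory property names that are absent from props."""
    return [name for name in MANDATORY if name not in props]
```

A check like `missing_mandatory(parse_properties(open(path).read()))` could be run before a deployment to catch an incomplete file early.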
The group of upstream services, deployed on a set of servers/ports, can be added as a comma-separated list to ENDPOINTS. Make sure every endpoint includes a scheme; the port is optional. If no port is provided, ServiceQ assumes http and https endpoints are running on 80 and 443 respectively. Although not recommended, the endpoint list can contain a combination of services running on both http and https schemes.
ENDPOINTS=http://my.server1.com:8080,http://my.server2.com:8080,http://my.server3.com:8080
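To illustrate the scheme and default-port rules, here is a small sketch that validates an ENDPOINTS value the way the text describes. It is an illustration only; `parse_endpoints` is a hypothetical helper and not part of ServiceQ.

```python
from urllib.parse import urlsplit

# Default ports per scheme, as stated in the text above.
DEFAULT_PORTS = {"http": 80, "https": 443}


def parse_endpoints(value):
    """Turn a comma-separated ENDPOINTS value into (scheme, host, port) tuples."""
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        parts = urlsplit(item)
        # Every endpoint must carry an explicit http or https scheme.
        if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
            raise ValueError(f"endpoint needs an http(s) scheme and a host: {item!r}")
        port = parts.port or DEFAULT_PORTS[parts.scheme]
        endpoints.append((parts.scheme, parts.hostname, port))
    return endpoints
```

An endpoint written without a scheme, such as `my.server1.com:8080`, is rejected by this sketch rather than guessed at.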
ServiceQ handles and distributes concurrent requests well and is limited only by the load-handling capabilities of the upstream services. It is therefore encouraged to load test the services (any load tester such as ab, wrk, jmeter or httpress would do) and find the maximum load the cluster can take. Determining this at cluster level is important because the bottleneck is usually a central database, message queue or throttled third-party service that is accessed from all services. Both the active connections and the deferred queue are governed by this limit.
CONCURRENCY_PEAK=2048
If the concurrency peak is reached, ServiceQ responds with 429 Too Many Requests. If, due to any block, ServiceQ is unable to accept new connections, they remain queued in the Linux receive buffer, and clients may experience slowness. It is therefore advisable to set client timeouts for such scenarios.
If n out of n nodes are down, the requests are queued up. They are forwarded in FIFO order as soon as any one node becomes available. Although the system places no restriction, unless asked to, on the kind of requests that can be queued and forwarded, it is important to note the implications. ServiceQ responds with 503 Service Unavailable to the client if all nodes are down. Deferred requests are therefore desirable when the request uses an HTTP method that changes the state of the system and the client's workflow does not depend on the response. So a fire-and-forget PUT request, sent when all services were down, will still update the destination system, albeit at a later point in time. On the other hand, if the client has exited after firing a GET request and ServiceQ fetches the response on next availability, the result of the GET is lost and is pure overhead to the system. This should be avoided. The user controls whether queueing is enabled and which kinds of requests are considered for queueing (for example, we might want only POST/PUT/PATCH/DELETE on specific routes to be buffered).
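The concurrency limit can be pictured as a fixed number of slots: a request either takes a free slot or is turned away at once with 429. The sketch below shows that idea with a non-blocking semaphore. It is an assumption about the general technique, not ServiceQ's implementation, and `ConcurrencyGate` and `handle` are hypothetical names.

```python
import threading


class ConcurrencyGate:
    """Admit at most `peak` requests at a time (the CONCURRENCY_PEAK idea)."""

    def __init__(self, peak):
        self._slots = threading.BoundedSemaphore(peak)

    def try_enter(self):
        # Non-blocking: returns False immediately when every slot is taken.
        return self._slots.acquire(blocking=False)

    def leave(self):
        self._slots.release()


def handle(gate, forward):
    """Run forward() if a slot is free, otherwise answer 429 without waiting."""
    if not gate.try_enter():
        return 429, '{"sq_msg": "Request Discarded"}'
    try:
        return forward()
    finally:
        gate.leave()
```

Rejecting immediately, instead of waiting for a slot, keeps latency predictable for clients, which can then back off and retry on their own schedule.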
Enable/Disable deferred queue
ENABLE_DEFERRED_Q=true
Format of the requests to enable the deferred queue for (add a ! suffix to disable a single method+route combination). The first token should always be the method, followed by an optional route and an optional !. This option only works if ENABLE_DEFERRED_Q=true. A few examples -
# Buffer all
DEFERRED_Q_REQUEST_FORMATS=ALL
# Buffer these methods
DEFERRED_Q_REQUEST_FORMATS=POST,PUT,PATCH,DELETE
# Buffer POST except /orders, block PUT except PUT /orders
DEFERRED_Q_REQUEST_FORMATS=POST /orders !,POST,PUT /orders,PATCH,DELETE
# Buffer POST /orders, PATCH
DEFERRED_Q_REQUEST_FORMATS=POST /orders,PATCH
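One way to read the examples above is as a small rule list: each token names a method, optionally a route, and optionally a ! that excludes the combination. The sketch below encodes that reading. It assumes whitespace-separated tokens, exact route matching, and that an exclusion wins over a broader rule; none of these is confirmed by the source, and `parse_formats` and `should_buffer` are hypothetical names.

```python
def parse_formats(value):
    """Parse DEFERRED_Q_REQUEST_FORMATS into (method, route, negated) rules."""
    rules = []
    for token in value.split(","):
        parts = token.split()
        negated = bool(parts) and parts[-1] == "!"
        if negated:
            parts = parts[:-1]
        if not parts:
            continue  # empty token, or a lone '!'
        method = parts[0].upper()
        route = parts[1] if len(parts) > 1 else None  # None means any route
        rules.append((method, route, negated))
    return rules


def should_buffer(rules, method, path):
    """Decide whether a request is eligible for the deferred queue."""
    method = method.upper()
    matches = [
        rule for rule in rules
        if rule[0] in ("ALL", method) and rule[1] in (None, path)
    ]
    # Assumption: a matching '!' rule overrides any broader matching rule.
    if any(negated for _, _, negated in matches):
        return False
    return bool(matches)
```

With the third example above, `POST /users` and `PUT /orders` are buffered under this reading, while `POST /orders` and `PUT /users` are not.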
ServiceQ's approach to routing involves building an error feedback loop on top of a randomized + round-robin algorithm. The routing algorithm takes into account the retry attempt and the current/past service state in order to choose the best possible route. Because of the error feedback, the probability of choosing an unhealthy node reduces over time. So, if all nodes are up, there is <5% deviation, and it increases if a few nodes are deemed unhealthy. If all nodes are down and ServiceQ fails to process the request successfully, the request is queued to be processed later (if eligible).
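To make the feedback loop concrete, here is a toy model of error-weighted random routing with retries on other nodes. It is not ServiceQ's algorithm: the weight formula, the way errors decay on success, and the names `FeedbackRouter` and `route` are all assumptions chosen only to show how failures can push traffic away from a node.

```python
import random


class FeedbackRouter:
    """Pick nodes at random, weighted down by their recent error count."""

    def __init__(self, nodes, rng=None):
        self.errors = {node: 0 for node in nodes}
        self.rng = rng or random.Random()

    def weight(self, node):
        # Hypothetical decay: each recorded error shrinks the node's weight.
        return 1.0 / (1 + self.errors[node]) ** 2

    def probability(self, node):
        total = sum(self.weight(n) for n in self.errors)
        return self.weight(node) / total

    def pick(self, exclude=()):
        candidates = [n for n in self.errors if n not in exclude]
        if not candidates:
            return None
        weights = [self.weight(n) for n in candidates]
        return self.rng.choices(candidates, weights=weights, k=1)[0]

    def report(self, node, ok):
        # Feedback step: failures raise the count, successes let it recover.
        if ok:
            self.errors[node] = max(0, self.errors[node] - 1)
        else:
            self.errors[node] += 1


def route(router, send):
    """Try nodes until one succeeds; return None if every node failed."""
    tried = set()
    while True:
        node = router.pick(exclude=tried)
        if node is None:
            return None  # all nodes failed; the caller may buffer the request
        ok = send(node)
        router.report(node, ok)
        if ok:
            return node
        tried.add(node)
```

In this toy model, a node in an 8-node cluster starts with a routing probability of 1/8, and three recorded failures bring it below 1%.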
n-node cluster healthy, all nodes up
Active connections are forwarded to one of the nodes in the cluster. The node is chosen by the routing algorithm. The maximum number of active connections is governed by the CONCURRENCY_PEAK setting.
n-node cluster unhealthy, [1:n-1] nodes down
The process is the same as above, except that the error rate increases; it is stored in a hashset against the service address and logged to disk with the appropriate error code.
n-node cluster unhealthy, n nodes down
The error rate shoots up to 100%, and the request is buffered in a FIFO queue. The buffered request remains in the queue until at least one service becomes available again. If active connections are being accepted at that moment, they are forwarded concurrently with the buffered requests. There is no precedence logic here.
It is good practice to set timeouts on outgoing requests, so that long-running requests are cut short, connections are freed up and latency is kept in check. To enable this behaviour, ServiceQ adds a timeout to every outgoing request to the cluster. The default value is 5s and should be kept low to allow faster retries.
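The buffering behaviour can be sketched as a bounded FIFO queue that is drained in arrival order once a node answers again. This is an illustrative model under the assumption that the queue rejects new entries when full; `DeferredQueue` and its methods are hypothetical names, not ServiceQ internals.

```python
from collections import deque


class DeferredQueue:
    """Bounded FIFO buffer for requests that could not be delivered."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = deque()

    def offer(self, request):
        """Buffer a request; return False if the queue is already full."""
        if len(self._items) >= self.capacity:
            return False
        self._items.append(request)
        return True

    def drain(self, send):
        """Forward buffered requests oldest-first until one fails to send."""
        delivered = []
        while self._items:
            # Peek first, so a failed send leaves the request in the queue.
            if not send(self._items[0]):
                break
            delivered.append(self._items.popleft())
        return delivered
```

Stopping at the first failed send preserves FIFO order: later requests never overtake an earlier one that is still waiting.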
# Timeout (s) is added to each outgoing request to endpoints; existing timeouts are overridden. A value of -1 means no timeout.
OUTGOING_REQUEST_TIMEOUT=5
ServiceQ can add custom headers to the client responses. These are added as a pipe-separated list to CUSTOM_RESPONSE_HEADERS. It is recommended to test the headers thoroughly before adding them, as some of them can adversely affect the client.
CUSTOM_RESPONSE_HEADERS=Connection: keep-alive|Server
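When checking a header list before deploying it, it can help to split the value the same way the text describes: on the pipe, then on the first colon. The following is a small sketch for such a check; `parse_custom_headers` is a hypothetical helper, and treating an entry without a colon as a header with an empty value is an assumption.

```python
def parse_custom_headers(value):
    """Split a pipe-separated header list into (name, value) pairs."""
    headers = []
    for entry in value.split("|"):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the first ':' only; header values may contain colons.
        name, _, header_value = entry.partition(":")
        headers.append((name.strip(), header_value.strip()))
    return headers
```

Listing the parsed pairs makes it easy to spot a stray separator or an unintended header before clients see it.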
If upstream and clients are both alive, ServiceQ simply tunnels the response from upstream to client. In case of failures, relevant responses are provided to help the client recover. For example -
Upstream Connected - Tunneled Response
Concurrent Conn Limit Exceed - 429 Too Many Requests ({"sq_msg": "Request Discarded"})
All Nodes Are Down - 503 Service Unavailable ({"sq_msg": "Request Buffered"})
Request Timed Out - 504 Gateway Timeout
Request Malformed - 400 Bad Request
Undeterministic Error - 502 Bad Gateway
ServiceQ detects and logs three types of errors: ServiceQ Flooded (Error Code 601), Service Unavailability (Error Code 701) and HTTP Request (Error Code 702) errors. Error Code 702 covers upstream timeouts, malformed requests and unexpected connection loss. Errors are logged to /usr/local/serviceq/logs/serviceq_error.log and follow this format -
ServiceQ: 2020/06/18 14:10:28 Error detected on https://api.server0.org:8001 [Code: 601, SERVICEQ_FLOODED]
ServiceQ: 2020/06/18 14:10:28 Error detected on https://api.server0.org:8001 [Code: 701, UPSTREAM_DOWN]
ServiceQ: 2020/06/18 14:11:12 Error detected on https://api.server1.org:8002 [Code: 702, UPSTREAM_TIMED_OUT]
ServiceQ: 2020/06/18 14:13:33 Error detected on https://api.server1.org:8002 [Code: 702, UPSTREAM_TIMED_OUT]
Data-related errors detected from upstream are tunneled directly to the client without logging.
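For monitoring, the error log entries can be parsed with a regular expression built from the sample format shown above. This is a sketch that assumes every entry follows that sample exactly, one entry per line; `parse_error_line` is a hypothetical helper.

```python
import re
from datetime import datetime

# Pattern derived from the sample log entries; assumes one entry per line.
LOG_LINE = re.compile(
    r"^ServiceQ: (?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"Error detected on (?P<endpoint>\S+) "
    r"\[Code: (?P<code>\d+), (?P<name>[A-Z_]+)\]$"
)


def parse_error_line(line):
    """Return a dict for a matching log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "time": datetime.strptime(match["ts"], "%Y/%m/%d %H:%M:%S"),
        "endpoint": match["endpoint"],
        "code": int(match["code"]),
        "name": match["name"],
    }
```

Counting parsed entries per endpoint and code gives a quick view of which upstream is failing and why.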
ServiceQ comes with complete TLS/SSL support making proxy connections more secure. By default, SSL is disabled. It can be enabled by setting SSL_ENABLE to true.
SSL_ENABLE=true
An SSL handshake requires an SSL certificate and a private key. There are two ways to add them in ServiceQ -
Automatic
ServiceQ can automatically obtain and store an SSL certificate and private key from the designated CA (Let's Encrypt). This can be enabled by setting SSL_AUTO_ENABLE to true.
SSL_AUTO_ENABLE=true
For the issuance process to succeed, the user also needs to configure the information below -
# Any path with appropriate read/write permissions will work
SSL_AUTO_CERTIFICATE_DIR=/etc/ssl/certs
# Any email
[email protected]
# Domain pointing to whichever IP/port serviceq is running
SSL_AUTO_DOMAIN_NAMES=myservice.com
# Renew before x days of expiration
SSL_AUTO_RENEW_BEFORE=30
Note that SSL_AUTO_ENABLE=true is only considered if SSL_ENABLE=true.
Manual
Obtain the SSL certificate and private key files from a CA yourself and configure them as below -
# Any path with appropriate read permissions will work
SSL_CERTIFICATE_FILE=/usr/certs/cert.pem
SSL_PRIVATE_KEY_FILE=/usr/certs/key.pem
Note that manual mode is only considered if SSL_ENABLE=true and SSL_AUTO_ENABLE=false.
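The two notes on SSL_ENABLE and SSL_AUTO_ENABLE amount to a simple precedence rule, sketched below. The function operates on already-parsed properties given as a dict of strings; `ssl_mode` and `as_bool` are hypothetical names, and treating a missing key as false is an assumption (the text only states that SSL is disabled by default).

```python
def as_bool(value):
    """Interpret a property value; only 'true' (any case) counts as enabled."""
    return str(value).strip().lower() == "true"


def ssl_mode(props):
    """Resolve which SSL mode applies: 'disabled', 'automatic' or 'manual'."""
    # SSL_ENABLE gates everything; the other keys are ignored when it is off.
    if not as_bool(props.get("SSL_ENABLE", "false")):
        return "disabled"
    if as_bool(props.get("SSL_AUTO_ENABLE", "false")):
        return "automatic"
    return "manual"
```

For example, a file with SSL_AUTO_ENABLE=true but without SSL_ENABLE=true resolves to 'disabled', which matches the first note above.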
To improve SSL performance, it is advisable to add a keep-alive header to the CUSTOM_RESPONSE_HEADERS key. When using keep-alive, an optional timeout can be set, after which the persistent TCP connection is dropped and the client has to re-establish the connection.
KEEP_ALIVE_TIMEOUT=120
Send questions, suggestions and comments to [email protected]