Home
ServiceQ is a fault-tolerant HTTP load balancer. All incoming HTTP/HTTPS requests are load balanced across multiple endpoints based on an adaptive algorithm and are buffered in case of total cluster failure (e.g. network failures or API errors). The buffering assures clients that their requests will be executed irrespective of the present state of the system.
All configuration is handled by the file sq.properties. The permanent location of this file is /usr/local/serviceq/config/sq.properties. There are 4 mandatory properties: LISTENER_PORT, PROTO, ENDPOINTS and CONCURRENCY_PEAK. The rest are optional or can be left at their defaults. The default LISTENER_PORT is 5252 and the default PROTO is http. Every property goes on a separate line and can be commented out with a # prefix.
LISTENER_PORT=5252
PROTO=http (same for http and https)
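As a rough illustration of the file format described above (one property per line, `#` for comments), the following Python sketch parses the text of such a file into a dictionary. The names `parse_properties`, `DEFAULTS` and `MANDATORY` are assumptions made for this example; this is not ServiceQ's own parser.

```python
# Hypothetical helper, not part of ServiceQ: parse sq.properties-style text.
DEFAULTS = {"LISTENER_PORT": "5252", "PROTO": "http"}  # defaults stated in the docs
MANDATORY = ("LISTENER_PORT", "PROTO", "ENDPOINTS", "CONCURRENCY_PEAK")


def parse_properties(text):
    """Return a dict of KEY -> value, skipping blank lines and # comments."""
    props = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=value line: {raw!r}")
        props[key.strip()] = value.strip()
    # Assumption: a mandatory key with a documented default may be omitted.
    missing = [k for k in MANDATORY if k not in props]
    if missing:
        raise ValueError(f"missing mandatory properties: {missing}")
    return props
```

All values are kept as strings here; converting them (for example `CONCURRENCY_PEAK` to an integer) is left to the caller.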
The group of downstream services, deployed on a set of servers/ports, can be added as a comma-separated list to ENDPOINTS. Make sure every endpoint includes a scheme; the port is optional. If the port is not provided, ServiceQ assumes http and https endpoints are running on ports 80 and 443 respectively. Although it is not recommended, the endpoint list can contain a combination of services running on both http and https schemes.
ENDPOINTS=http://my.server1.com:8080,http://my.server2.com:8080,http://my.server3.com:8080
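The default-port rule above can be sketched in a few lines of Python. The function name `parse_endpoints` and the tuple layout are assumptions for illustration only; the sketch simply applies the stated rule (80 for http, 443 for https when no port is given).

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def parse_endpoints(value):
    """Split a comma-separated ENDPOINTS value into (scheme, host, port) tuples."""
    endpoints = []
    for item in value.split(","):
        parts = urlsplit(item.strip())
        # Every endpoint must carry an explicit http or https scheme.
        if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
            raise ValueError(f"endpoint needs an http/https scheme: {item!r}")
        port = parts.port or DEFAULT_PORTS[parts.scheme]
        endpoints.append((parts.scheme, parts.hostname, port))
    return endpoints
```

For example, `parse_endpoints("http://my.server1.com:8080,https://my.server2.com")` yields port 8080 for the first endpoint and falls back to 443 for the second.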
ServiceQ can handle and distribute concurrent requests well, and is limited only by the load handling capabilities of the downstream services. It is therefore encouraged to load test the services (any load tester such as ab, wrk, jmeter or httpress would do) and find the maximum load the cluster can take. Determining this at cluster level is important because the bottleneck is usually a central database, message queue or throttled third-party service that is accessed from all services. Both the active connections and the deferred queue are governed by this limit.
CONCURRENCY_PEAK=2048
If the concurrency peak is reached, ServiceQ responds with 429 Too Many Requests. If ServiceQ is blocked for any reason and unable to accept new connections, they remain queued in the Linux receive buffer, so clients may experience slowness. It is therefore advisable to set client timeouts for such scenarios.
If n out of n nodes are down, requests are queued up and forwarded in FIFO order as soon as any node becomes available again. ServiceQ responds with 503 Service Unavailable to the client if all nodes are down. Unless asked to, the system places no restriction on the kind of requests that can be queued and forwarded, so it is important to note the implications.
Deferred requests are desirable for HTTP methods that change the state of the system when the client's workflow does not depend on the response. A fire-and-forget PUT request sent while all services are down will still update the destination system, albeit at a later point in time. On the other hand, if the client has exited after firing a GET request and ServiceQ fetches the response on next availability, the result of the GET is lost and the request is pure overhead to the system. This should be avoided. The user therefore controls whether queueing is enabled and which kinds of requests are considered for queueing (for example, buffering only POST/PUT/PATCH/DELETE on specific routes).
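One way to picture the concurrency cap is a non-blocking admission check: a request either takes one of CONCURRENCY_PEAK slots or is rejected with 429 straight away. The sketch below is an assumption about how such a gate could look; `ConcurrencyGate` and `status_for_new_request` are hypothetical names, not ServiceQ internals.

```python
import threading


class ConcurrencyGate:
    """Hypothetical admission check: at most `peak` requests in flight."""

    def __init__(self, peak):
        self._slots = threading.BoundedSemaphore(peak)

    def try_enter(self):
        # Non-blocking: returns False immediately when the peak is reached.
        return self._slots.acquire(blocking=False)

    def leave(self):
        # Call once the request has been answered, to free its slot.
        self._slots.release()


def status_for_new_request(gate):
    """Return 429 if the request must be rejected, or None if it may proceed."""
    return None if gate.try_enter() else 429
```

A caller that receives `None` is expected to call `gate.leave()` when the request finishes, typically in a `finally` block.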
Enable/Disable deferred queue
ENABLE_DEFERRED_Q=true
Format of requests to enable the deferred queue for. The first token must always be the method, followed by an optional route and an optional ! (suffix ! to disable queueing for a single method+route combination). This option only works if ENABLE_DEFERRED_Q=true. A few examples:
DEFERRED_Q_REQUEST_FORMATS=ALL (buffer all)
DEFERRED_Q_REQUEST_FORMATS=POST,PUT,PATCH,DELETE (buffer these methods)
DEFERRED_Q_REQUEST_FORMATS=POST /orders !,POST,PUT /orders,PATCH,DELETE (buffer POST except /orders, block PUT except PUT /orders)
DEFERRED_Q_REQUEST_FORMATS=POST /orders,PATCH (buffer POST /orders, PATCH)
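To make the rule syntax concrete, here is a small Python sketch that parses a DEFERRED_Q_REQUEST_FORMATS value and decides whether a request would be eligible for buffering. It is one reading of the examples above, not ServiceQ's implementation. In particular, it assumes routes are compared by exact match, that `!` is a separate whitespace-delimited token, and that an exclusion rule always wins over an inclusion rule. `parse_formats` and `is_eligible` are hypothetical names.

```python
def parse_formats(value):
    """Parse 'METHOD [route] [!]' items into (method, route, negated) rules."""
    rules = []
    for item in value.split(","):
        tokens = item.split()
        negated = bool(tokens) and tokens[-1] == "!"
        if negated:
            tokens = tokens[:-1]
        if not tokens:
            continue  # ignore empty items
        method = tokens[0].upper()
        route = tokens[1] if len(tokens) > 1 else None
        rules.append((method, route, negated))
    return rules


def is_eligible(method, path, rules):
    """Return True if a request with this method and path may be buffered."""
    method = method.upper()

    def matches(rule):
        rule_method, rule_route, _ = rule
        # Assumption: a route matches only when it equals the path exactly.
        return rule_method in ("ALL", method) and rule_route in (None, path)

    # Assumption: an exclusion (!) overrides any inclusion that also matches.
    if any(rule[2] and matches(rule) for rule in rules):
        return False
    return any(not rule[2] and matches(rule) for rule in rules)
```

With the third example, `POST /users` and `PUT /orders` are eligible, while `POST /orders`, `PUT /users` and any `GET` are not.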
ServiceQ's approach to routing involves building an error feedback loop on top of a randomized + round-robin algorithm. The routing algorithm takes into account the retry attempt and the current/past service state in order to choose the best possible route. Because of the error feedback, the probability of choosing an unhealthy node reduces over time. If all nodes are up, there is <5% deviation, and it increases if a few nodes are deemed unhealthy. If all nodes are down and ServiceQ fails to process the request, the request is queued to be processed later (if eligible).
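The idea of error feedback lowering the chance of picking an unhealthy node can be illustrated with a weighted random choice. The sketch below is deliberately simplified and is not ServiceQ's routing algorithm: it uses plain weighted randomness rather than the randomized + round-robin scheme, and the weight formula `1 / (1 + errors)` as well as the name `FeedbackRouter` are assumptions chosen for the example.

```python
import random


class FeedbackRouter:
    """Hypothetical router: nodes with more recorded errors are picked less often."""

    def __init__(self, nodes, seed=None):
        self._errors = {node: 0 for node in nodes}
        self._rng = random.Random(seed)

    def record_error(self, node):
        self._errors[node] += 1

    def record_success(self, node):
        # Decay the penalty so a recovered node regains traffic over time.
        self._errors[node] = max(0, self._errors[node] - 1)

    def choose(self, exclude=()):
        """Pick a node, skipping those already tried in this request's retries."""
        candidates = [n for n in self._errors if n not in exclude]
        if not candidates:
            return None  # every node was tried; the caller may defer the request
        weights = [1.0 / (1 + self._errors[n]) for n in candidates]
        return self._rng.choices(candidates, weights=weights, k=1)[0]
```

With no recorded errors all weights are equal, so traffic spreads roughly evenly; each recorded error shrinks a node's share until successes bring it back.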
n-node cluster healthy, all nodes up
Active connections are forwarded to one of the nodes in the cluster. The node is chosen by the routing algorithm. The maximum number of active connections is governed by the CONCURRENCY_PEAK setting.
n-node cluster unhealthy, [1:n-1] nodes down
The process is the same as above, except that the error rate increases; it is stored in a hashset against the service address and logged to disk with the appropriate error code.
n-node cluster unhealthy, n nodes down
The error rate shoots to 100%, and the request is buffered in a FIFO queue. A buffered request remains in the queue until at least one service is available again. If active connections are being accepted at that moment, they are forwarded concurrently with the buffered requests. There is no precedence logic here.
It is good practice to set timeouts on outgoing requests, so that slow requests are cut short, connections are freed up and latency is kept in check. To enable this behaviour, ServiceQ adds a timeout to every outgoing request to the cluster. The default value is 5s; it should be kept low so that retries happen faster.
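The buffering behaviour described here can be sketched as a bounded FIFO queue that is drained oldest-first once a node is reachable again. This is an illustrative model only: `DeferredQueue`, its `max_size` argument (standing in for the CONCURRENCY_PEAK limit) and the `forward` callback are assumptions for the example.

```python
from collections import deque


class DeferredQueue:
    """Hypothetical bounded FIFO buffer for requests that could not be forwarded."""

    def __init__(self, max_size):
        self._items = deque()
        self._max_size = max_size

    def defer(self, request):
        """Buffer a request; return False if the queue is already full."""
        if len(self._items) >= self._max_size:
            return False
        self._items.append(request)
        return True

    def drain(self, forward):
        """Forward buffered requests oldest-first; stop at the first failure."""
        sent = 0
        while self._items:
            # `forward` returns True on success, False if the cluster is still down.
            if not forward(self._items[0]):
                break
            self._items.popleft()
            sent += 1
        return sent
```

A request is removed from the queue only after `forward` reports success, so a failed attempt leaves it at the head for the next drain.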
#Timeout (s) is added to each outgoing request to endpoints, the existing timeouts are overridden, value of -1 means no timeout
OUTGOING_REQUEST_TIMEOUT=5
ServiceQ can add custom headers to client responses. These are added as a pipe-separated list to CUSTOM_RESPONSE_HEADERS. It is recommended to test the headers thoroughly before adding them, as some of them can adversely affect the client.
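For readers wiring a similar setting into their own client code, the `-1 means no timeout` convention maps naturally onto Python's use of `None` for "wait indefinitely". The helper below is a hypothetical sketch; rejecting zero and other negative values is an assumption, since the documentation only defines the meaning of -1.

```python
def to_timeout_seconds(value):
    """Convert an OUTGOING_REQUEST_TIMEOUT string to seconds, or None for no timeout."""
    seconds = int(value)
    if seconds == -1:
        return None  # documented: -1 disables the timeout
    if seconds <= 0:
        # Assumption: other non-positive values are treated as configuration errors.
        raise ValueError(f"invalid timeout: {value!r}")
    return float(seconds)
```

The result can be passed as the `timeout` argument of calls such as `urllib.request.urlopen(url, timeout=...)`.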
CUSTOM_RESPONSE_HEADERS=Connection: keep-alive|Server
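The pipe-separated format can be split into name/value pairs as sketched below. `parse_custom_headers` is a hypothetical helper, and treating an entry without a colon (such as `Server` in the example above) as a header with an empty value is an assumption; the documentation does not say how ServiceQ handles that case.

```python
def parse_custom_headers(value):
    """Split 'Name: value|Name: value' into a list of (name, value) tuples."""
    headers = []
    for item in value.split("|"):
        item = item.strip()
        if not item:
            continue
        # Split on the first colon only, so values may contain colons themselves.
        name, _, header_value = item.partition(":")
        headers.append((name.strip(), header_value.strip()))
    return headers
```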
If downstream services and clients are both alive, ServiceQ simply tunnels the response from downstream to the client. In case of failures, relevant responses are provided to help the client recover. For example:
Downstream Connected - Tunneled Response
Concurrent Conn Limit Exceed - 429 Too Many Requests ({"sq_msg": "Request Discarded"})
All Nodes Are Down - 503 Service Unavailable ({"sq_msg": "Request Buffered"})
Request Timed Out - 504 Gateway Timeout
Request Malformed - 400 Bad Request
Undeterministic Error - 502 Bad Gateway
ServiceQ detects and logs three types of errors: ServiceQ Flooded (Error Code 601), Service Unavailability (Error Code 701) and HTTP Request (Error Code 702) errors. Error Code 702 includes downstream timeouts, malformed requests and unexpected connection loss. Errors are logged to /usr/local/serviceq/logs/serviceq_error.log and follow this format:
ServiceQ: 2018/04/18 14:10:28 Error detected on https://api.server0.org:8001 [Code: 601, SERVICEQ_FLOODED]
ServiceQ: 2018/04/18 14:10:28 Error detected on https://api.server0.org:8001 [Code: 701, DOWNSTREAM_DOWN]
ServiceQ: 2018/04/18 14:11:12 Error detected on https://api.server1.org:8002 [Code: 702, DOWNSTREAM_TIMED_OUT]
ServiceQ: 2018/04/18 14:13:33 Error detected on https://api.server1.org:8002 [Code: 702, DOWNSTREAM_TIMED_OUT]
Data-related errors detected by downstream services are tunneled directly to the client without logging.
ServiceQ comes with HTTP over TLS/SSL support, making proxy connections more secure. SSL is disabled by default. It can be enabled by setting the SSL_ENABLE key and providing the certificate and private key files.
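Because the log lines follow a fixed shape, they are easy to post-process. The Python sketch below parses lines in the format shown above and counts errors per endpoint and code. The regular expression is derived only from the sample lines, so it is an assumption that real logs contain no other variants; `parse_error_line` and `count_errors` are hypothetical names.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern inferred from the sample log lines; other line shapes are not covered.
LINE_RE = re.compile(
    r"^ServiceQ: (?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"Error detected on (?P<endpoint>\S+) "
    r"\[Code: (?P<code>\d+), (?P<name>\w+)\]$"
)


def parse_error_line(line):
    """Return a dict describing one log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "time": datetime.strptime(match["ts"], "%Y/%m/%d %H:%M:%S"),
        "endpoint": match["endpoint"],
        "code": int(match["code"]),
        "name": match["name"],
    }


def count_errors(lines):
    """Count occurrences of each (endpoint, code) pair, ignoring unparsable lines."""
    counts = Counter()
    for line in lines:
        entry = parse_error_line(line)
        if entry is not None:
            counts[(entry["endpoint"], entry["code"])] += 1
    return counts
```

Feeding the four sample lines to `count_errors` gives one 601 and one 701 for the first endpoint and two 702 entries for the second.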
Enable SSL connections
SSL_ENABLE=false
Path to certificate and private key used for TLS handshake
SSL_CERTIFICATE_FILE=/usr/certs/cert.pem
SSL_PRIVATE_KEY_FILE=/usr/certs/key.pem
To improve TLS performance, it is advisable to add a keep-alive header to the CUSTOM_RESPONSE_HEADERS key. When using keep-alive, an optional timeout can be set, after which the persistent TCP connection is dropped and the client has to re-establish the connection.
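For comparison, this is roughly how the same three settings would translate into a server-side TLS context in Python's standard `ssl` module. It is a sketch under the assumption that the properties have already been read into a dictionary of strings; `build_tls_context` is a hypothetical name and the code says nothing about how ServiceQ itself loads certificates.

```python
import ssl


def build_tls_context(props):
    """Return an ssl.SSLContext for a listener, or None when SSL is disabled."""
    # SSL is off unless SSL_ENABLE is explicitly set to true.
    if props.get("SSL_ENABLE", "false").strip().lower() != "true":
        return None
    # Purpose.CLIENT_AUTH yields a context suited to the server side of a connection.
    context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    context.load_cert_chain(
        certfile=props["SSL_CERTIFICATE_FILE"],
        keyfile=props["SSL_PRIVATE_KEY_FILE"],
    )
    return context
```

When SSL is enabled, `load_cert_chain` raises an error if either file is missing or the key does not match the certificate, which surfaces misconfiguration at startup.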
KEEP_ALIVE_TIMEOUT=120
Send questions, suggestions and comments to [email protected]