Home
ServiceQ is an HTTP load balancer and request queue. All incoming HTTP/HTTPS requests are load balanced across multiple endpoints based on an intuitive algorithm and are buffered in case of network failures or API errors. The buffering assures clients that their requests will be executed irrespective of the present state of the system.
All configuration is handled by the file sq.properties, whose permanent location is /opt/serviceq/config/sq.properties. Four properties are mandatory: LISTENER_PORT, PROTO, ENDPOINTS, and CONCURRENCY_PEAK. The rest are optional or can be left at their defaults. The default LISTENER_PORT is 5252 and the default PROTO is http. Each property goes on its own line and can be commented out with a # prefix.
LISTENER_PORT=5252
PROTO=http (same for http and https)
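The file format described above (one KEY=VALUE pair per line, # for comments) can be read with a few lines of code. The following is a minimal sketch for checking a configuration before deployment, not ServiceQ's own parser; the function names `parse_properties` and `missing_mandatory` are hypothetical, and the sketch assumes that only whole-line comments are used.

```python
# Hypothetical helper for validating an sq.properties file; not part of ServiceQ.
MANDATORY = ("LISTENER_PORT", "PROTO", "ENDPOINTS", "CONCURRENCY_PEAK")


def parse_properties(text):
    """Parse KEY=VALUE lines, skipping blank lines and lines starting with '#'."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {raw!r}")
        props[key.strip()] = value.strip()
    return props


def missing_mandatory(props):
    """Return the mandatory property names that are absent from props."""
    return [key for key in MANDATORY if key not in props]


sample = """\
# listener settings
LISTENER_PORT=5252
PROTO=http
ENDPOINTS=https://api.server0.com:8000
#CONCURRENCY_PEAK=2048
"""
props = parse_properties(sample)
print(missing_mandatory(props))  # the commented-out property is reported as missing
```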
The group of upstream services (one or many) can be deployed on a set of servers/ports. They are added as a comma-separated list to ENDPOINTS. Make sure every endpoint includes a scheme; the port is optional. If no port is provided, ServiceQ assumes that http and https endpoints run on ports 80 and 443 respectively. Although it is not recommended, the endpoints list can mix services running on the http and https schemes.
ENDPOINTS=https://api.server2.com,https://api.server0.com:8000,https://api.server1.com:8001
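To illustrate the default-port rule, here is a small sketch that splits an ENDPOINTS value and fills in port 80 or 443 when none is given. It uses Python's standard `urllib.parse`; the function name `parse_endpoints` is hypothetical, and rejecting endpoints without a scheme is an assumption about how a strict validator might behave.

```python
from urllib.parse import urlsplit

# Default ports named in the text: 80 for http, 443 for https.
DEFAULT_PORTS = {"http": 80, "https": 443}


def parse_endpoints(value):
    """Turn a comma-separated ENDPOINTS value into (scheme, host, port) tuples."""
    endpoints = []
    for item in value.split(","):
        parts = urlsplit(item.strip())
        if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
            raise ValueError(f"endpoint needs an http or https scheme: {item!r}")
        port = parts.port or DEFAULT_PORTS[parts.scheme]
        endpoints.append((parts.scheme, parts.hostname, port))
    return endpoints


for endpoint in parse_endpoints(
    "https://api.server2.com,https://api.server0.com:8000,https://api.server1.com:8001"
):
    print(endpoint)
```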
ServiceQ handles and distributes concurrent requests well and is limited only by the load-handling capabilities of the upstream services. It is therefore encouraged to load test the services and find the maximum load the cluster can take. Determining this at the cluster level is important because the bottleneck is usually a central database, a message queue, or a throttled third-party service that all services access. Both the active connections and the buffered queue are governed by this limit.
CONCURRENCY_PEAK=2048
If the concurrency peak is reached, ServiceQ responds with 429 Too Many Requests. If a blockage leaves ServiceQ unable to accept new connections, they remain queued in the Linux receive buffer, and clients may experience slowness. It is therefore advisable to set client timeouts for such scenarios.
If all n nodes are down, requests are queued and ServiceQ responds to the client with 503 Service Unavailable. Queued requests are forwarded in FIFO order as soon as any node becomes available. Unless asked to, the system places no restriction on the kind of requests that can be queued and forwarded, so it is important to note the implications. Deferred requests are desirable when the HTTP method changes the state of the system and the client's workflow does not depend on the response. A fire-and-forget PUT request sent while all services were down will still update the destination system, albeit at a later point in time. On the other hand, if the client has exited after firing a GET request and ServiceQ fetches the response on next availability, the result of the GET is lost and is pure overhead to the system. This should be avoided. The user therefore controls whether queueing is enabled and which kinds of requests are considered for queueing (for example, buffering only POST/PUT/PATCH/DELETE on specific routes).
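The limit can be pictured as a fixed number of slots: a request takes a slot while it is in flight and is rejected with 429 when none is free. The sketch below models only the active-connection side with a standard `threading.BoundedSemaphore`; the class `ConcurrencyGate` and the function `handle` are hypothetical names, and this is an illustration of the idea rather than ServiceQ's implementation.

```python
import threading


class ConcurrencyGate:
    """Hypothetical slot counter sized by CONCURRENCY_PEAK."""

    def __init__(self, peak):
        self._slots = threading.BoundedSemaphore(peak)

    def try_enter(self):
        # Non-blocking: returns False immediately when every slot is taken.
        return self._slots.acquire(blocking=False)

    def leave(self):
        self._slots.release()


def handle(gate, forward):
    """Run forward() if a slot is free, otherwise answer 429."""
    if not gate.try_enter():
        return 429, '{"sq_msg": "Request Discarded"}'
    try:
        return forward()
    finally:
        gate.leave()


gate = ConcurrencyGate(peak=1)
print(handle(gate, lambda: (200, "ok")))
```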
Enable/Disable deferred queue
ENABLE_DEFERRED_Q=true
Format of the requests for which the deferred queue is enabled. The first token must always be the method, followed by an optional route and an optional ! suffix, which disables queueing for that single method+route combination. This option only works if ENABLE_DEFERRED_Q=true. A few examples:
DEFERRED_Q_REQUEST_FORMATS=ALL (buffer all)
DEFERRED_Q_REQUEST_FORMATS=POST,PUT,PATCH,DELETE (buffer these methods)
DEFERRED_Q_REQUEST_FORMATS=POST /orders !,POST,PUT /orders,PATCH,DELETE (buffer POST except /orders, block PUT except PUT /orders)
DEFERRED_Q_REQUEST_FORMATS=POST /orders,PATCH (buffer POST /orders, PATCH)
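One way to read the examples above is as a list of rules, where a rule with ! vetoes a specific method+route and every other rule allows a method, optionally restricted to one route. The sketch below encodes that reading. It is an interpretation inferred from the examples, not ServiceQ's actual matcher: the names `parse_formats` and `is_deferrable` are hypothetical, and exact route matching (rather than prefix matching) is an assumption.

```python
def parse_formats(value):
    """Parse DEFERRED_Q_REQUEST_FORMATS into (method, route, negated) rules."""
    rules = []
    for token in value.split(","):
        token = token.strip()
        negated = token.endswith("!")
        if negated:
            token = token[:-1].strip()
        parts = token.split(None, 1)
        if not parts:
            raise ValueError("every entry must start with a method")
        method = parts[0].upper()
        route = parts[1].strip() if len(parts) > 1 else None
        rules.append((method, route, negated))
    return rules


def is_deferrable(rules, method, route):
    """Decide whether a request may be buffered under the given rules."""

    def matches(rule):
        rule_method, rule_route, _ = rule
        # A rule without a route applies to every route (assumed exact match otherwise).
        return rule_method in ("ALL", method.upper()) and rule_route in (None, route)

    # A matching '!' rule always wins over a matching allow rule.
    if any(matches(rule) for rule in rules if rule[2]):
        return False
    return any(matches(rule) for rule in rules if not rule[2])


rules = parse_formats("POST /orders !,POST,PUT /orders,PATCH,DELETE")
for method, route in [("POST", "/orders"), ("POST", "/users"), ("PUT", "/users")]:
    print(method, route, is_deferrable(rules, method, route))
```

Under this reading, POST /orders is not buffered, POST /users is, and PUT is buffered only for /orders, which agrees with the third example.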
ServiceQ routes requests using a combination of randomized and round-robin selection. The algorithm first selects a random node and tries to forward the request. If this attempt fails, the rest of the nodes are selected in a round-robin manner for a total of 2*n+1 retries, after which an error is raised and the request is queued, if eligible. This, in my opinion, gives a fair chance to all nodes and distributes load within 10% deviation (if 'number of servers' is the only metric).
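The selection order described above can be sketched as follows: pick a random starting node, then walk the node list in round-robin order until a forward succeeds or the retry budget runs out. This is an illustrative sketch, not ServiceQ's code. The name `forward_with_retries` and the `try_forward` callback are hypothetical, and the sketch assumes that 2*n+1 counts the retries made after the first attempt.

```python
import random


def forward_with_retries(nodes, try_forward, rng=random):
    """Return the node that accepted the request, or None if all attempts fail.

    try_forward(node) is a caller-supplied function returning True on success.
    """
    n = len(nodes)
    if n == 0:
        raise ValueError("no nodes configured")
    start = rng.randrange(n)       # random first choice
    attempts = 1 + (2 * n + 1)     # first attempt plus 2*n+1 retries (assumed reading)
    for k in range(attempts):
        node = nodes[(start + k) % n]  # continue round robin from the start
        if try_forward(node):
            return node
    return None  # caller would raise an error and queue the request if eligible


up = {"b"}
print(forward_with_retries(["a", "b", "c"], lambda node: node in up))  # prints b
```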
n-node cluster healthy, all nodes up
Active connections are forwarded to one of the nodes in the cluster. The node is chosen by the routing algorithm. The maximum number of active connections is governed by the CONCURRENCY_PEAK setting.
n-node cluster unhealthy, [1:n-1] nodes down
The process is the same as above, except that the error rate increases. The error rate is stored in a hashset against the service address.
n-node cluster unhealthy, n nodes down
The error rate shoots to 100%, and requests are buffered in a FIFO queue. Buffered requests remain in the queue until any one service becomes available again. If active connections are being accepted at the same time, they are forwarded concurrently with the buffered requests. There is no precedence logic here.
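The buffering behaviour in the last scenario can be pictured with a bounded FIFO queue that is drained once a node responds again. The sketch below uses `collections.deque`; the class `DeferredQueue` and its methods are hypothetical, and refusing new entries when the queue reaches its limit is an assumption about what "governed by CONCURRENCY_PEAK" means for the buffer.

```python
from collections import deque


class DeferredQueue:
    """Hypothetical bounded FIFO buffer for requests that could not be forwarded."""

    def __init__(self, limit):
        self._limit = limit
        self._items = deque()

    def __len__(self):
        return len(self._items)

    def offer(self, request):
        """Buffer a request; return False when the queue is full (assumed policy)."""
        if len(self._items) >= self._limit:
            return False
        self._items.append(request)
        return True

    def drain(self, try_forward):
        """Forward buffered requests oldest first, stopping at the first failure."""
        sent = []
        while self._items:
            if not try_forward(self._items[0]):
                break  # upstream is still down; keep the rest in order
            sent.append(self._items.popleft())
        return sent


queue = DeferredQueue(limit=2)
queue.offer("PUT /orders/1")
queue.offer("PUT /orders/2")
print(queue.drain(lambda request: True))
```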
ServiceQ can add custom headers to client responses. These are added as a pipe-separated list to CUSTOM_RESPONSE_HEADERS. It is recommended to test the headers thoroughly before adding them, as a few of them can adversely affect the client.
CUSTOM_RESPONSE_HEADERS=Strict-Transport-Security:max-age=31536000|Server:sq v1.0
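As a quick check before deploying a header list, the value can be split into name/value pairs. The sketch below assumes that entries are separated by | and that the first : in each entry separates the header name from its value; the function name `parse_custom_headers` is hypothetical.

```python
def parse_custom_headers(value):
    """Split a CUSTOM_RESPONSE_HEADERS value into (name, value) pairs."""
    headers = []
    for item in value.split("|"):
        # Only the first ':' separates name and value; later ones stay in the value.
        name, sep, header_value = item.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"expected Name:value, got {item!r}")
        headers.append((name.strip(), header_value.strip()))
    return headers


print(parse_custom_headers("Strict-Transport-Security:max-age=31536000|Server:sq v1.0"))
```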
If the upstream and the client are both alive, ServiceQ simply tunnels the response from the upstream to the client. In case of failures, relevant responses are provided to help the client recover. For example:
Upstream Connected - Tunneled Response
Concurrent Conn Limit Exceed - 429 Too Many Requests ({"sq_msg": "Request Discarded"})
All Nodes Are Down - 503 Service Unavailable ({"sq_msg": "Request Buffered"})
Request Timed Out - 504 Gateway Timeout
Request Malformed - 400 Bad Request
Undeterministic Error - 502 Bad Gateway
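A client can use these responses to decide what to do next. The sketch below shows one possible client-side policy built on the status codes and sq_msg bodies listed above. The function names and the returned action labels are hypothetical, and the policy itself (for example, retrying on 502 or 504) is an assumption that only suits requests that are safe to repeat.

```python
import json


def sq_message(body):
    """Return the sq_msg field of a JSON body, or None if there is none."""
    try:
        data = json.loads(body)
    except (TypeError, ValueError):
        return None
    return data.get("sq_msg") if isinstance(data, dict) else None


def client_action(status, body=""):
    """Map a response to a hypothetical client-side action label."""
    message = sq_message(body)
    if status == 503 and message == "Request Buffered":
        return "wait"          # the request is buffered; resending would duplicate it
    if status == 429:
        return "back_off"      # concurrency peak reached; the request was discarded
    if status in (502, 504):
        return "retry"         # assumed safe only for repeatable requests
    if status == 400:
        return "fix_request"   # malformed request; retrying unchanged will not help
    return "pass_through"      # tunneled upstream response


print(client_action(503, '{"sq_msg": "Request Buffered"}'))  # prints wait
```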
ServiceQ detects and logs three types of errors: ServiceQ Flooded (Error Code 601), Service Unavailability (Error Code 701), and HTTP Request (Error Code 702). Error Code 702 covers upstream timeouts, malformed requests, and unexpected connection loss. Errors are logged to /opt/serviceq/logs/serviceq_error.log and follow this format:
ServiceQ: 2018/04/18 14:10:28 Error detected on https://api.server0.org:8001 [Code: 601, SERVICEQ_FLOODED]
ServiceQ: 2018/04/18 14:10:28 Error detected on https://api.server0.org:8001 [Code: 701, UPSTREAM_DOWN]
ServiceQ: 2018/04/18 14:11:12 Error detected on https://api.server1.org:8002 [Code: 702, UPSTREAM_TIMED_OUT]
ServiceQ: 2018/04/18 14:13:33 Error detected on https://api.server1.org:8002 [Code: 702, UPSTREAM_TIMED_OUT]
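Log lines in this format are regular enough to parse with a single regular expression, which can be handy for counting errors per endpoint. The pattern below is derived only from the sample lines above, so it is an assumption that every log entry follows the same layout; the name `parse_error_line` is hypothetical.

```python
import re
from datetime import datetime

# Pattern inferred from the sample log lines; other layouts will not match.
LOG_LINE = re.compile(
    r"^ServiceQ: (?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"Error detected on (?P<endpoint>\S+) "
    r"\[Code: (?P<code>\d+), (?P<name>\w+)\]$"
)


def parse_error_line(line):
    """Return a dict describing one error log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "time": datetime.strptime(match["ts"], "%Y/%m/%d %H:%M:%S"),
        "endpoint": match["endpoint"],
        "code": int(match["code"]),
        "name": match["name"],
    }


entry = parse_error_line(
    "ServiceQ: 2018/04/18 14:11:12 Error detected on "
    "https://api.server1.org:8002 [Code: 702, UPSTREAM_TIMED_OUT]"
)
print(entry["code"], entry["name"])  # prints 702 UPSTREAM_TIMED_OUT
```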
Data errors detected by the upstream services are tunneled directly to the client without being logged.
Send questions, suggestions, and comments to [email protected]