In my day job I work with a lot of services that sit behind a load balancer providing high availability across multiple backend hosts.
Inevitably there are times when these services or their hosting servers require maintenance. Sometimes we also want to investigate and troubleshoot a running service without it taking any live traffic, and it simply isn’t practical to ask the network team to gracefully disable interfaces each time. The usual approach is a server-side firewall rule that blocks inbound access to the port, but this can be clumsy and can actually impact live traffic, albeit briefly.
One solution I like is to use an intermediate monitor (or watchdog) service that provides a healthcheck URL for the load balancer, say, http://192.168.1.100/my-service/healthcheck, where the returned status is derived from the application ports being monitored.
Now, we want this to be as lightweight as possible, so we can choose something like Python’s Flask with uWSGI (or Ruby’s Sinatra) to provide a simple service listener like this:
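```python
#!/usr/bin/env python
import os

import requests
from flask import Flask, Response

app = Flask(__name__)

# Backend nodes whose /isalive endpoints decide the overall health
NODES = ("inbound", "outbound", "stats")


@app.route("/my-service/healthcheck", methods=["GET"])
def heartbeat():
    # Start from a healthy response and downgrade it on the first failure
    resp = Response(response="OK", status=200, content_type="text/plain")

    # Now check for the node statuses
    for node in NODES:
        # The 5 second timeout is an arbitrary choice so a hung node cannot stall the check
        req = requests.get("http://localhost:7070/" + node + "/isalive", timeout=5)
        if req.status_code != 200:
            # Pass the failing node's status code on to the load balancer.
            # Assigning free text such as "FAILED" to resp.status has no lasting
            # effect, because setting status_code rewrites the status line.
            resp.status_code = req.status_code
            break

    return resp


if __name__ == "__main__":
    app.run()
```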
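One case the listener does not cover is a backend that is down altogether: the HTTP call then raises an exception instead of returning a status code. A minimal sketch of a probe that folds such errors into a 503 could look like the following. It uses the standard library's `urllib` rather than `requests` so that it runs without extra packages, and the helper name `probe`, the 5 second timeout and the 503 fallback are my own assumptions, not part of the service above.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return the HTTP status code for url, or 503 if it cannot be reached.

    Hypothetical helper: the name, the timeout and the 503 fallback are
    illustrative choices, not part of the original listener.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as reply:
            return reply.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout and similar
        return 503
```

Inside the polling loop, the status code returned by such a helper could be compared against 200 in the same way as `req.status_code`.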
And obviously I have skipped the setup with pip, virtualenv and the like, but that’s routine enough.
The beauty of this kind of approach is that, with a few extra lines before the service port polling, we can spoof an outage. The load balancer marks the node as out of action but is still able to complete any existing client connections, which the firewall approach would cut off:
```python
    # Check for the maintenance file and signal graceful failure.
    # Remove the maintenance file when the work is complete.
    if os.path.isfile("maintenance"):
        resp.status_code = 503
```
Now, by simply touching a file called maintenance in the directory the application is run from, the next poll from the load balancer will register the failure. We can test this with cURL:
```
$ curl -v http://localhost:7070/my-service/healthcheck
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 7070 (#0)
> GET /my-service/healthcheck HTTP/1.1
> Host: localhost:7070
> User-Agent: curl/7.52.1
> Accept: */*
>
< HTTP/1.1 503 SERVICE UNAVAILABLE
< Content-Type: text/plain
< Content-Length: 2
```
Remove the file and the traffic will flow again. That gives us remote control of the load balancer without stopping any services; it persists across reboots and allows us time and space to investigate as we please.
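Because the maintenance check uses a relative path, the file has to be created in whatever directory the process was started from, which is easy to get wrong under a process manager. As a rough sketch, the same decision can be written as a small function that takes an explicit path. The names `overall_status`, `maintenance_file` and `node_codes` are hypothetical, and unlike the snippet above this version lets the maintenance flag take priority over a failing node, which is my assumption about the more useful behaviour.

```python
import os
import tempfile


def overall_status(maintenance_file, node_codes):
    """Derive the healthcheck status code.

    maintenance_file is the path of the flag file and node_codes holds the
    status codes already collected from the backend nodes. Both names are
    hypothetical and only used for illustration.
    """
    # The maintenance flag wins, so the node is drained even when healthy
    if os.path.isfile(maintenance_file):
        return 503
    # Otherwise report the first non-200 code, or 200 if all nodes are fine
    for code in node_codes:
        if code != 200:
            return code
    return 200


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        flag = os.path.join(workdir, "maintenance")
        print(overall_status(flag, [200, 200, 200]))  # no flag file yet
        open(flag, "w").close()                       # same effect as `touch`
        print(overall_status(flag, [200, 200, 200]))  # flag file present
        os.remove(flag)
        print(overall_status(flag, [200, 500, 200]))  # one node is failing
```

Passing an absolute path for the flag file keeps the toggle independent of the working directory the service happens to start in.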