AWS load balancing deployment with zero downtime

Rulin.Tang
posted on 2023-05-17 10:37
Cloud compute
Yanlin

At present, all of our application services are deployed as containers on AWS EKS (Kubernetes), with an AWS Application Load Balancer as the public entry point for traffic. At the start of every deployment the service goes down for a short time, with outages ranging from a few seconds to a few minutes.

We ran a series of tests and verified the behavior with scripts.

Here is one of our deployment scenarios:

  • Service jiameng-api-dev running on AWS EKS
  • The service runs two Pods
  • AWS Application Load Balancer used as the Kubernetes ingress (a minimal ingress sketch follows this list)
  • kubectl apply is used to force the Pods to be rebuilt when the image changes in AWS ECR
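For context, the ingress in this scenario looks roughly like the sketch below. The service name and port are placeholders, not our exact manifest, and target-type: ip is assumed because pod readiness gates apply to IP targets:

```yaml
# Minimal ALB ingress sketch (placeholder names; assumes the AWS Load Balancer
# Controller is installed and provides the "alb" IngressClass)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: jiameng-api-dev
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip   # readiness gates work with IP targets
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: jiameng-api-dev   # placeholder service name
                port:
                  number: 80            # placeholder service port
```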

To achieve zero-downtime deployments, we enabled the ALB pod_readiness_gate.
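With the AWS Load Balancer Controller (v2.x), readiness gate injection is enabled per namespace by labeling the namespace. A minimal sketch, where the namespace name is a placeholder for ours:

```bash
# Enable ALB pod readiness gate injection for the namespace running the service
# (assumes AWS Load Balancer Controller v2.x; replace the namespace name as needed)
kubectl label namespace jiameng-dev elbv2.k8s.aws/pod-readiness-gate-inject=enabled
```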

When the ALB pod_readiness_gate is not enabled, the target group may momentarily contain no healthy targets at all: every target can be in a draining or initial state, with none healthy.

When the ALB pod_readiness_gate is enabled, the target group is guaranteed to always have at least one healthy target. However, this only reduces the chance of 5xx errors; it does not eliminate them completely.

Let's write a script to test it; see aws_alb_test.sh. It uses describe-target-health to get the target health status before sending a request to the Go application.
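The script is roughly the sketch below; the target group ARN and endpoint URL are placeholders, and the real script formats its output slightly differently:

```bash
#!/usr/bin/env bash
# Rough sketch of aws_alb_test.sh: print target health, then hit the ALB and log the HTTP status.
# TARGET_GROUP_ARN and ENDPOINT are placeholders for our real values.
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:xxxxxxxx:targetgroup/xxxxxxxx"
ENDPOINT="https://jiameng-api-dev.example.com/health"

while true; do
  # Current health state of every target in the target group
  aws elbv2 describe-target-health \
    --target-group-arn "$TARGET_GROUP_ARN" \
    --query 'TargetHealthDescriptions[*].[Target.Id,Target.Port,TargetHealth.State]' \
    --output text

  # Send a request to the Go application through the load balancer and record the status code
  code=$(curl -s -o /dev/null -w '%{http_code}' "$ENDPOINT")
  echo "$(date '+%H:%M:%S') HTTP $code"

  sleep 1
done
```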

We also enabled ALB access logs to track which targets served our test requests.
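Access logging is a load balancer attribute; one way to switch it on from the CLI looks like the sketch below (the load balancer ARN and bucket name are placeholders):

```bash
# Turn on ALB access logs to an S3 bucket (ARN and bucket name are placeholders)
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn "arn:aws:elasticloadbalancing:xxxxxxxx:loadbalancer/app/xxxxxxxx" \
  --attributes Key=access_logs.s3.enabled,Value=true Key=access_logs.s3.bucket,Value=my-alb-logs-bucket
```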

Here are the test steps:

  • Run the shell script ./aws_alb_test.sh in a terminal to send requests to the load balancer
  • In k8s_deployment.yaml, change the date value date: "<DATE>" (or make another test change) to force a pod rebuild on the next kubectl apply (see the fragment after this list)
  • Run kubectl apply -f k8s_deployment.yaml in a separate terminal to rebuild the pod
  • Watch the shell script terminal output and look for 5xx errors
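The date trick in the second step is just a pod template annotation: changing its value changes the pod template, so the next kubectl apply rolls the Pods even when the image tag is unchanged. A minimal fragment of k8s_deployment.yaml, where the image and labels are placeholders and not our exact manifest:

```yaml
# Bumping the date annotation alters the pod template,
# so the next kubectl apply triggers a rolling replacement of the Pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jiameng-api-dev
spec:
  replicas: 2
  selector:
    matchLabels:
      app: jiameng-api-dev
  template:
    metadata:
      labels:
        app: jiameng-api-dev
      annotations:
        date: "<DATE>"   # change this (e.g. to the current timestamp) before kubectl apply
    spec:
      containers:
        - name: jiameng-api-dev
          image: xxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/jiameng-api-dev:latest  # placeholder ECR image
```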

5xx error result

Ideally, with the ALB pod_readiness_gate enabled, we would expect zero downtime. In reality, we can still observe 5xx errors in the shell script terminal.

The example above shows two draining targets and two healthy targets at the same time. It looks like one of the two draining targets is still receiving traffic; otherwise we would get 200 responses from the two healthy targets.

To confirm, we looked up the load balancer access log records for the requests above. Sensitive data has been replaced with xxxxxxxx.

We can see that the load balancer routed requests to the draining target 10.0.2.61:1325. The target_status_code is -, and the elb_status_code is 504. This means the connection between the load balancer and the target 10.0.2.61 was closed, and the load balancer then returned 504 to the client. An explanation of these fields can be found in the ALB access log documentation. The closed connection makes sense because the pod jiameng-api-dev-7cf849584f-kh7vk behind target 10.0.2.61 is in the Terminating state, and the pod can exit very quickly.

5xx error reason

The load balancer routed traffic to the draining target, but the connection between the load balancer and that target was closed because the target's pod had already terminated.

5xx error solution

To fix this, we want the draining target to remain available. In other words, while a target is in the draining state, its associated pod should not exit. That way, the connection between the load balancer and the draining target is not closed unless it stays idle long enough to hit the idle timeout.

In the end, we found that tuning the following three parameter values does the trick.

In summary, we want a Pod to live longer than its associated target, which means ordering the three values as terminationGracePeriodSeconds > preStop sleep > deregistration delay.
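The sketch below shows where each of the three values lives; the concrete numbers (50/40/30 seconds) are only illustrative, chosen to satisfy the ordering above, and the image is a placeholder:

```yaml
# Illustrative sketch only: the numbers just satisfy the rule
# terminationGracePeriodSeconds > preStop sleep > deregistration delay.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jiameng-api-dev
spec:
  replicas: 2
  selector:
    matchLabels:
      app: jiameng-api-dev
  template:
    metadata:
      labels:
        app: jiameng-api-dev
    spec:
      terminationGracePeriodSeconds: 50      # longest of the three: the Pod outlives the draining target
      containers:
        - name: jiameng-api-dev
          image: xxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/jiameng-api-dev:latest  # placeholder
          lifecycle:
            preStop:
              exec:
                # keep serving in-flight requests while the target drains
                command: ["sleep", "40"]

# On the ingress side, the target group deregistration delay can be shortened
# with the AWS Load Balancer Controller annotation:
#   alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
```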

With this in place, we achieved zero-downtime deployments. For more details, see Yanlin's write-up on GitHub: https://github.com/yanlin-group/cicd.