Traps on the way of Blue-Green deployments

Docker (as well as Kubernetes) lets you update your applications with no downtime through a strategy called blue-green deployment. This post looks at a common trap (and its solution) when building PHP-based applications and deploying them this way.

Blue-Green deployments work in this way:

  1. Your currently deployed application ("Green") is serving the incoming traffic.
  2. A new version of your application is deployed ("Blue") and tested, but is not yet receiving any traffic.
  3. When "Blue" is ready, we start sending incoming traffic to "Blue" as well.
  4. At this point two copies of the application run in parallel ("Green" and "Blue").
  5. We then stop sending incoming traffic to "Green", so "Blue" handles all of it.
  6. Since "Green" no longer receives any traffic, it can be safely removed.
  7. "Blue" is relabeled "Green", so that a future version can be deployed using the same strategy.

Docker Swarm blue-green deployment

If you are using Docker Swarm, a stack file like the following implements the blue-green deployment strategy.

# app.yml

version: '3.4'
services:
  app:
    image: acme/todo-list:${VERSION}
    deploy:
      update_config:
        order: start-first

acme/todo-list is a simple todo-list web application. The update_config.order: start-first setting instructs Docker Swarm to start the new task before stopping the old one, which gives us the blue-green deployment strategy.

To deploy v1 of our todo-list to a Docker Swarm cluster we can run:

VERSION=v1 docker stack deploy todolist_app -c app.yml

If we want to update the app to v2 we can run:

VERSION=v2 docker stack deploy todolist_app -c app.yml

The update will follow the blue-green deploy strategy as described before. Docker will keep v1 running and will deploy v2. When v2 is ready, it will redirect all the traffic to v2 and will remove v1. Neat!

But if we look at the logs of our load balancer we can see something like:

1.2.3.4 - - [13/Jun/2019:06:00:10 +0000] "GET /toto/list HTTP/1.1" 502 150 ....

The status code is 502, "Bad Gateway". This HTTP status code means that the load balancer received an invalid response (or no response at all) from the application server. Why is that?
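To see how many requests were affected during a deployment, the access log can be filtered by status code. The following is a minimal sketch, assuming the log uses the common/combined format shown above, where the status is the 9th whitespace-separated field. The function name count_status is hypothetical, and the second log line is made up for the demo.

```shell
# Count the requests with a given HTTP status in an access log read from stdin.
# Assumption: common/combined log format, status code in the 9th field.
count_status() {
    # $1: the status code to count, e.g. 502
    awk -v code="$1" '$9 == code { n++ } END { print n + 0 }'
}

# Demo with two sample lines (the second one is invented for illustration).
printf '%s\n' \
  '1.2.3.4 - - [13/Jun/2019:06:00:10 +0000] "GET /toto/list HTTP/1.1" 502 150' \
  '1.2.3.4 - - [13/Jun/2019:06:00:11 +0000] "GET /toto/list HTTP/1.1" 200 512' \
  | count_status 502
```

Against a real log file, the call would look like count_status 502 < access.log, with the path depending on your load balancer.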

To remove v1 once it has stopped receiving incoming traffic, Docker sends the SIGTERM signal to the app and waits up to 10 seconds for it to terminate gracefully. If v1 is still running after 10 seconds, Docker forcibly kills it with SIGKILL. This terminates any connection the app still had open, and the pending requests receive a 502 error.
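The "SIGTERM, wait, then SIGKILL" sequence can be reproduced outside Docker with plain shell. This is only a sketch of the idea, not Docker's actual implementation: the function name stop_with_grace is hypothetical, and the worker is a stand-in for an app that finishes its current one-second "request" before exiting.

```shell
# Send SIGTERM, then SIGKILL if the process is still alive after the
# grace period. Prints "graceful" or "killed" depending on what happened.
stop_with_grace() {
    pid=$1      # process to stop
    grace=$2    # grace period in whole seconds
    kill -TERM "$pid" 2>/dev/null || return 0   # process is already gone
    waited=0
    while kill -0 "$pid" 2>/dev/null; do        # still running?
        if [ "$waited" -ge "$grace" ]; then
            kill -KILL "$pid" 2>/dev/null
            echo "killed"
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "graceful"
}

# Demo worker: on SIGTERM it completes the current "sleep 1", then exits.
sh -c 'trap "exit 0" TERM; while :; do sleep 1; done' &
worker=$!
sleep 1                        # give the worker time to install its trap
stop_with_grace "$worker" 10
```

With a 10-second grace period the demo worker is expected to exit on its own and the function to report "graceful". A worker that ignores SIGTERM (trap "" TERM) would instead be killed once the grace period runs out, which is the situation that produces the 502 errors.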

Graceful stop

We can change the number of seconds Docker waits before killing the container by tuning the stop_grace_period parameter:

# app.yml

version: '3.4'
services:
  app:
    image: acme/todo-list:${VERSION}
    stop_grace_period: 120s
    deploy:
      update_config:
        order: start-first

With this configuration, after sending the SIGTERM signal, Docker will wait up to two minutes before killing the app. Depending on the logic and response times of your application, you can tell Docker how long to wait for it to terminate before forcing it.

Blue-green deployments and PHP-FPM

If you are using PHP-FPM, the previous configurations might not be enough.

Unfortunately (?) PHP-FPM is configured by default to terminate immediately after receiving the SIGTERM signal. Even if Docker is prepared to wait 10 seconds (or whatever value you configured with stop_grace_period), PHP-FPM will terminate itself, along with all the requests being served, without waiting.

This will lead again to 502 errors.

To solve this we also have to instruct PHP-FPM to give itself enough time to finish serving the pending requests, by tuning the process_control_timeout parameter (see the PHP-FPM configuration documentation for the full list of settings).

By setting process_control_timeout = 5, PHP-FPM will wait up to 5 seconds before exiting and killing all the processes that were serving requests.

We can add this parameter in the Dockerfile when building our PHP image.

# Dockerfile
FROM php:fpm

# ... 

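# Written to a drop-in file (the name zz-graceful.conf is arbitrary) so that
# the main php-fpm.conf of the official image, which includes
# php-fpm.d/*.conf and defines the pools, is not overwritten.
RUN { \
    echo '[global]'; \
    echo 'process_control_timeout = 5'; \
    } | tee /usr/local/etc/php-fpm.d/zz-graceful.conf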

# ... 

Just as Docker waits for a container to finish its job, PHP-FPM will now wait up to 5 seconds for its child processes to finish serving their requests.

In this way we have configured both how long Docker waits before killing the container and how long PHP-FPM waits for pending requests to complete.

If PHP-FPM can stop in less than 5 seconds, it will (for instance when all the pending requests are served quickly). The same applies to Docker. These timeouts only come into play in the worst case.
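The two timeouts are nested: Docker's grace period is the outer limit, so PHP-FPM's wait only helps if it is shorter than that. This is an inference from the behavior described above rather than something either tool enforces. A small sanity check could look like this, where the function name and its arguments are hypothetical:

```shell
# Succeeds when PHP-FPM's timeout fits strictly inside Docker's grace period.
# Both values are whole seconds. If they were equal, Docker could send
# SIGKILL at the same moment PHP-FPM gives up waiting.
timeouts_are_consistent() {
    fpm_timeout=$1   # process_control_timeout
    grace=$2         # stop_grace_period (Docker's default is 10)
    [ "$fpm_timeout" -lt "$grace" ]
}

# The values used in this post: 5 seconds for PHP-FPM, 10 or 120 for Docker.
timeouts_are_consistent 5 10  && echo "5s fits within the default 10s grace period"
timeouts_are_consistent 5 120 && echo "5s fits within a 120s grace period"
```

If process_control_timeout were raised to, say, 30 seconds while stop_grace_period stayed at its default, the check would fail: Docker would kill the container long before PHP-FPM finished waiting.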

php, docker, devops, swarm
