Tech Updates for Q2/2020


Lots of things happened on the technical side recently. Here is a list of new features and improvements we worked on over the last couple of months!


  • automated creation of stack documentation, inspired by helm-docs: cloudbender create-docs

  • stack templates can now define hooks for [pre_create, pre_update, post_create, post_update].
    These hooks are executed accordingly before or after stack operations.
    Currently implemented hooks:

    • cmd: Allows arbitrary commands via a sub-process
    • export_outputs_kubezero: writes the outputs of Kubernetes stacks into a format to be included by KubeZero
  • Stack outputs are now written to a YAML file under outputs, if enabled via options.StoreOutputs. This allows easy lookup and tracking without having to check the AWS Console.
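Taken together, a stack's hook and output settings might look roughly like this. This is only a sketch based on the names mentioned above; the exact CloudBender configuration schema may differ:

```yaml
# Hypothetical CloudBender stack config sketch -- key layout is an assumption
options:
  StoreOutputs: True        # write stack outputs to a YAML file under outputs
hooks:
  post_create:
    - cmd: "echo 'stack created'"   # arbitrary command via sub-process
  post_update:
    - export_outputs_kubezero       # write Kubernetes stack outputs for KubeZero
```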


Most components are now installed and automatically maintained by ArgoCD!
With this we implement a GitOps workflow right from the start.

Installation changes

New Kubernetes clusters based on the latest CloudBender templates will perform a fully automated bootstrap. This includes control plane as well as any worker nodes.
Existing clusters can be migrated to the new ArgoCD based KubeZero distribution without re-installation!

For new clusters, apart from the core Kubernetes services like the API server, scheduler, etc., only Calico and CoreDNS are deployed automatically during bootstrap. Calico is installed via the KubeZero Helm chart kubezero-calico.

The first step after that is deploying the kubezero-argo-cd chart, which installs ArgoCD itself as well as the KubeZero app-of-apps that installs the various other components like cert-manager, kiam, etc.
All cluster specific configuration must be provided at this step via the values.yaml file. AWS specific values can easily be injected via the new CloudBender feature that creates a cluster specific YAML file.
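A values.yaml for this step might look roughly like the following. All keys shown are illustrative placeholders, not the chart's verified schema:

```yaml
# Illustrative values.yaml for the kubezero-argo-cd chart -- keys are assumptions
kubezero:
  global:
    clusterName: example-cluster
  # AWS specific values, e.g. injected via the CloudBender generated YAML file
  aws:
    region: eu-central-1
    vpcId: vpc-0123456789abcdef0
```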

We are still working on fully automating these steps, but are not 100% there yet.

Future updates and maintenance

The main reason to introduce ArgoCD has been to reduce the amount of time required to stay on top of changes and updates to be deployed.
Once more than one cluster needs to be maintained, most likely by more than one person, it becomes an increasing challenge to ensure config changes and updates are applied in a reproducible fashion across all clusters in a timely manner.

Another benefit is a convenient UI within each Kubernetes cluster, allowing quick status checks as well as initial debugging.

Thanks to ArgoCD, changes and upgrades can be automatically applied or rolled back via Git commits.

Zero Down Time maintains two branches of KubeZero.

  • master: Latest features suited for dev and test clusters
  • stable: Only tested features & changes suited for production clusters
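In ArgoCD terms, each cluster simply points its KubeZero Application at one of these branches. A minimal sketch of such an Application follows; the repository URL, project, and path are placeholders:

```yaml
# Sketch of an ArgoCD Application tracking the stable branch -- values are placeholders
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kubezero
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/kubezero.git
    targetRevision: stable   # or "master" for dev and test clusters
    path: charts/kubezero
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated: {}            # auto-sync on new commits; revert a commit to roll back
```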

Component changes


Existing clusters should upgrade from v1.16.X to v1.16.12.
The upgrade process is fully automated and is triggered by replacing the first controller node, followed by all other nodes.
There is no downtime or service interruption, except for single-controller clusters during the replacement of that node.


  • the network backend changes from Flannel to Calico VXLAN.
  • the MTU increases from 1450 to 8941 to improve network performance on AWS clusters

This change requires a cluster wide migration, which must be scheduled for each cluster.
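As a sketch, the relevant settings in a kubezero-calico values file could look like this; the key names are assumptions, not the chart's verified schema:

```yaml
# Hypothetical kubezero-calico values -- key names are illustrative
network: vxlan
mtu: 8941   # up from 1450; leaves headroom for VXLAN overhead on AWS jumbo frames
```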

Remaining components

None of the updates below require any specific changes or downtime:

  • cert-manager upgraded to 0.15
  • kiam upgraded to 3.6-rc1
  • Istio to 1.4.10
  • minor tweaks and changes for Kube-Prometheus-Grafana and ElasticSearch/Fluent components

CloudFormation template library

(customers only)


Automated Version Upgrades

The Kubernetes version is controlled via the CloudFormation parameter KubernetesVersion, which allows pinning an exact version or tracking, e.g., the latest 1.16.X release.
Version upgrades only happen when controller node 00 boots up.

Any other controller node as well as all worker nodes will always install the exact same version the cluster is currently running.
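As an illustration, the parameter could be declared roughly like this in a template; the default and description are assumptions, not the actual template contents:

```yaml
# Sketch of the KubernetesVersion CloudFormation parameter -- values are illustrative
Parameters:
  KubernetesVersion:
    Type: String
    Default: "1.16"   # a minor version tracks the latest 1.16.X release
    Description: Exact Kubernetes version (e.g. "1.16.12") or a minor version to track
```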

Instance shutdown hooks

  • worker nodes now try to drain and even delete themselves if they get terminated via e.g. the AWS Console

  • single controller nodes now take an emergency backup if terminated

Improved bootstrap / join flow

  • whole clusters can now be launched at the same time
    Controller and Worker CloudFormation stacks can be launched in one go

  • worker nodes will wait until the control plane becomes available even during cluster bootstrap


  • Core services updated to Ubuntu 20.04 LTS

  • Ability to execute scripts at terminate / reboot via a custom init service, e.g. sending messages to Slack for instance events like terminate

  • With the optional AutoRollingUpdate feature, instances for services like bastion, nat or vpngw update/replace themselves automatically during stack updates without interruption.
    Instances with the latest configuration are launched first, taking over the functionality before any previous instances are terminated.

  • Improved operational visibility via optional SNS-to-Slack messaging for EC2 instance events at boot, terminate, etc.
    This includes custom info depending on the instance as well as optional debugging information in case of errors during boot.
    Slack messages now leverage the new Slack APIs: improved layout, additional information, and deep links into the AWS Console for various events

  • Bastion hosts now support multiple keys for SSH access. The lookup is performed at login time, allowing instant access control across all AWS infrastructure. Allowed users must be members of a certain IAM group and must have uploaded their SSH keys. This also works across AWS accounts to support central user management.

  • new template to deploy CloudFront, an S3 bucket, an IAM content user, and Lambda@Edge functions to host static websites, e.g. generated by Hugo. This very site is deployed with it 😄

  • Further improvements to reduce the size of user-data, allowing more complex boot logic while staying under the AWS-imposed size limit