Lots of things have happened on the technical side recently. Here is a list of new features and improvements we have worked on over the last couple of months!
Automated creation of stack documentation, inspired by helm-docs.
Stack templates can now define hooks for [pre_create, pre_update, post_create, post_update].
These hooks are executed before or after the corresponding stack operation.
Currently implemented hooks:
- cmd: runs arbitrary commands via a sub-process
- export_outputs_kubezero: writes the outputs of Kubernetes stacks in a format that can be included by KubeZero
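As a sketch, a stack template could wire up these hooks roughly like this. The key layout below is an assumption for illustration only, not the exact CloudBender schema:

```yaml
# Hypothetical hooks section of a stack template; the exact
# schema and key names are assumptions, not verbatim CloudBender syntax.
hooks:
  post_create:
    - cmd: "echo 'stack created'"       # arbitrary command run via sub-process
  post_update:
    - export_outputs_kubezero: {}       # write stack outputs for KubeZero to include
```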
Stack outputs can now be written to a YAML file under outputs, allowing easy lookup and tracking without having to check the AWS Console anymore. This is enabled via a stack configuration option.
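For illustration, such an outputs file might look like the following. Both the keys and values are made up for the example; the actual content depends on each stack's CloudFormation outputs:

```yaml
# Example outputs file for a stack; keys mirror the stack's
# CloudFormation outputs (illustrative values only).
VpcId: vpc-0abc1234def567890
ControlPlaneEndpoint: https://k8s.example.com:6443
```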
New Kubernetes clusters based on the latest CloudBender templates perform a fully automated bootstrap, covering the control plane as well as any worker nodes.
Existing clusters can be migrated to the new ArgoCD-based KubeZero distribution without re-installation!
For new clusters, apart from core Kubernetes services like the API server and scheduler, only Calico and CoreDNS are deployed automatically during bootstrap. Calico is installed via the KubeZero Helm chart kubezero-calico.
The first step after that is deploying the kubezero-argo-cd chart, which installs ArgoCD itself as well as the KubeZero App-of-apps that in turn installs the various other components.
All cluster-specific configuration must be provided at this step via the values.yaml file. AWS-specific values can easily be injected via the new CloudBender feature that creates a cluster-specific YAML file.
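A cluster-specific values file for the kubezero-argo-cd chart might look roughly like this. The value names below are assumptions chosen for illustration, not the chart's actual schema:

```yaml
# Sketch of a cluster-specific values.yaml for kubezero-argo-cd;
# key names are illustrative assumptions, not the real chart values.
kubezero:
  clusterName: example-cluster     # per-cluster identifier
  aws:
    region: eu-central-1           # AWS-specific value, injectable via CloudBender
```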
We are still working on fully automating these steps, but are not 100% there yet.
Future updates and maintenance
The main reason for introducing ArgoCD was to reduce the time required to stay on top of changes and updates that need to be deployed.
Once more than one cluster needs to be maintained, most likely by more than one person, it becomes an increasing challenge to ensure config changes and updates are applied in a reproducible fashion across all clusters in a timely manner.
Another benefit is a handy UI within each Kubernetes cluster, allowing quick status checks as well as initial debugging.
Thanks to ArgoCD, changes and upgrades can be automatically installed or rolled back via git commits.
Zero Down Time maintains two branches of KubeZero.
- master: Latest features suited for dev and test clusters
- stable: Only tested features & changes suited for production clusters
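As a sketch, an Argo CD Application pinning a production cluster to the stable branch looks like this. The repository URL and chart path below are placeholders, not the actual KubeZero repo layout:

```yaml
# Argo CD Application tracking the stable branch of KubeZero.
# repoURL and path are placeholders for illustration.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kubezero
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/kubezero.git
    targetRevision: stable          # production clusters track the stable branch
    path: charts/kubezero
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true                   # remove resources deleted in git
      selfHeal: true                # revert manual drift back to git state
```

With this in place, switching a cluster between branches is a one-line change to targetRevision, and rollbacks are ordinary git reverts.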
Existing clusters should upgrade from
The upgrade process is fully automated: it is triggered by replacing the first controller node, followed by all other nodes.
There is no downtime or service interruption, except for single-controller clusters during the replacement of that node.
- The network backend changes from Flannel to Calico VXLAN.
- The MTU increases from 1450 to 8941 to improve network performance on AWS clusters.
These changes require a cluster-wide migration, which must be scheduled for each cluster.
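The new MTU can be expressed as a chart value for kubezero-calico. The key name below is an assumption for illustration, not the chart's verbatim schema:

```yaml
# Hypothetical kubezero-calico values snippet; the key name is illustrative.
network:
  backend: vxlan
  mtu: 8941    # up from 1450; leaves headroom for VXLAN overhead
               # within AWS's 9001-byte jumbo frames
```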
None of the updates below require any specific changes nor down time:
- cert-manager upgraded to 0.15
- kiam upgraded to 3.6-rc1
- Istio upgraded to 1.4.10
- minor tweaks and changes for the Kube-Prometheus-Grafana and ElasticSearch/Fluent components
CloudFormation template library
Automated Version Upgrades
The Kubernetes version is controlled via the CloudFormation parameter KubernetesVersion, which allows pinning an exact version or tracking, e.g., the latest 1.16.X release.
Version upgrades only happen when controller node 00 boots up.
All other controller nodes, as well as all worker nodes, will always install the exact version the cluster is currently running.
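In CloudFormation template terms, such a parameter could be declared roughly as follows. The default and description are illustrative, not copied from the actual template library:

```yaml
# Illustrative CloudFormation parameter declaration; default and
# description are assumptions, not the actual template's values.
Parameters:
  KubernetesVersion:
    Type: String
    Default: "1.16"
    Description: >-
      Kubernetes version to install, either pinned exactly (e.g. 1.16.9)
      or as a minor track (e.g. 1.16) resolving to the latest patch release.
```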
Instance shutdown hooks
Worker nodes now try to drain and even delete themselves if they get terminated, e.g. via the AWS Console.
Single-controller nodes now take an emergency backup if terminated.
Improved bootstrap / join flow
Whole clusters can now be launched at the same time: controller and worker CloudFormation stacks can be launched in one go.
Worker nodes will wait until the control plane becomes available, even during initial cluster bootstrap.
Core services updated to Ubuntu 20.04 LTS
Ability to execute scripts at terminate/reboot via a custom init service, e.g. sending messages to Slack for instance events like terminate.
With the optional AutoRollingUpdate feature, instances for services like bastion, nat or vpngw update/replace themselves automatically during stack updates without interruption.
Instances with the latest configuration are launched first and take over the functionality before any previous instances are terminated.
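Enabling this in a stack configuration could look as simple as the following sketch; the exact placement and spelling of the option are assumptions:

```yaml
# Hypothetical stack configuration enabling rolling replacement;
# the parameter location and name are illustrative assumptions.
parameters:
  AutoRollingUpdate: true   # launch replacement first, then terminate the old instance
```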
Improved operational visibility via optional SNS-to-Slack messaging for EC2 instance events at boot, terminate, etc.
This includes custom information depending on the instance, as well as optional debugging information in case of errors during boot.
Slack messages now leverage the new Slack APIs: improved layout, additional information, and deep links into the AWS Console for various events.
Bastion hosts now support multiple keys for SSH access. The lookup is performed at login time, allowing instant access control across all AWS infrastructure. Allowed users must be members of a certain IAM group and have uploaded their SSH keys. This also works across AWS accounts to support central user management.
New template to deploy CloudFront, an S3 bucket, an IAM content user and Lambda@Edge functions to host static websites, e.g. generated by Hugo.
This very site, https://zero-downtime.net, is deployed with it 😄
Further improvements reduce the size of user-data, allowing more complex boot logic while staying under the AWS-imposed size limit.