How to Save More Than 2/3 of Engineers’ Kubernetes Cost

Using the cloud is not cheap. If you use a public cloud, you get a monthly bill for computing resources, traffic, and any additional services. A private cloud is costly as well once you take hardware, datacenter, and maintenance cost into account. (For simplicity, I will assume the public cloud case in the following, but I added a paragraph about private clouds at the end.)
The widespread adoption of Kubernetes has not changed this situation fundamentally. Some say it has only increased cost because the Kubernetes clusters themselves consume resources (e.g. Master node, LoadBalancer, etcd). Others are convinced that Kubernetes could at least reduce the cost because it makes the use of the cloud resources more efficient. Overall, it probably depends on your setting and perspective, but the cost of the cloud remains an issue either way.
For this reason, reducing Kubernetes cloud computing cost is always attractive. This is especially true as more and more companies adopt Kubernetes throughout the organization and introduce cloud-native development, which includes giving engineers access to Kubernetes for development, testing, experimentation, or CI/CD (level 2 or level 3 on the cloud-native journey according to this article).
In these cases, companies are sometimes hesitant to spend a lot of money on the cloud as this, contrary to spending money on production workloads, is not directly connected to revenue and customer benefit. Still, giving developers direct Kubernetes access can make a lot of sense and can pay off in terms of higher stability and more efficient workflows, too.
Making cloud-native development cheaper and easier can thus be an important driver for Kubernetes adoption, especially in small and medium-sized organizations. I will therefore describe two “obvious” inefficiencies in the engineers’ use of the cloud and easy ways of eliminating them to significantly reduce the cost of using Kubernetes. Since developers use the cloud differently than production systems do, these cost-saving opportunities are most relevant for non-production environments.
An easy option to give every engineer access to a cloud environment is to give everyone their own cluster. However, this is pretty inefficient for several reasons.
The solution to the first inefficiency of many non-production clusters is to share clusters among engineers. That means that only one cluster is used and developers work in their own namespaces, which limit them and isolate them from each other. I wrote a separate article comparing individual clusters and shared clusters, but regarding the described cost inefficiencies, sharing a cluster has the following advantages:
The Kubernetes cost savings from reducing the number of clusters by sharing a cluster are highest if many clusters were in use before, which is typical for larger teams or if clusters are used for many purposes such as in-cluster development, testing, and staging. The saving effect relative to the total cost is also particularly high if the clusters are rather small and thus cheap. In these cases, the cluster management fee and the redundancy are comparatively expensive cost drivers.
Let’s take a look at a simple example to demonstrate the cost-saving opportunities of reducing the number of clusters by Kubernetes cluster sharing:
If everyone in a small team of 5 developers gets their own Kubernetes cluster and the clusters are always running, you have to pay $360 per month in cluster management fees alone. If they share one cluster, you only have to pay the cluster management fee for one cluster, which is $72 per month. In such a small team, you would thus already save $288 per month just in cluster management fees. (For GKE, one “zonal cluster” is free, so you would pay $288 in cluster management fees for 5 clusters and $0 for one cluster. Overall, the savings remain unchanged.)
On top of that, you could save additional money by sharing the basic Kubernetes features and through better control, which prevents unused clusters. However, since these factors are hard to estimate, I do not add them to the calculation.
Of course, if you use commercial tools such as Loft, you need to pay for these tools, too. In the example of 5 developers using Loft, the cost would be $75.
Your total savings from sharing a cluster would thus be at least $213 per month (roughly 60% of the cluster management fees), even for a small team of 5.
In non-production settings, the Kubernetes clusters are actually unused most of the time, while they cost you money all the time. One option to prevent this is to tell your engineers to reliably delete their stuff or shut down their spaces when they do not use them anymore. This is, however, not so easy to implement in reality. Sometimes, engineers may just forget about it as distractions come in, or they are not at work when they should shut down their spaces. A few examples of when computing resources are typically unused:
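The arithmetic of this example can be reproduced with a short sketch. The function and parameter names are made up for illustration, and the numbers are simply the ones used above ($72 per cluster per month, $75 for the tool); replace them with your own provider's pricing.

```python
def cluster_sharing_savings(clusters_before, clusters_after, fee_per_cluster, tool_cost):
    """Return (fees before, gross savings, net savings) in USD per month."""
    fees_before = clusters_before * fee_per_cluster
    fees_after = clusters_after * fee_per_cluster
    gross_savings = fees_before - fees_after
    net_savings = gross_savings - tool_cost  # tool cost is an input, not a fixed price
    return fees_before, gross_savings, net_savings


# 5 individual clusters versus 1 shared cluster, figures from the example above
fees_before, gross, net = cluster_sharing_savings(5, 1, 72, 75)
print(fees_before, gross, net)       # 360 288 213
print(f"{net / fees_before:.0%}")    # 59%
```

The exact ratio is about 59%, which the text rounds to 60%.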
The solution to this problem is to activate a so-called sleep mode of unused spaces in the cloud. This means that the containers will be terminated while all persistent data will be preserved. The computing resources can thus be scaled down and you do not have to pay for them anymore. When they are needed again, they can be scaled up and the containers will be restarted automatically again with all the persistent data immediately available. In tools such as Loft, a sleep mode can either be started manually or automatically after a pre-defined period of time.
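To make the idea concrete, here is a minimal sketch of the decision behind an automatic sleep mode. It is not how Loft implements the feature; the function names, the one-hour default, and the replica counts are assumptions chosen purely for illustration.

```python
from datetime import datetime, timedelta


def should_sleep(last_activity, now, idle_timeout=timedelta(hours=1)):
    """Return True once a space has been inactive for at least idle_timeout."""
    return now - last_activity >= idle_timeout


def target_replicas(desired_replicas, sleeping):
    """A sleeping space runs zero containers; persistent data is not touched."""
    return 0 if sleeping else desired_replicas


# Hypothetical space: last activity at 17:30, checked at 19:00
last_activity = datetime(2024, 1, 8, 17, 30)
now = datetime(2024, 1, 8, 19, 0)
sleeping = should_sleep(last_activity, now)
print(sleeping, target_replicas(3, sleeping))  # True 0
```

Waking the space up is the reverse step: the replica count goes back to the desired value, and the containers start again with their persistent data still in place.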
Overall, this means that the cloud resources in Kubernetes that the engineers use are flexibly available, and you only pay during the times they are really needed. The advantage of an automatic sleep mode is that it does not affect the engineers at all: they have to neither start nor end it, so their Kubernetes workflow remains as efficient as possible.
The cost savings of a sleep mode are of course highest when there are a lot of computing resources generally used, i.e. in larger teams or with complex applications and experiments with high resource requirements. Another factor for the potential of the sleep mode is the length of the idle times.
For a typical engineer with a 40 hour work week, the savings can be about 75% (128 of 168 hours per week are non-working hours, not accounting for holidays, sick leave, or other tasks of the engineer that do not require computing resources such as meetings). If the Kubernetes environment is used for some other tasks such as testing or experimentation, the savings can be even higher (potentially, there are days during which 100% could be saved when no experiments are run during that day).
Now, we can look at a similar example as before with a team of 5 engineers. Since we are now looking at computing costs only, we will ignore cluster management fees. We also assume that the average engineer works 8 hours a day, 5 days per week. For simplicity, we further ignore additional factors such as holidays, sick leave, and other tasks that do not require Kubernetes access, e.g. meetings. In sum, these factors can of course be substantial, but they would only strengthen the result of the example by adding further saving opportunities.
Now, let’s say that the engineers need 4 vCPUs and 15 GB RAM each to run the application they are working on. Depending on the software, this could, of course, be quite a lot (e.g. for a simple website) or not nearly enough (e.g. for complex ML applications), so there is some variance here. In GKE, this is equal to an “n1-standard-4” machine that costs about $97 per month.
For the team of 5 engineers, you would thus face computing resource costs of $485 per month. Since the engineers only work 40 hours per week, the computing resources are not used for 128 hours per week, which is more than 75% of the time.
With a sleep mode configured to send unused spaces to sleep after 1 hour of inactivity, the computing resources are active for only 45 hours per week, and the unused but paid time shrinks to 5 hours (1 hour per workday). You therefore only pay about 27% of the computing resource cost (45 of 168 hours), and the share of paid time during which the resources sit unused drops from more than 75% to about 11% (5 of 45 hours).
In absolute terms, this means that you only have to pay about $130 per month for all 5 engineers, saving about $355. Again, even after deducting $75 for tools that enable a sleep mode, such as Loft, you get net savings of $280 every month.
With solutions such as Loft, which is based on kiosk, it is possible to eliminate both of the described inefficiencies. Loft is a multi-tenancy manager for Kubernetes that allows engineers to create namespaces in a shared cluster on demand. It also comes with a built-in sleep mode feature to save costs for unused computing resources.
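The sleep mode example can be checked with a few lines as well. This is only a sketch under the same simplifying assumptions as the text (cost proportional to active hours, one idle hour per workday before the space falls asleep); the function name is hypothetical.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def sleep_mode_cost(always_on_cost, work_hours_per_week, workdays_per_week, idle_timeout_hours):
    """Return (active hours per week, active share, monthly cost with sleep mode)."""
    # Each workday ends with one idle timeout before the space goes to sleep
    active_hours = work_hours_per_week + workdays_per_week * idle_timeout_hours
    active_share = active_hours / HOURS_PER_WEEK
    return active_hours, active_share, always_on_cost * active_share


always_on = 5 * 97  # 5 engineers, about $97 per machine per month
active_hours, share, cost = sleep_mode_cost(always_on, 40, 5, 1)
print(active_hours, round(share, 3), round(cost))  # 45 0.268 130
print(round(always_on - cost))                     # 355
print(round(always_on - cost) - 75)                # 280 after the $75 tool cost
```

Changing `idle_timeout_hours` shows how sensitive the result is to the timeout: a shorter timeout saves more but puts spaces to sleep during short breaks more often.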
To use Loft, you install it in a Kubernetes cluster that then becomes your management cluster for Loft. Afterward, you can connect other clusters (or the management cluster itself) to Loft and in these connected clusters, the engineers can start their isolated namespaces. Overall, Loft achieves the solutions I described before out-of-the-box:
Combining the two examples from above, one can calculate the total cost savings of introducing shared clusters and a sleep mode for the sample engineering team of 5. For this, one only has to sum up the cost savings from both examples. The only thing that needs to be adjusted is the cost to get cluster-sharing and the sleep mode. Since both features are included in Loft, the usage fee for it only has to be accounted for once.
Overall, it is thus possible to save more than two-thirds of the cost, which is $568 per month for the sample engineering team. Reducing the monthly cost per engineer from $169 to $55.40 makes providing engineers with direct access to Kubernetes in the cloud feasible for smaller organizations, too.
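Put together, the combined calculation looks as follows. The variable names are chosen for illustration, and the inputs are the rounded figures from the two examples above.

```python
developers = 5
management_fee = 72   # USD per cluster per month
machine_cost = 97     # USD per engineer per month
tool_cost = 75        # USD per month, counted only once

cost_before = developers * (management_fee + machine_cost)  # 845
cluster_savings = (developers - 1) * management_fee         # 288
compute_savings = 355  # rounded result of the sleep mode example

net_savings = cluster_savings + compute_savings - tool_cost
cost_after = cost_before - net_savings

print(net_savings, round(net_savings / cost_before, 3))    # 568 0.672
print(cost_before / developers, cost_after / developers)   # 169.0 55.4
```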
The described problems and solutions are mostly applicable to private clouds, too. Here, sometimes the usage-cost relationship is not so clear, especially if you own the hardware yourself. In these cases, the computing cost for the cloud is rather a fixed cost than a variable cost, except for power and network traffic cost. Some other additional factors such as cluster management fees are not an issue in private clouds, but the maintenance effort is usually a bigger challenge.
If you apply the described solutions of cluster sharing and a sleep mode in a private cloud, it can still save you cost, e.g. because central Kubernetes features are not running redundantly. Overall, the direct cost effect is potentially smaller, but there is an additional indirect effect: If you are using sleep mode and can configure user limits centrally, it is easier to “oversubscribe” the available computing resources as you already know that not all of them will be used at the same time.
For example, you could activate the sleep mode so that all spaces for in-cluster development for developers are sleeping after work and then schedule complex processes such as ML experiments to run at night in spaces that will sleep again in the morning when the engineers’ workday starts. Another example would be remote teams in different time zones that can alternately use the same computing resources.
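The oversubscription effect can be illustrated with a small sketch that compares the sum of all reserved resources with the actual peak demand per hour. The schedule below is hypothetical: the 4 vCPUs per engineer come from the earlier example, while the 16 vCPU night job and the working hours are assumptions.

```python
def peak_demand(reservations):
    """Peak vCPU demand over a day.

    reservations: list of (start_hour, end_hour, vcpus) with hours in 0-23.
    The end hour is exclusive, and a window may wrap past midnight.
    """
    per_hour = [0] * 24
    for start, end, vcpus in reservations:
        duration = (end - start) % 24
        for offset in range(duration):
            per_hour[(start + offset) % 24] += vcpus
    return max(per_hour)


reservations = [(9, 18, 4)] * 5    # 5 engineers, 4 vCPUs each, 09:00 to 18:00
reservations.append((20, 6, 16))   # hypothetical ML experiment, 20:00 to 06:00

print(sum(vcpus for _, _, vcpus in reservations))  # 36 vCPUs reserved in total
print(peak_demand(reservations))                   # 20 vCPUs needed at peak
```

In this made-up schedule, hardware for 20 vCPUs is enough even though 36 vCPUs are reserved overall, because the spaces that are asleep do not occupy any computing resources.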
Therefore, there is also a savings impact in private clouds. It might be less immediate, but using the available computing resources more efficiently is generally beneficial and can eventually reduce your hardware needs, which will also save you a lot of cost.
With engineers’ access to cloud environments, the cost of the cloud becomes an even more important issue. While simple solutions with many clusters and no central control might be a first start, they are definitely not efficient in terms of cost. Over time, more companies should thus move to more advanced solutions to limit the number of clusters and to reduce the time computing resources cost money while they are not used.
With cluster sharing among engineers and an individually configured sleep mode, a lot of computing resource costs can be saved. Fortunately, the implementation of these solutions with off-the-shelf solutions such as Loft is now easier than ever and available even to small and medium-sized companies and not only to big players that might build such solutions themselves.
In the provided example, the cost can thus be reduced by more than two-thirds, leading to savings of about $570 per month in a 5-person team. Maybe even more importantly, the cost per engineer is reduced from a deterrent $169 to a moderate $55.40 per month. This not only makes the adoption of Kubernetes throughout whole organizations cheaper and more attractive but also enables even smaller companies to start their cloud-native transformation.
Photo by Micheile Henderson on Unsplash