Why Everyone Builds Internal Kubernetes Platforms
Recently, you might have heard about “internal Kubernetes platforms” from many different sources: KubeCon talks, blog articles, or just colleagues and friends. Even if they are not always called internal Kubernetes platforms, solutions that give engineers standardized, easy access to Kubernetes in a cloud environment seem to be becoming more common.
In this article, I will describe what drives companies to build and adopt such platforms. I will rely on three publicly described examples from popular tech companies (Eventbrite, Spotify, and Datadog) and share some of my own experience from working on an out-of-the-box internal Kubernetes platform solution called Loft.
#The Internal Kubernetes Platform
I already defined what an internal Kubernetes platform is in another article.
In short, an internal Kubernetes platform gives engineers direct access to a Kubernetes dev environment that they can work with during the pre-production stage. Such a platform usually lets engineers create namespaces on a self-service basis, but solutions with whole clusters or virtual clusters (vClusters) are also possible.
Although the term “platform” may suggest that it is just an infrastructure component, internal Kubernetes platforms often also comprise tools that simplify using and managing the platform. These tools also make general Kubernetes development workflows, especially deployment workflows, easier for developers, even those with no previous Kubernetes experience.
Therefore, when you hear about internal Kubernetes platforms, you will often also hear about developer experience and the associated tools, not only about technical infrastructure decisions.
#What drives companies to build such platforms
Because it covers so many aspects, building an internal Kubernetes platform for engineers can be quite challenging. For example, it took Spotify two years to make its internal platform generally available. Given this extensive engineering effort and the considerable cloud computing cost of running the platform for large teams, the question arises why companies undertake such an investment. There clearly must be a lot of value in doing this, or even pressure to do it:
#1. Applications become too large for local environments
Let’s start exploring potential reasons for internal Kubernetes platforms by looking at the case of Eventbrite. In an interview, Remy DeWolf, principal engineer on the DevTools team at Eventbrite, explained why they moved away from local development to cloud-based development with Kubernetes. Like many companies, Eventbrite started out by building a monolith. This monolith grew larger and larger over time, which is why it was partly split into microservices and new services were added as microservices.
Since all of these are running in containers (at least for development), this is not a problem initially. However, as the software grows further, you will eventually reach a point when it is too big to run on local computers with local Kubernetes solutions such as Minikube or with Docker Desktop.
At this point, you can develop workarounds to keep your existing local development workflows, e.g. running only the necessary parts of an application, or running some parts locally and others remotely. This may be a suitable approach, but because most companies and systems keep growing, it becomes harder over time to continue with local development. Developer productivity then drops as developers struggle more and more with their local setups and slowed-down laptops.
You therefore face a trade-off between investing in an internal platform (including the continuous cost of cloud resources) and wasting engineering time. This explains why many companies start with local Kubernetes solutions, while an increasing number now reach the “tipping point” at which it becomes worthwhile to move to a cloud-based Kubernetes environment.
Eventbrite therefore decided to adopt a cloud environment and built an internal tool called “yak”, which allows engineers to deploy and manage remote containers. Every engineer now has their own namespace in a shared Kubernetes cluster and can easily run about 50 containers in it.
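To make the idea of one namespace per engineer in a shared cluster more concrete, here is a minimal sketch that generates a Namespace and a ResourceQuota manifest for a given user. It is not how Eventbrite's yak works internally, which the interview does not describe. The `dev-` prefix, the `example.com/owner` annotation key, the quota name, and all quota values are assumptions chosen for illustration; the default pod limit of 50 only loosely echoes the "about 50 containers" figure above.

```python
import json
import re


def namespace_name(username: str, prefix: str = "dev") -> str:
    """Derive a DNS-label-safe namespace name from a username."""
    slug = re.sub(r"[^a-z0-9]+", "-", username.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive a namespace name from {username!r}")
    # Kubernetes namespace names are limited to 63 characters.
    return f"{prefix}-{slug}"[:63].rstrip("-")


def dev_namespace_manifests(username: str, max_pods: int = 50) -> list:
    """Build a Namespace and a ResourceQuota manifest for one engineer."""
    name = namespace_name(username)
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            # Hypothetical annotation key, used here only to record the owner.
            "annotations": {"example.com/owner": username},
        },
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "dev-quota", "namespace": name},
        "spec": {
            "hard": {
                # Illustrative limits; real values depend on the workloads.
                "pods": str(max_pods),
                "requests.cpu": "8",
                "requests.memory": "16Gi",
            }
        },
    }
    return [namespace, quota]


if __name__ == "__main__":
    for manifest in dev_namespace_manifests("Jane.Doe"):
        print(json.dumps(manifest, indent=2))
```

Each printed manifest is plain JSON, which `kubectl apply -f` accepts alongside YAML. The quota is what keeps one engineer's experiments from starving everyone else on the shared cluster, which is the main design concern once namespaces become self-service.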
This evolution did not happen only at Eventbrite; Datadog, with about 800 developers, seems to have made a similar decision. I can also support this with the experience of our Loft customers: At some point, their applications simply become too complex to run locally, so the move to a cloud-based Kubernetes environment is a necessary step. Interestingly, this is not only true for large organizations but can also happen with fewer engineers if they develop resource-intensive or complex applications, such as AI/ML software.
In many cases, applications just become too large to run locally at some point. Then, it often makes sense to move from a local development environment to a Kubernetes dev environment in the cloud.
#2. Combining autonomy with focus
In his KubeCon talk, James Wen, Senior SRE at Spotify, did not explicitly share the reasons why they built an internal Kubernetes platform. However, one can still derive valuable insights into what probably drove this decision:
While he did not mention that local development was no longer possible for the engineers at Spotify, one can assume this, given that they have a very large number of teams (280+), engineers (1,500+), and microservices (1,200+) and that they already had a non-Kubernetes cloud solution in place before.
Still, I think James Wen mentioned some even more interesting aspects: Spotify uses an “Ops in Teams” model, which is an attractive middle ground between fully autonomous teams and fully centralized operations. The teams build and operate their own services independently, but they can rely on tooling and platforms provided by centralized ops teams.
I think this is a smart solution: it lets the teams work autonomously and configure Kubernetes as they need it (they can even adopt it at their desired pace), yet they do not have to manage everything themselves. Instead, they use a common platform that resolves all the “standard” issues that would otherwise have to be solved redundantly by every team.
From an admin (DevTool team) perspective, such a solution is also attractive as the admins do not have to interact with all dev environments individually, which could be a huge problem, e.g. if they had to provision all environments manually. Instead, the admins can focus on providing the cluster and a great platform experience and on supporting the developers using it, e.g. by solving issues, which is also easier as it is possible to share a dev environment in the cloud.
With an internal Kubernetes platform, developers can work freely and autonomously but at the same time do not have to care about operations tasks that can be better solved centrally. The admin teams benefit as they can focus on the platform and support the other engineers, but do not have to care about every individual environment.
#3. Streamlining Kubernetes workflows
When developers start to work with Kubernetes, it can be quite challenging. Using an internal platform can make the transition to Kubernetes easier as some tasks can be streamlined, and the developers do not have to set up Kubernetes themselves.
Spotify, for example, developed a comprehensive tutorial and documentation that enable new developers to get started with the internal platform, and thus with Kubernetes, easily and from day one. At Loft, we provide developers (the users of our customers’ internal Kubernetes platforms) with an onboarding guide that explains the basic concepts and all standard tasks.
This is possible because the work environments are always created in the same standardized way, which would not be possible on local computers due to different hardware, operating systems, system configurations, etc. Eventbrite also realized this advantage: They were able to reduce the “mean time to recovery”, i.e. the time needed to restore a clean state, which is necessary if a developer is stuck or has made a mistake. An internal Kubernetes platform thus also reduces the risk of experimenting with Kubernetes, which makes it more attractive for developers.
In general, many developers are interested in Kubernetes and want to learn and work with it, as the most recent Stack Overflow Developer Survey found. With an internal platform, they can get started with Kubernetes easily and then develop their skills step by step.
Here, additional Kubernetes tools play an important role. Spotify, Eventbrite, and Datadog have apparently all developed tools to make Kubernetes workflows easier for engineers. At least Spotify’s “tugboat” and Eventbrite’s “yak” serve not only as platform management tools but also support deployment processes, which leads to a generally better developer experience with Kubernetes. At Loft, we also see that this is critical for our customers, which is why we have developed a direct Loft integration for DevSpace, so that engineers need only one tool to start their work environment, develop in it, and deploy to it.
An internal Kubernetes platform that also covers some dev tooling helps to standardize Kubernetes workflows and so enables even engineers without Kubernetes experience to work with it. This facilitates the adoption process of the platform and supports the developers in learning more about Kubernetes.
#4. Saving cost
A further driver for the adoption of an internal Kubernetes platform can be cost savings. Generally, it is cheaper to share clusters, and a self-service namespace platform allows such cluster sharing. The savings come from more efficient auto-scaling of computing resources and from sharing resources and some central Kubernetes components (such as API servers). An internal platform can thus lead to cost savings compared to other cloud-based environments, such as individual clusters for developers or teams.
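The cost argument above can be illustrated with a back-of-the-envelope model that compares one cluster per developer with a single shared cluster. This is only a sketch: every number and field name below is a hypothetical assumption, not a figure from Eventbrite, Spotify, Datadog, or Loft customers, and real pricing depends on the cloud provider and workload.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CostAssumptions:
    """Hypothetical monthly prices and densities; replace with your own data."""
    control_plane_per_cluster: float  # fee for one cluster's control plane
    node_cost: float                  # cost of one worker node
    nodes_per_isolated_cluster: int   # nodes each per-developer cluster keeps running
    devs_per_shared_node: int         # developers whose workloads fit on one shared node


def monthly_cost_isolated(devs: int, a: CostAssumptions) -> float:
    """Every developer gets a cluster with its own control plane and nodes."""
    per_cluster = a.control_plane_per_cluster + a.nodes_per_isolated_cluster * a.node_cost
    return devs * per_cluster


def monthly_cost_shared(devs: int, a: CostAssumptions) -> float:
    """All developers share one control plane and pack onto common nodes."""
    nodes = math.ceil(devs / a.devs_per_shared_node)
    return a.control_plane_per_cluster + nodes * a.node_cost


if __name__ == "__main__":
    assumptions = CostAssumptions(
        control_plane_per_cluster=70.0,
        node_cost=100.0,
        nodes_per_isolated_cluster=1,
        devs_per_shared_node=4,
    )
    for team_size in (5, 20, 100):
        isolated = monthly_cost_isolated(team_size, assumptions)
        shared = monthly_cost_shared(team_size, assumptions)
        print(f"{team_size:>3} devs: isolated={isolated:.0f} shared={shared:.0f}")
```

The model deliberately ignores auto-scaling and idle time, both of which would typically widen the gap further in favor of the shared cluster.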
In addition to these direct cost savings, you can also save indirect costs through improved productivity: developers no longer have to take care of their Kubernetes environment themselves and can use optimized workflows. As such, the previously mentioned reasons for internal platforms can all ultimately contribute to reduced Kubernetes cost.
Again, I can support this with the experience of our customers, who realized significant cost savings (also thanks to cost-saving features such as a sleep mode that can be implemented in a common platform). Cost was also an important factor for engineering teams at Spotify to move to their internal offering, as James Wen noted in his talk.
Internal Kubernetes platforms are a very cost-efficient solution to provide engineers with a Kubernetes environment because computing resources can be fully utilized and engineers can work more productively.
#Important factors to provide an internal Kubernetes platform
Now, you might want to start setting up your own internal Kubernetes platform. I already wrote a guide on how to build an internal Kubernetes platform and you certainly should watch James Wen’s talk to learn about Spotify’s experience.
However, before you start, there are some important factors that are decisive for the success of your platform that I want to highlight here:
Provide a great developer experience: Even though your platform is for internal use, you have “customers” and these are the developers in your organization. To ensure that they will adopt and like your solution, you need to provide them a great developer experience. In the best case, that means that they can do everything with just one tool, including creating the environment, developing with it, and finally deploying to it. You will also need to have extensive documentation for the developers and this documentation should be written with the developer experience in mind (this can be challenging for the admins who write it). Overall, it is essential that your platform is the “best choice” for the developers, as James Wen described it.
Measure engagement and communicate with users: It is very important to understand your users. Therefore, you have to continuously communicate with the developers, which often happens naturally as the platform providers support the developers in using it. Additionally, you should also implement engagement measures to see if your platform is actually used and to find out what might be missing.
Measure cost and the performance of the platform: You should also measure whether your platform works reliably and make sure that you can fix problems fast. Having backups and extensive testing in place is very helpful for this. Keep in mind that, even though the platform is used “only” internally, it becomes crucial for the productivity of your whole engineering department once everyone is working with it. To support and understand the business case for the platform, measuring the cost before and after its introduction is also important.
Improve the platform and the dev experience: As you can see from the previous factors, building an internal platform is a continuous effort. Therefore, you always have to keep updating and improving the stability and performance of the platform as well as the developer experience with it (which also includes the documentation).
Dev Experience is more important than technical details: For the developers, it matters most that they can efficiently use the platform. They usually do not really care how the underlying tech works exactly or where it runs. (Eventbrite uses EKS, started with one cluster, and now runs several clusters, while Spotify relies on many clusters on GKE and Datadog uses just one self-managed cluster. I believe this choice is not really decisive for the developers.) The team of engineers providing the platform should thus take on a very user-centric perspective. This means that it can be worthwhile to invest even in seemingly minor improvements, which make the life of developers much easier.
Buying may be cheaper than building: As with most software, building it yourself is not always the best solution. Instead, you could simply buy an internal Kubernetes platform solution such as Loft, which is continuously developed by a dedicated team. This may make it better and cheaper for many use cases (especially those without very special needs). And even if you decide to build it yourself, you should consider integrating existing open-source solutions: for example, kiosk can help you implement multi-tenancy, and dev tools like Skaffold or DevSpace can serve as development and deployment tools.
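Of the factors above, measuring engagement is the easiest to start with. As a minimal sketch, assuming the platform can log which user touched their environment on which day, the following counts distinct active users per ISO week and relates them to the number of developers. The event format and the function names are hypothetical and not taken from any of the platforms mentioned.

```python
from collections import defaultdict
from datetime import date


def weekly_active_users(events):
    """Count distinct users per ISO week.

    `events` is an iterable of (username, day) pairs, where `day` is a
    datetime.date. Returns a dict mapping (iso_year, iso_week) to a count.
    """
    users_by_week = defaultdict(set)
    for username, day in events:
        iso = day.isocalendar()
        users_by_week[(iso[0], iso[1])].add(username)
    return {week: len(users) for week, users in sorted(users_by_week.items())}


def adoption_rate(active_users: int, total_developers: int) -> float:
    """Share of developers who used the platform in a given period."""
    if total_developers <= 0:
        raise ValueError("total_developers must be positive")
    return active_users / total_developers


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample_events = [
        ("ana", date(2024, 1, 1)),
        ("ana", date(2024, 1, 3)),
        ("ben", date(2024, 1, 5)),
        ("ana", date(2024, 1, 8)),
    ]
    for week, count in weekly_active_users(sample_events).items():
        print(week, count, f"{adoption_rate(count, 4):.0%}")
```

A falling weekly count is a cue to talk to the developers, which ties the metric back to the communication the same factor calls for.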
Internal Kubernetes platforms have become more common, even if they are not always called that. With such a platform, developers get direct access to a Kubernetes environment in the cloud.
Software teams that work with Kubernetes and want to scale their technology should expect to reach a point when local development becomes infeasible or at least inefficient. Then, it will make sense to move to a cloud environment, which should be a Kubernetes environment if your production system is also running on Kubernetes.
Given the time and effort it takes to build such a platform, teams should anticipate early on that they may need an internal development platform and start to continuously evaluate the introduction of such a technology, so they can find the right point in time at which the investment will pay off.
Once the decision for an internal Kubernetes platform has been made, the focus should be on the developer experience to ensure real adoption and engagement, because only then is it possible to realize the full benefits of such a platform.