Kubernetes: Virtual Clusters For AI & ML Experiments


Artificial Intelligence (AI) and Machine Learning (ML) have been some of the hottest IT topics in recent years. A recent O’Reilly survey discovered that 85% of companies are already using AI or are evaluating it. Of course, such an impressive majority of companies is using AI and ML for a good reason: According to a McKinsey survey, 63% of companies are able to increase revenues and 44% reduce cost with Artificial Intelligence. This has made the efficient implementation of AI an important competitive advantage.

AI Adoption Challenges

Still, many companies face challenges on their AI adoption journey, which O’Reilly also surveyed: One of the top 3 issues is the lack of skilled AI/ML talent. From this, one can conclude that the productivity of the existing engineers is also an important factor for efficient AI adoption.

Here, technical aspects also come into play, and two of them were even mentioned among the top 10 challenges: Workflow Reproducibility and Technical Infrastructure.

These three challenges (engineering productivity, workflow reproducibility, and technical infrastructure) can be directly addressed by virtual Kubernetes clusters (vClusters), which I will describe next.

In a separate post, we describe what virtual Kubernetes clusters are and how they work.

1. Engineering Productivity

Challenge

Given the scarcity of suitable candidates for the many open AI positions in the market, the productivity of the available engineers should be of utmost importance. This means that engineers should be able to work on the software itself and should not spend time unproductively, e.g. waiting for or managing execution environments for the software.

Impact of Virtual Clusters

Availability: With virtual Kubernetes clusters, engineers always have an individual execution environment at hand that can be started on-demand within seconds. That means that engineers do not have to wait until an admin creates such an environment for them, or until a colleague who is using a shared central testing cluster at the same time has finished a test.

Scalability: Artificial intelligence and machine learning software often requires a lot of computing resources. With vClusters, the AI engineers have these computing resources available when they need them, which also accelerates the execution of the experiments. This scalability, which is provided by Kubernetes itself, is available to engineers even during the early stages of development and not only in production later on.

2. Workflow Reproducibility

Challenge

Tests and experiments need to be executed several times, often with slightly different parameters. To get meaningful results and to keep efficiency high, the workflows to start these experiments should be easy, fast, and not prone to errors.

Impact of Virtual Clusters

Recoverability: Since Kubernetes itself is declarative, it is relatively easy to recreate identical Kubernetes environments. The same goes of course for virtual Kubernetes clusters. With vClusters, it is even easier and faster to create a fresh environment with the exact same specifications.
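To illustrate the declarative idea, here is a minimal Python sketch that is not taken from any specific tool. It builds a Kubernetes Job manifest for one experiment purely from its parameters. The image name, the hyperparameters, and the helper names `experiment_manifest` and `manifest_fingerprint` are assumptions made up for this illustration. Because the manifest is a plain function of its inputs, the same parameters always yield the same specification, which is what makes an environment easy to recreate.

```python
import hashlib
import json


def experiment_manifest(name, image, params):
    """Return a Kubernetes Job manifest (as a dict) for one experiment run."""
    # Sort the parameters so the generated arguments do not depend on dict order.
    args = [f"--{key}={value}" for key, value in sorted(params.items())]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": "experiment", "image": image, "args": args}
                    ],
                    "restartPolicy": "Never",
                }
            }
        },
    }


def manifest_fingerprint(manifest):
    """Hash the canonical JSON form so two specifications can be compared."""
    canonical = json.dumps(manifest, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical image and hyperparameters, used only for illustration.
IMAGE = "registry.example.com/train:1.0"
first = experiment_manifest("exp-lr-001", IMAGE, {"lr": 0.01, "epochs": 5})
second = experiment_manifest("exp-lr-001", IMAGE, {"epochs": 5, "lr": 0.01})
print(manifest_fingerprint(first) == manifest_fingerprint(second))  # True
```

Comparing fingerprints is one simple way to check that a fresh environment was requested with exactly the same specification as an earlier one.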

Parallelization: When working with a virtual cluster platform, engineers are usually allowed to create multiple vClusters themselves without having cluster admin rights for the physical cluster. As a result, engineers can even run experiments in parallel, which speeds up the feedback cycle (and again contributes to engineering productivity and velocity, see Challenge 1).
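The parallel workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_experiment` is a hypothetical placeholder that stands in for deploying a workload to a virtual cluster, and the vCluster names and learning rates are invented. The point is the pattern of one isolated environment per parameter setting, with all runs started at once.

```python
from concurrent.futures import ThreadPoolExecutor


def run_experiment(vcluster_name, learning_rate):
    """Placeholder for one experiment run inside its own virtual cluster.

    A real version would deploy the workload to the vCluster and collect
    metrics; here we only return a record so the sketch stays runnable.
    """
    return {"vcluster": vcluster_name, "learning_rate": learning_rate}


learning_rates = [0.001, 0.01, 0.1]
# One hypothetical virtual cluster per parameter setting.
vcluster_names = [f"exp-{index}" for index in range(len(learning_rates))]

# Threads fit here because real runs would mostly wait on the cluster.
with ThreadPoolExecutor(max_workers=len(learning_rates)) as pool:
    # map() returns results in input order, regardless of completion order.
    results = list(pool.map(run_experiment, vcluster_names, learning_rates))

for result in results:
    print(result["vcluster"], result["learning_rate"])
```

Since every run has its own environment, the runs cannot interfere with each other, and the results can be compared directly once all of them have finished.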

3. Technical Infrastructure

Challenge

A major challenge with the technical infrastructure is how to give the engineers access to it. In general, there are different ways to solve this, e.g. with individual clusters or shared clusters. The problem here is often that the solutions either are very inflexible and vulnerable (limited user isolation for shared clusters) or become a management nightmare for the admins (too many clusters to maintain).

Impact of Virtual Clusters

Flexibility: Due to the strong user isolation of virtual clusters (compared to Kubernetes namespaces), engineers are very flexible in how they use their individual environments. They can create vClusters on-demand, configure them as needed (they can even choose which Kubernetes version to use), and finally delete them without any impact on their colleagues.

Manageability: While the engineers work in individual sandbox environments that feel like “real” Kubernetes clusters, the underlying physical cluster is shared. As a cluster admin, you only have to manage one physical cluster, which is comparatively simple and easy to oversee, especially because this cluster does not need any complex installations or features. The user management that determines who can create vClusters, and how many of them, can also be centrally controlled, e.g. with engineer-friendly platforms such as loft.
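The central control over who may create vClusters, and how many, can be pictured as a simple per-user quota. The following Python sketch is a toy model and not how loft or any other platform implements it; the class `VClusterQuota`, its methods, and the user names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VClusterQuota:
    """Tracks how many virtual clusters each user owns against a per-user limit."""

    max_per_user: int
    owned: dict = field(default_factory=dict)  # user -> set of vCluster names

    def create(self, user, name):
        """Register a new vCluster for the user; return False if over the limit."""
        clusters = self.owned.setdefault(user, set())
        if name in clusters or len(clusters) >= self.max_per_user:
            return False
        clusters.add(name)
        return True

    def delete(self, user, name):
        """Remove a vCluster, which frees one slot of the user's quota."""
        self.owned.get(user, set()).discard(name)


quota = VClusterQuota(max_per_user=2)
print(quota.create("alice", "exp-a"))  # True
print(quota.create("alice", "exp-b"))  # True
print(quota.create("alice", "exp-c"))  # False: alice is at her limit
print(quota.create("bob", "exp-a"))    # True: limits are tracked per user
```

Engineers stay self-sufficient within their limit, while the admin only has to maintain a single number per user or team.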

4. Cost Concerns

Challenge

Even though cost was not mentioned in the aforementioned O’Reilly study, it is always a concern, especially for resource-intensive Artificial Intelligence or Machine Learning applications. Cost will come into focus even more as AI spreads further within organizations.

Impact of Virtual Clusters

Utilization: There is only one central physical cluster. It should support auto-scaling, so that it produces hardly any cost when it is not used, and the AI engineers create virtual clusters on top of it when they need them. This eliminates the need for expensive standby clusters that are unused most of the time.

Sleep Mode: Virtual cluster solutions such as loft provide a sleep mode that automatically sends vClusters to sleep when they are not needed, so that costly idle times are reduced. When an engineer needs the cluster again, it wakes up automatically, so the engineer is not affected. In fact, engineers do not need to interact with the sleep mode feature at all.
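The behavior of such a sleep mode can be modeled as an idle timeout. The Python sketch below only illustrates the concept and is not loft's actual implementation; the class `SleepController`, its fields, and the 15-minute timeout are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SleepController:
    """Puts a virtual cluster to sleep after a period without activity."""

    idle_timeout_seconds: float
    last_activity: float = 0.0
    sleeping: bool = False

    def record_activity(self, now):
        """Any request counts as activity and wakes a sleeping cluster."""
        self.last_activity = now
        self.sleeping = False

    def tick(self, now):
        """Periodic check: sleep once the idle time reaches the timeout."""
        if not self.sleeping and now - self.last_activity >= self.idle_timeout_seconds:
            self.sleeping = True
        return self.sleeping


# Timestamps are passed in explicitly (in seconds) to keep the sketch deterministic.
controller = SleepController(idle_timeout_seconds=900)
controller.record_activity(now=0)
print(controller.tick(now=600))    # False: idle for 10 minutes only
print(controller.tick(now=900))    # True: idle for 15 minutes, cluster sleeps
controller.record_activity(now=1000)
print(controller.sleeping)         # False: the request woke the cluster up
```

The engineer never calls anything like "sleep" or "wake" explicitly. Both transitions follow from ordinary usage, which matches the hands-off experience described above.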

How to Get Virtual Kubernetes Clusters

Virtual Kubernetes clusters are still a very new concept, but some solutions already exist. From the community, there were some early proofs of concept, such as k3v or the project from the Kubernetes multi-tenancy SIG.

While these are rather fundamental implementations, loft provides a more comprehensive solution that, besides the actual virtual cluster technology, includes user management with Single Sign-On, an engineer-friendly graphical UI, and the previously mentioned sleep mode. With such a solution, it is possible to leverage all benefits of virtual clusters for AI and ML scenarios.

Virtual clusters can of course also be used in scenarios outside of AI and ML, such as cloud-native development or CI/CD and testing. For a general overview of the benefits and use cases of virtual clusters, take a look at this post.

Conclusion

For companies that already rely on Kubernetes for their AI and ML software, virtual clusters can solve two of the most common technical hurdles for AI adoption: how to manage the provisioning of technical infrastructure and how to allow repeatable workflows. Additionally, vClusters help to boost engineers’ productivity, which is particularly valuable in the face of a shortage of skilled engineers. Finally, virtual Kubernetes clusters also help to reduce cost, a topic that is always relevant and becomes even more pressing as AI adoption within organizations progresses.

Virtual clusters could thus pave the way for further AI adoption and help even more companies to realize the full benefits of AI, such as increased revenue and reduced operational cost.


Photo by Upal Patel on Unsplash