Deploying Machine Learning Models on Kubernetes with vCluster Tutorial

Lukas Gentele
Sooter Saalu

The rapid advancements in generative AI have propelled machine learning and artificial intelligence from academia to real-world applications. Despite this progress, many organizations still struggle to deploy models effectively, which limits the impact, performance, and cost-effectiveness of their ML work. Choosing the right deployment platform means weighing data availability, business alignment, scalability, and user needs, all of which shape both the initial deployment and long-term model monitoring and management.

vCluster is a powerful solution for machine learning deployment. Built on the Kubernetes platform, vCluster facilitates the creation of multitenant virtual Kubernetes clusters. These segmented clusters ensure isolation and scalability, which are crucial for both testing and production deployments.

This article explores the benefits of using virtual Kubernetes clusters for ML model deployment and how to deploy ML models with KServe, a Kubeflow component that focuses on ML model deployment, serverless inferencing, monitoring, and management. You’ll also learn about tools and strategies for managing and monitoring your models.

#Benefits of vCluster for ML Model Deployment

Kubernetes clusters provide essential features for your ML workflow, and vCluster enables the creation of multitenant virtual clusters. This allows you to partition larger clusters into shareable segments, ensuring proper isolation and scalability for testing or production deployments within subclusters.

vCluster can transform your ML model deployment by providing highly configurable virtual clusters with very low overhead. Your created virtual cluster will have access to the full spectrum of Kubernetes control plane options, allowing you to run different Kubernetes configurations and adapt to the demands of your machine learning workloads. This is especially valuable in multitenant environments with varying levels of priority.

With vCluster, you can establish isolated environments for distinct tasks or projects, a beneficial feature for ML workflows. This isolation safeguards against interference or conflicts between experiments, allowing each to operate independently.

In addition to scalability and environment isolation, vCluster helps optimize hardware usage, reducing costs by running multiple virtual clusters on the same underlying host cluster. It also facilitates controlled testing and validation of ML models, so that each model can be evaluated thoroughly prior to deployment.

#Setting Up vCluster for Kubernetes Deployment

vCluster runs on top of your host Kubernetes cluster and can be installed as a CLI tool across various systems, including Mac, Linux, and Windows. It uses the resources of your host cluster: pods created in the virtual cluster are replicated to the host cluster, which schedules and runs them, so workloads run with very low overhead.

vCluster needs a host cluster to operate, but installing the CLI itself is simple: download the binary release for your system and run the install command.

The following is an example installation on a Linux AMD64 machine:

curl -L -o vcluster "" && sudo install -c -m 0755 vcluster /usr/local/bin && rm -f vcluster

This should give you the following output as the tool downloads and installs:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 43.2M  100 43.2M    0     0  4414k      0  0:00:10  0:00:10 --:--:-- 5596k

This sets up vCluster as an executable command line tool.

You can then create a virtual cluster using the following command:

vcluster create <Name-of-Virtual-Cluster>

For additional configuration and resource management of your virtual clusters, you can attach a YAML configuration file when using vCluster:

vcluster create my-vcluster -f values.yaml

#Deploying an ML Model Using Kubeflow’s KServe on vCluster

Kubeflow leverages containerized components, enabling the creation of distinct modules for data processing, model training, deployment, and monitoring. These components can be linked together to form a cohesive data pipeline or deployed as independent entities. This approach focuses on creating modular data flows where self-contained functions are designed for specific purposes. Components can be implemented as either Python components or image containers.

To deploy your model on the cloud, you must package it as a container image and make it accessible through artifact repositories or store it in object storage systems such as Amazon S3 and Google Cloud Storage buckets.

You can use a similar process to install KServe in your vCluster space as you would on a host Kubernetes cluster.

Let’s run through the step-by-step process of creating a Kubernetes cluster with minikube, setting up a virtual cluster for testing, installing KServe, and deploying a model locally.

#Machine Learning Model Deployment

Begin by creating a cluster with a Kubernetes version compatible with KServe. This ensures seamless integration with the KServe ecosystem. For instance, you can use the command below to start minikube with a specified Kubernetes version:

minikube start --kubernetes-version=v1.25.0

Next, build a virtual cluster on top of your host cluster. This step is pivotal in creating isolated environments for your model deployments. The virtual cluster (sample-cluster in this instance) provides a dedicated space for your models to operate efficiently:

vcluster create sample-cluster

Install KServe, Kubeflow’s model deployment extension, using the provided quick install script. This step enables you to deploy and manage your ML models within your vCluster environment:

curl -s "" | bash

This script creates the necessary services and resources needed to run KServe. You can check that they are all running by using the following command:

kubectl get pods -A

Next, serialize and store your model for easy accessibility by KServe. Here’s a sample model being created and serialized:

from sklearn import svm
from sklearn import datasets
from joblib import dump

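# Load the iris dataset and split it into features and labels
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Train a support vector classifier
clf = svm.SVC(gamma='scale')
clf.fit(X, y)

# Serialize the trained model to disk
dump(clf, 'model.joblib')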

In this example, a sample SVM model is trained and serialized using the joblib library. You can then upload the serialized model to Google Cloud Storage to make it accessible to your deployment component.

Create a dedicated namespace, kserve-test, for deploying your models. This ensures clear organization and separation of resources within your Kubernetes cluster:

kubectl create namespace kserve-test

Then, apply the provided YAML configuration to create the model deployment service. This YAML defines an InferenceService for the model and specifies the Cloud Storage URI for the model artifacts. Applying it within the kserve-test namespace sets up the necessary resources:

kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
EOF

Check the status of your InferenceService to confirm deployment. It might take a minute or so for it to gain full functionality:

kubectl get inferenceservices sklearn-iris -n kserve-test
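If you want to script this check rather than rerun the command by hand, the JSON form of the same query can be inspected for a `Ready` condition. The following is a minimal sketch, assuming you have captured the output of `kubectl get inferenceservice sklearn-iris -n kserve-test -o json` as a string. The helper name `is_inference_service_ready` and the sample data are hypothetical and only illustrate the shape of the status block.

```python
import json


def is_inference_service_ready(isvc_json: str) -> bool:
    """Return True if the InferenceService reports a Ready condition with status "True"."""
    isvc = json.loads(isvc_json)
    conditions = isvc.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


# Hypothetical sample, trimmed to the fields this check reads.
sample_isvc = json.dumps({
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris", "namespace": "kserve-test"},
    "status": {
        "conditions": [
            {"type": "PredictorReady", "status": "True"},
            {"type": "Ready", "status": "True"},
        ]
    },
})

print(is_inference_service_ready(sample_isvc))
```

A service that has just been created may have no status block at all yet, which is why the helper treats missing fields as "not ready" instead of raising an error.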


You can now query your deployed model for predictions using curl commands or integrate these commands into your applications. Your connection to your deployed model depends on your Kubernetes environment and the availability of an external IP for your model service.

To send inference requests from your local machine, you can port-forward the Istio ingress gateway service using the following commands:

INGRESS_GATEWAY_SERVICE=$(kubectl get svc --namespace istio-system --selector="app=istio-ingressgateway" --output jsonpath='{.items[0].metadata.name}')

kubectl port-forward --namespace istio-system svc/${INGRESS_GATEWAY_SERVICE} 8080:80

Then, open a second terminal for your requests and set the following variables for your local connection:

export INGRESS_HOST=localhost
export INGRESS_PORT=8080

The provided example demonstrates sending input data in JSON format to get predictions from the deployed sklearn-iris model:

cat <<EOF > "./iris-input.json"
{
  "instances": [
    [6.8,  2.8,  4.8,  1.4],
    [6.0,  3.4,  4.5,  1.6]
  ]
}
EOF

This command creates an iris-input.json file with sample content for your model inferences.
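When your inputs come from code rather than a hand-written file, the same payload can be built in Python. This is a minimal sketch, assuming the `instances` request format used in the file above. The helper name `build_iris_payload` is hypothetical, and the feature-count check simply reflects the four numeric features of the iris dataset.

```python
import json
from typing import Sequence

N_FEATURES = 4  # each iris sample has four numeric features


def build_iris_payload(rows: Sequence[Sequence[float]]) -> str:
    """Validate the rows and return the request body as a JSON string."""
    for i, row in enumerate(rows):
        if len(row) != N_FEATURES:
            raise ValueError(
                f"row {i} has {len(row)} features, expected {N_FEATURES}"
            )
    return json.dumps({"instances": [[float(v) for v in row] for row in rows]})


payload = build_iris_payload([[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]])
print(payload)
```

Writing `payload` to `iris-input.json` (for example with `pathlib.Path("iris-input.json").write_text(payload)`) gives you a file equivalent to the one created above.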

Next, you can query the deployed model with this file to produce predictions:

SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -n kserve-test -o jsonpath='{.status.url}' | cut -d "/" -f 3)

curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict" -d @./iris-input.json
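If you plan to call the model from an application, the curl request translates fairly directly to Python's standard library. The sketch below only builds the request so you can inspect it before sending. The function name `build_predict_request` is hypothetical, and the hostname `sklearn-iris.kserve-test.example.com` is a placeholder for whatever value `SERVICE_HOSTNAME` holds in your environment.

```python
import json
import urllib.request


def build_predict_request(ingress_host, ingress_port, service_hostname,
                          model_name, payload):
    """Build a POST request equivalent to the curl command above."""
    url = f"http://{ingress_host}:{ingress_port}/v1/models/{model_name}:predict"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # The Host header tells the ingress gateway which service to route to.
            "Host": service_hostname,
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_predict_request(
    "localhost", 8080,
    "sklearn-iris.kserve-test.example.com",  # placeholder hostname
    "sklearn-iris",
    {"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]},
)
print(req.get_method(), req.full_url)

# To send it while the port-forward is running:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(json.load(resp))
```

Keeping request construction separate from sending makes it easier to test the URL and headers without a running cluster.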

This command should produce an output similar to the following:

*   Trying
* Connected to localhost ( port 8080 (#0)
> POST /v1/models/sklearn-iris:predict HTTP/1.1
> Host:
> User-Agent: curl/7.68.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 76
* upload completely sent off: 76 out of 76 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-length: 21
< content-type: application/json
< date: Fri, 13 Oct 2023 14:44:24 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 631
* Connection #0 to host localhost left intact

This step completes the process of deploying your machine learning model with KServe. KServe allows you to interact with your deployed ML model and receive predictions based on input data.
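The response body is a JSON object with a `predictions` list holding one class index per input row. As a rough sketch of how you might turn those indices into readable labels, the snippet below maps them to the iris species names in scikit-learn's label order. The helper name `decode_predictions` and the sample body are hypothetical, and your own model's output may differ.

```python
import json
from typing import List

# Class order used by scikit-learn's iris dataset (target_names).
IRIS_CLASS_NAMES = ["setosa", "versicolor", "virginica"]


def decode_predictions(response_body: str) -> List[str]:
    """Map each predicted class index to its iris species name."""
    predictions = json.loads(response_body)["predictions"]
    return [IRIS_CLASS_NAMES[int(p)] for p in predictions]


# Hypothetical response body, used here only to illustrate the format.
sample_body = '{"predictions": [1, 1]}'
print(decode_predictions(sample_body))  # ['versicolor', 'versicolor']
```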

#Managing and Monitoring ML Model Performance in vCluster

Deploying models with KServe lets you take advantage of the comprehensive Kubernetes ecosystem with tools across the end-to-end pipeline of model development, deployment, and management. In a Kubernetes deployment, you have access to autoscaling to ensure your models are available. Your KServe deployments can also be integrated with Prometheus monitoring and FastAPI documentation right from your YAML declaration. By default, you can access a Model UI to manage your models and integrate tools such as Alibi Outlier/Drift Detector to monitor your model performance directly.

Within your virtual cluster, you have the flexibility to fine-tune and adjust resources according to your workloads in order to ensure optimal utilization. This capability mirrors that of the host cluster but with added isolation, making it ideal for creating separate environments for testing and production purposes.

Once models are serialized and stored, you have the option to make updates directly in Cloud Storage. This process offers convenient methods for programmatic or automatic model versioning and replacement, such as canary rollouts, ensuring smooth transitions between model versions and continuous improvement in your model inferences.
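As an illustration of a canary rollout, the sketch below generates an InferenceService manifest that points at a new model version and sets `canaryTrafficPercent` on the predictor, so only part of the traffic reaches the new revision. It assumes KServe is running in its serverless mode. The function name `canary_manifest` and the storage URI `gs://my-bucket/models/sklearn/2.0/model` are hypothetical, so substitute your own bucket path.

```python
import json


def canary_manifest(name, namespace, storage_uri, canary_percent):
    """Build an InferenceService manifest that routes a share of traffic to a new model."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "predictor": {
                # Share of requests sent to the newly deployed revision.
                "canaryTrafficPercent": canary_percent,
                "model": {
                    "modelFormat": {"name": "sklearn"},
                    "storageUri": storage_uri,
                },
            }
        },
    }


manifest = canary_manifest(
    "sklearn-iris",
    "kserve-test",
    "gs://my-bucket/models/sklearn/2.0/model",  # hypothetical new model version
    10,
)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest can be saved to a file and applied with `kubectl apply -f`. Raising the percentage in later applies shifts more traffic to the new version.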


This article outlined the advantages of utilizing a virtual Kubernetes cluster. You learned how to deploy a machine learning model in your vCluster-enabled environment. You also explored the management and monitoring of ML models with tools and strategies that can be efficiently added to your workflow.

vCluster and KServe together form a powerful combination for deploying ML models. vCluster offers a flexible and efficient way to manage resources, ensuring optimal performance and isolation for different environments. KServe offers a complete platform with ML workflow-specific tools and components. Its containerized architecture simplifies model deployment.

Try out vCluster today to explore its benefits for your testing and production environments. It’s an open source project with a thriving Slack community.
