Deploying a Custom TensorFlow Model with MLServer and Seldon Core



This tutorial walks through the steps required to take a Python ML model from your machine to a production deployment on Kubernetes. More specifically, we’ll cover:

  • Running the model locally

  • Turning the ML model into an API

  • Containerizing the model

  • Storing the container in a registry

  • Deploying the model to Kubernetes (with Seldon Core)

  • Scaling the model

The tutorial comes with an accompanying video which you might find useful as you work through the steps.

The slides used in the video can be found here.

The Use Case

For this tutorial, we’re going to use the Cassava dataset available from the TensorFlow Catalog. This dataset includes leaf images from the cassava plant. Each plant can be classified as either “healthy” or as having one of four diseases (Mosaic Disease, Bacterial Blight, Green Mite, Brown Streak Disease).


We won’t go through the steps of training the classifier. Instead, we’ll be using a pre-trained one available on TensorFlow Hub. You can find the model details here.

Getting Set Up

The easiest way to run this example is to clone the repository located here:

git clone

If you’ve already cloned the MLServer repository, you can also find it in docs/examples/cassava.

Once you’ve done that, you can just run:

cd cassava-example/
pip install -r requirements.txt

And it’ll set you up with all the libraries required to run the code.

Running The Python App

The starting point for this tutorial is a Python script. This is typical of the kind of Python code we’d run standalone or in a Jupyter notebook. Let’s familiarise ourselves with the code:

from helpers import plot, preprocess
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_hub as hub

# Fixes an issue with Jax and TF competing for GPU
tf.config.experimental.set_visible_devices([], 'GPU')

# Load the model
model_path = './model'
classifier = hub.KerasLayer(model_path)

# Load the dataset and store the class names
dataset, info = tfds.load('cassava', with_info=True)
class_names = info.features['label'].names + ['unknown']

# Select a batch of examples and plot them
batch_size = 9
batch = dataset['validation'].map(preprocess).batch(batch_size).as_numpy_iterator()
examples = next(batch)
plot(examples, class_names)

# Generate predictions for the batch and plot them against their labels
predictions = classifier(examples['image'])
predictions_max = tf.argmax(predictions, axis=-1)
plot(examples, class_names, predictions_max)

First up, we’re importing a couple of functions from our helpers module:

  • plot provides the visualisation of the samples, labels and predictions.

  • preprocess is used to resize images to 224x224 pixels and normalize the RGB values.
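The preprocess helper itself isn’t shown in this tutorial. As a rough, dependency-free illustration of the two steps it is described as performing, the sketch below resizes an image (a nested list of rows of [R, G, B] pixels) using nearest-neighbour sampling and scales 8-bit values into the 0 to 1 range. The function name, the nearest-neighbour choice and the 0 to 1 scaling are assumptions made for illustration; a real pipeline would normally rely on a library resize instead.

```python
from typing import List

Pixel = List[float]
Image = List[List[Pixel]]


def resize_and_normalize(image: Image, size: int = 224) -> Image:
    """Resize to size x size (nearest neighbour) and scale 0-255 values to 0-1."""
    height, width = len(image), len(image[0])
    output: Image = []
    for row in range(size):
        # Map each output row/column back to the nearest source row/column.
        source_row = image[row * height // size]
        output.append([
            [channel / 255.0 for channel in source_row[col * width // size]]
            for col in range(size)
        ])
    return output


# Hypothetical 2x3 image: one red row followed by one white row.
tiny = [[[255, 0, 0]] * 3, [[255, 255, 255]] * 3]
prepared = resize_and_normalize(tiny)
print(len(prepared), len(prepared[0]), prepared[0][0], prepared[-1][-1])
```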

The rest of the code is fairly self-explanatory from the comments. We load the model and dataset, select some examples, make predictions and then plot the results.
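To make the prediction step concrete: the classifier produces one score per class for each image, tf.argmax(..., axis=-1) picks the index of the highest score, and that index is looked up in class_names. The sketch below shows the same logic in plain Python. The label strings, their order and the scores are hypothetical placeholders, not the dataset’s actual label names.

```python
from typing import List, Sequence

# Hypothetical label list: five dataset classes plus the extra 'unknown' entry.
CLASS_NAMES = ["blight", "brown_streak", "green_mite", "mosaic", "healthy", "unknown"]


def argmax(scores: Sequence[float]) -> int:
    """Return the index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda index: scores[index])


def top_labels(batch_scores: Sequence[Sequence[float]]) -> List[str]:
    """Map each row of class scores to the name of its highest-scoring class."""
    return [CLASS_NAMES[argmax(row)] for row in batch_scores]


# Made-up scores for a batch of two images.
batch_scores = [
    [0.05, 0.10, 0.05, 0.70, 0.05, 0.05],
    [0.02, 0.03, 0.05, 0.10, 0.75, 0.05],
]
labels = top_labels(batch_scores)
print(labels)
```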

Try it yourself by running:


Here’s what our setup currently looks like: [image: step_1]

Creating an API for The Model

The problem with running our code like we did earlier is that it’s not accessible to anyone who doesn’t have the Python script (and all of its dependencies). A good way to solve this is to turn our model into an API.

Typically, people turn to popular Python web frameworks like Flask or FastAPI. This is a good approach and gives us lots of flexibility, but it also requires us to do a lot of the work ourselves. We need to implement routes, set up logging, capture metrics and define an API schema, among other things. A simpler way to tackle this problem is to use an inference server. For this tutorial we’re going to use the open source MLServer framework.

MLServer supports a bunch of inference runtimes out of the box, but it also supports custom Python code, which is what we’ll use for our TensorFlow model.

Setting Things Up

In order to get our model ready to run on MLServer we need to wrap it in a single Python class with two methods, load() and predict(). Let’s take a look at the code (found in model/

from mlserver import MLModel
from mlserver.codecs import decode_args
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Define a class for our Model, inheriting the MLModel class from MLServer
class CassavaModel(MLModel):

  # Load the model into memory
  async def load(self) -> bool:
    tf.config.experimental.set_visible_devices([], 'GPU')
    model_path = '.'
    self._model = hub.KerasLayer(model_path)
    self.ready = True
    return self.ready

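  # Logic for making predictions against our model.
  # @decode_args converts the incoming request into the annotated np.ndarray
  # and encodes the returned array back into a response.
  @decode_args
  async def predict(self, payload: np.ndarray) -> np.ndarray: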
    # convert payload to tf.tensor
    payload_tensor = tf.constant(payload)

    # Make predictions
    predictions = self._model(payload_tensor)
    predictions_max = tf.argmax(predictions, axis=-1)

    # convert predictions to np.ndarray
    response_data = np.array(predictions_max)

    return response_data

The load() method is used to define any logic required to set up our model for inference. In our case, we’re loading the model weights into self._model. The predict() method is where we include all of our prediction logic.

You may notice that we’ve slightly modified our code from earlier. The biggest change is that it is now wrapped in a single class, CassavaModel.

The only other task we need to do to run our model on MLServer is to specify a model-settings.json file:

{
    "name": "cassava",
    "implementation": "serve-model.CassavaModel"
}

This is a simple configuration file that tells MLServer how to handle our model. In our case, we’ve provided a name for our model and told MLServer where to look for our model class (serve-model.CassavaModel).
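The implementation value is a dotted path: everything before the last dot names the Python module, and the final part names the class inside it. The sketch below illustrates that convention by parsing a settings document and splitting the path. It is not MLServer’s actual loading code, and split_implementation is a hypothetical helper.

```python
import json
from typing import Tuple

SETTINGS_TEXT = """
{
    "name": "cassava",
    "implementation": "serve-model.CassavaModel"
}
"""


def split_implementation(path: str) -> Tuple[str, str]:
    """Split 'module.ClassName' into its module and class parts."""
    module_name, _, class_name = path.rpartition(".")
    if not module_name or not class_name:
        raise ValueError(f"expected 'module.ClassName', got {path!r}")
    return module_name, class_name


settings = json.loads(SETTINGS_TEXT)
module_name, class_name = split_implementation(settings["implementation"])
print(settings["name"], module_name, class_name)
```

A loader could then pass the two parts to importlib.import_module and getattr to obtain the class object.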

Serving The Model

We’re now ready to serve our model with MLServer. To do that we can simply run:

mlserver start model/

MLServer will now start up, load our cassava model and provide access through both a REST and gRPC API.

Making Predictions Using The API

Now that our API is up and running, open a new terminal window and navigate back to the root of this repository. We can then send prediction requests to our API using the provided script by running:

python --local
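For reference, MLServer’s REST API follows the V2 inference protocol (also known as the Open Inference Protocol), in which each request carries a list of named tensors with a shape, a datatype and flattened data. The sketch below builds such a request body using only the standard library. The input name payload (chosen to match the predict() argument), the FP32 datatype and the localhost:8080 address are assumptions; this is not the tutorial’s own test script.

```python
import json
from typing import Any, Dict, List


def tensor_shape(data: Any) -> List[int]:
    """Infer the shape of a regular (non-ragged) nested list."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0] if data else None
    return shape


def flatten(data: Any) -> List[float]:
    """Flatten a nested list in row-major order."""
    if not isinstance(data, list):
        return [data]
    return [value for item in data for value in flatten(item)]


def build_infer_request(batch: List[Any], input_name: str = "payload") -> Dict[str, Any]:
    """Wrap a batch of images in a V2 inference protocol request body."""
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": tensor_shape(batch),
                "datatype": "FP32",
                "data": flatten(batch),
            }
        ]
    }


# Assumed local endpoint for a model named 'cassava'.
INFER_URL = "http://localhost:8080/v2/models/cassava/infer"

# Tiny stand-in batch: one 2x2 RGB "image" instead of a real 224x224 one.
batch = [[[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], [[0.7, 0.8, 0.9], [1.0, 0.0, 0.5]]]]
body = build_infer_request(batch)
print(INFER_URL)
print(json.dumps(body)[:80])
```

The body could then be sent as JSON to INFER_URL with any HTTP client, for example requests.post(INFER_URL, json=body).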

Our setup has now evolved and looks like this: [image: step_2]

Containerizing The Model

Containers are an easy way to package our application together with its runtime and dependencies. More importantly, containerizing our model allows it to run in a variety of different environments.

Note: you will need Docker installed to run this section of the tutorial. You’ll also need a Docker Hub account or access to another container registry.

Taking our model and packaging it into a container manually can be a pretty tricky process and requires knowledge of writing Dockerfiles. Thankfully MLServer removes this complexity and provides us with a simple build command.

Before we run this command, we need to provide our dependencies in either a requirements.txt or a conda.env file. The requirements file we’ll use for this example is stored in model/requirements.txt:


Notice that we didn’t need to include mlserver in our requirements? That’s because the builder image has mlserver included already.

We’re now ready to build our container image using:

mlserver build model/ -t [YOUR_CONTAINER_REGISTRY]/[IMAGE_NAME]

Make sure you replace YOUR_CONTAINER_REGISTRY and IMAGE_NAME with your Docker Hub username and a suitable name, e.g. “bobsmith/cassava”.

MLServer will now build the model into a container image for us. We can check the output of this by running:

docker images

Finally, we want to send this container image to be stored in our container registry. We can do this by running:

docker push [YOUR_CONTAINER_REGISTRY]/[IMAGE_NAME]


Our setup now looks like this, with our model packaged and sent to a container registry: [image: step_3]

Deploying to Kubernetes

Now that we’ve turned our model into a production-ready API, containerized it and pushed it to a registry, it’s time to deploy our model.

We’re going to use a popular open source framework called Seldon Core to deploy our model. Seldon Core is great because it combines the cloud-native features we get from Kubernetes with machine-learning-specific features.

This tutorial assumes you already have a Seldon Core cluster up and running. If that’s not the case, head over to the installation instructions and get set up first. You’ll also need to install the kubectl command line interface.

Creating the Deployment

To create our deployment with Seldon Core we need to create a small configuration file. You can find this file, named deployment.yaml, in the base folder of this tutorial’s repository. It looks like this:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: cassava
spec:
  protocol: v2
  predictors:
    - componentSpecs:
        - spec:
            containers:
              - image: YOUR_CONTAINER_REGISTRY/IMAGE_NAME
                name: cassava
                imagePullPolicy: Always
      graph:
        name: cassava
        type: MODEL
      name: cassava

Make sure you replace YOUR_CONTAINER_REGISTRY and IMAGE_NAME with your Docker Hub username and a suitable name, e.g. “bobsmith/cassava”.

We can apply this configuration file to our Kubernetes cluster just like we would for any other Kubernetes object using:

kubectl create -f deployment.yaml

To check our deployment is up and running we can run:

kubectl get pods

We should see STATUS = Running once our deployment has finalized.

Testing the Deployment

Now that our model is up and running on a Kubernetes cluster (via Seldon Core), we can send some test inference requests to make sure it’s working.

To do this, we simply run the file in the following way:

python --remote

This script will randomly select some test samples, send them to the cluster, gather the predictions and then plot them for us.
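When the model runs behind Seldon Core, requests typically go through the cluster’s ingress, which adds a /seldon/<namespace>/<deployment> prefix in front of the usual V2 path. The helper below sketches how the local and remote URLs might differ. The host, port, default namespace and the exact prefix layout are assumptions to check against your own cluster, not values taken from the tutorial’s script.

```python
def inference_url(host: str, model: str, namespace: str = "", deployment: str = "") -> str:
    """Build a V2 inference URL, optionally behind a Seldon Core ingress prefix."""
    prefix = f"/seldon/{namespace}/{deployment}" if namespace and deployment else ""
    return f"http://{host}{prefix}/v2/models/{model}/infer"


# Direct access to a locally running MLServer (assumed port).
local_url = inference_url("localhost:8080", "cassava")

# Access through a cluster ingress; host, namespace and names are placeholders.
remote_url = inference_url(
    "localhost:8080", "cassava", namespace="default", deployment="cassava"
)

print(local_url)
print(remote_url)
```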

A note on running this yourself: This example is set up to connect to a Kubernetes cluster running locally on your machine. If yours is local too, you’ll need to make sure you port forward before sending requests. If your cluster is remote, you’ll need to change the inference_url variable on line 21 of

Having deployed our model to Kubernetes and tested it, our setup now looks like this: [image: step_4]

Scaling the Model

Our model is now running in a production environment and able to handle requests from external sources. This is awesome, but what happens as the number of requests being sent to our model starts to increase? Eventually, we’ll reach the limit of what a single server can handle. Thankfully, we can get around this problem by scaling our model horizontally.

Kubernetes and Seldon Core make this really easy to do by simply running:

kubectl scale sdep cassava --replicas=3

We can replace the --replicas=3 with any number we want to scale to.

To watch the servers scaling out we can run:

kubectl get pods --watch

Once the new replicas have finished rolling out, our setup looks like this: [image: step_5]

In this tutorial we’ve scaled the model out manually to show how it works. In a real environment we’d want to set up auto-scaling to make sure our prediction API is always online and performing as expected.
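As a rough idea of what auto-scaling does, Kubernetes’ Horizontal Pod Autoscaler uses a proportional rule: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up. The sketch below is a simplified version of that calculation. It ignores tolerances, stabilisation windows and maximum limits, and the CPU figures are hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Simplified proportional scaling rule; never drops below one replica."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    return max(1, math.ceil(current_replicas * current_metric / target_metric))


# Hypothetical numbers: 3 replicas at 90% average CPU, targeting 60%.
scaled_out = desired_replicas(3, current_metric=90.0, target_metric=60.0)

# Load then drops to 20% average CPU across those replicas.
scaled_in = desired_replicas(scaled_out, current_metric=20.0, target_metric=60.0)

print(scaled_out, scaled_in)
```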