MLModel¶

The MLModel class is the base class for all custom inference runtimes. It exposes the main interface that MLServer will use to interact with ML models.

The bulk of its public interface consists of the load(), unload() and predict() methods. However, it also contains helpers for encoding / decoding requests and responses, as well as properties to access the most common bits of the model’s metadata.

When writing custom runtimes, this class should be extended to implement your own load and predict logic.
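
As a minimal sketch, a custom runtime might look as follows (the doubling “model” is just a stand-in for real inference logic, and the codec choices are assumptions):

```python
from mlserver import MLModel
from mlserver.codecs import NumpyRequestCodec
from mlserver.types import InferenceRequest, InferenceResponse


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Load the model artefact here; a toy doubling function stands in
        # for a real model.
        self._model = lambda x: x * 2
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the request into a NumPy array, run inference and encode
        # the result back into an InferenceResponse.
        data = self.decode_request(payload, default_codec=NumpyRequestCodec)
        result = self._model(data)
        return self.encode_response(result, default_codec=NumpyRequestCodec)
```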

class mlserver.MLModel(settings: ModelSettings)¶

Abstract inference runtime which exposes the main interface to interact with ML models.

async load() → bool¶

Method responsible for loading the model from a model artefact. This method will be called on each of the parallel workers (when parallel inference is enabled). Its return value will represent the model’s readiness status. A return value of True will mean the model is ready.

This method can be overridden to implement your custom load logic.
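
For example, a runtime that loads a model artefact serialised with joblib could override load() along these lines (the use of joblib, and of the get_model_uri helper from mlserver.utils to resolve the artefact path, are assumptions about your setup):

```python
import joblib

from mlserver import MLModel
from mlserver.utils import get_model_uri


class JoblibRuntime(MLModel):
    async def load(self) -> bool:
        # Resolve the artefact path from the model settings and load it.
        model_uri = await get_model_uri(self.settings)
        self._model = joblib.load(model_uri)
        return True
```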

async predict(payload: InferenceRequest) → InferenceResponse¶

Method responsible for running inference on the model.

This method can be overridden to implement your custom inference logic.
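
As a sketch, a predict() override could decode the first input tensor, run some computation and build the response explicitly (the tensor names and the summing “model” are illustrative assumptions):

```python
import numpy as np

from mlserver import MLModel
from mlserver.codecs import NumpyCodec
from mlserver.types import InferenceRequest, InferenceResponse


class SumRuntime(MLModel):
    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the first input tensor into a NumPy array.
        data = self.decode(payload.inputs[0], default_codec=NumpyCodec)

        # Toy "inference": sum along the last axis.
        totals = np.sum(data, axis=-1, keepdims=True)

        # Encode the result as a single response output.
        output = NumpyCodec.encode_output(name="totals", payload=totals)
        return InferenceResponse(
            model_name=self.name,
            model_version=self.version,
            outputs=[output],
        )
```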

async predict_stream(payloads: AsyncIterator[InferenceRequest]) → AsyncIterator[InferenceResponse]¶

Method responsible for running generation on the model, streaming a set of responses back to the client.

This method can be overridden to implement your custom inference logic.
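
A streaming override consumes the incoming iterator of requests and yields responses as they become available. The sketch below assumes plain-text inputs and simply echoes each whitespace-separated token back as a separate response:

```python
from typing import AsyncIterator

from mlserver import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class EchoStreamRuntime(MLModel):
    async def predict_stream(
        self, payloads: AsyncIterator[InferenceRequest]
    ) -> AsyncIterator[InferenceResponse]:
        async for payload in payloads:
            # Decode the first input tensor as a list of strings and take
            # the first element.
            [text] = StringCodec.decode_input(payload.inputs[0])

            # Stream one response back per token.
            for token in text.split():
                yield InferenceResponse(
                    model_name=self.name,
                    outputs=[StringCodec.encode_output("tokens", [token])],
                )
```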

async unload() → bool¶

Method responsible for unloading the model, freeing any resources (e.g. CPU memory, GPU memory, etc.). This method will be called on each of the parallel workers (when parallel inference is enabled). A return value of True will mean the model is now unloaded.

This method can be overridden to implement your custom unload logic.
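
For example, a runtime that opens some external resource in load() could release it in unload() (the _session attribute below is hypothetical):

```python
from mlserver import MLModel


class ManagedResourceRuntime(MLModel):
    async def unload(self) -> bool:
        # Release any resources acquired in load(); the _session attribute
        # is a hypothetical handle created there.
        if hasattr(self, "_session"):
            self._session.close()
            del self._session
        return True
```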

property name: str¶

Model name, from the model settings.

property version: str | None¶

Model version, from the model settings.

property settings: ModelSettings¶

Model settings.

property inputs: List[MetadataTensor] | None¶

Expected model inputs, from the model settings.

Note that this property can also be modified at the model’s load time to inject any inputs metadata.

property outputs: List[MetadataTensor] | None¶

Expected model outputs, from the model settings.

Note that this property can also be modified at the model’s load time to inject any outputs metadata, as shown in the sketch below.
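
As a sketch, a runtime could inject both inputs and outputs metadata at load time (the tensor names, datatypes and shapes below are assumptions):

```python
from mlserver import MLModel
from mlserver.types import MetadataTensor


class MetadataAwareRuntime(MLModel):
    async def load(self) -> bool:
        # Inject tensor metadata at load time, so that it gets exposed
        # through the model metadata endpoints.
        self.inputs = [
            MetadataTensor(name="features", datatype="FP32", shape=[-1, 4])
        ]
        self.outputs = [
            MetadataTensor(name="predictions", datatype="FP32", shape=[-1, 1])
        ]
        return True
```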

decode(request_input: RequestInput, default_codec: Type[InputCodec] | InputCodec | None = None) → Any¶

Helper to decode a request input into its corresponding high-level Python object. This method will find the most appropriate input codec based on the model’s metadata and the input’s content type. Otherwise, it will fall back to the codec specified in the default_codec kwarg.

decode_request(inference_request: InferenceRequest, default_codec: Type[RequestCodec] | RequestCodec | None = None) → Any¶

Helper to decode an inference request into its corresponding high-level Python object. This method will find the most appropriate request codec based on the model’s metadata and the request’s content type. Otherwise, it will fall back to the codec specified in the default_codec kwarg.
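
The sketch below shows both decoding helpers inside a predict() method; the fallback codecs (NumpyCodec for a single input, PandasCodec for the whole request) are arbitrary choices for illustration:

```python
from mlserver import MLModel
from mlserver.codecs import NumpyCodec, NumpyRequestCodec, PandasCodec
from mlserver.types import InferenceRequest, InferenceResponse


class DecodingRuntime(MLModel):
    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # decode() works on a single input tensor; it falls back to
        # NumpyCodec when no content type is declared in the metadata.
        first_input = self.decode(payload.inputs[0], default_codec=NumpyCodec)

        # decode_request() works on the whole request; it falls back to
        # PandasCodec, returning a pandas DataFrame.
        dataframe = self.decode_request(payload, default_codec=PandasCodec)

        # (Inference logic omitted; echo the first input back.)
        return self.encode_response(first_input, default_codec=NumpyRequestCodec)
```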

encode_response(payload: Any, default_codec: Type[RequestCodec] | RequestCodec | None = None) → InferenceResponse¶

Helper to encode a high-level Python object into its corresponding inference response. This method will find the most appropriate request codec based on the payload’s type. Otherwise, it will fall back to the codec specified in the default_codec kwarg.

encode(payload: Any, request_output: RequestOutput, default_codec: Type[InputCodec] | InputCodec | None = None) → ResponseOutput¶

Helper to encode a high-level Python object into its corresponding response output. This method will find the most appropriate input codec based on the model’s metadata, the request output’s content type or the payload’s type. Otherwise, it will fall back to the codec specified in the default_codec kwarg.
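
The sketch below contrasts the two encoding helpers; the output name and the hard-coded result are illustrative assumptions:

```python
import numpy as np

from mlserver import MLModel
from mlserver.codecs import NumpyCodec, NumpyRequestCodec
from mlserver.types import InferenceRequest, InferenceResponse, RequestOutput


class EncodingRuntime(MLModel):
    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        result = np.array([[0.1, 0.9]])

        # Option 1: encode the whole payload into an InferenceResponse,
        # falling back to NumpyRequestCodec when no codec can be inferred.
        response = self.encode_response(result, default_codec=NumpyRequestCodec)

        # Option 2: encode a single output tensor with encode() and assemble
        # the response manually (request_output carries the name and content
        # type requested for that output).
        request_output = RequestOutput(name="probabilities")
        output = self.encode(result, request_output, default_codec=NumpyCodec)
        response = InferenceResponse(
            model_name=self.name,
            model_version=self.version,
            outputs=[output],
        )

        return response
```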