HuggingFace runtime for MLServer

This package provides an MLServer runtime compatible with HuggingFace Transformers.

Usage

You can install the runtime, alongside mlserver, as follows:

pip install mlserver mlserver-huggingface

For further information on how to use MLServer with HuggingFace, you can check out this worked-out example.

Settings

The HuggingFace runtime exposes a couple of extra parameters which can be used to customise how the runtime behaves. These settings can be added under the parameters.extra section of your model-settings.json file, e.g.

{
  "name": "qa",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "question-answering",
      "optimum_model": true
    }
  }
}

Note

These settings can also be injected through environment variables prefixed with MLSERVER_MODEL_HUGGINGFACE_, e.g.

MLSERVER_MODEL_HUGGINGFACE_TASK="question-answering"
MLSERVER_MODEL_HUGGINGFACE_OPTIMUM_MODEL=true

Reference

You can find the full reference of the accepted extra settings for the HuggingFace runtime below:

pydantic settings mlserver_huggingface.settings.HuggingFaceSettings

Parameters that apply only to HuggingFace models

Config:
  • env_prefix: str = MLSERVER_MODEL_HUGGINGFACE_

Fields:
field device: int = -1

Device in which this pipeline will be loaded (e.g., “cpu”, “cuda:1”, “mps”, or a GPU ordinal rank like 1).
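For instance, since the field is declared as an integer, you could pin the pipeline to the first GPU by setting the device ordinal under parameters.extra. This is a minimal sketch; the model name and task are illustrative, and it assumes the usual Transformers convention where -1 means CPU and 0 is the first GPU:

{
  "name": "qa-gpu",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "question-answering",
      "device": 0
    }
  }
}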

field framework: str | None = None

The framework to use, either “pt” for PyTorch or “tf” for TensorFlow.
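For example, to force the PyTorch backend you could set framework to "pt" (a sketch; it assumes PyTorch is installed in the serving environment, and the task shown is illustrative):

{
  "name": "classifier",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "text-classification",
      "framework": "pt"
    }
  }
}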

field inter_op_threads: int | None = None

Threads used for parallelism between independent operations (see the combined example after the next field).
  • PyTorch: https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html
  • TensorFlow: https://www.tensorflow.org/api_docs/python/tf/config/threading/set_inter_op_parallelism_threads

field intra_op_threads: int | None = None

Threads used within an individual op for parallelism.
  • PyTorch: https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html
  • TensorFlow: https://www.tensorflow.org/api_docs/python/tf/config/threading/set_intra_op_parallelism_threads
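As a sketch, both thread pools could be sized explicitly for a CPU-bound deployment; the values below are illustrative and should be tuned to your hardware:

{
  "name": "classifier",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "text-classification",
      "inter_op_threads": 1,
      "intra_op_threads": 4
    }
  }
}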

field optimum_model: bool = False

Flag to decide whether the pipeline should use an Optimum-optimised model or the standard Transformers model. Under the hood, this will enable the model to use the optimised ONNX Runtime.

field pretrained_model: str | None = None

Name of the model that should be loaded in the pipeline.

field pretrained_tokenizer: str | None = None

Name of the tokenizer that should be loaded in the pipeline.
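For example, the model and tokenizer to load can be pinned explicitly. The identifiers below are ordinary Hugging Face Hub names, used here purely for illustration:

{
  "name": "qa",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "question-answering",
      "pretrained_model": "distilbert-base-cased-distilled-squad",
      "pretrained_tokenizer": "distilbert-base-cased-distilled-squad"
    }
  }
}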

field task: str = ''

Pipeline task to load. You can see the tasks available for Optimum and Transformers in the links below:

field task_suffix: str = ''

Suffix to append to the base task name. Useful for, e.g., translation tasks, which require a suffix on the task name to specify the source and target languages (see the example below).

property task_name
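Judging from the fields above, task_name is the base task with task_suffix appended, and it is the value ultimately used to build the pipeline. As a sketch, a translation deployment could then be configured as follows, where the suffix encodes the language pair and the resulting task name would be translation_en_to_fr:

{
  "name": "translator",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "translation",
      "task_suffix": "_en_to_fr"
    }
  }
}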