HuggingFace runtime for MLServer

This package provides an MLServer runtime compatible with HuggingFace Transformers.

Usage

You can install the runtime, alongside mlserver, as follows:

pip install mlserver mlserver-huggingface

For further information on how to use MLServer with HuggingFace, you can check out this worked-out example.

Settings

The HuggingFace runtime exposes a couple of extra parameters which can be used to customise how the runtime behaves. These settings can be added under the parameters.extra section of your model-settings.json file, e.g.

{
  "name": "qa",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "question-answering",
      "optimum_model": true
    }
  }
}

Note

These settings can also be injected through environment variables prefixed with MLSERVER_MODEL_HUGGINGFACE_, e.g.

MLSERVER_MODEL_HUGGINGFACE_TASK="question-answering"
MLSERVER_MODEL_HUGGINGFACE_OPTIMUM_MODEL=true

Reference

You can find the full reference of the accepted extra settings for the HuggingFace runtime below:

pydantic settings mlserver_huggingface.settings.HuggingFaceSettings

Parameters that apply only to HuggingFace models

Config:
  • env_prefix: str = MLSERVER_MODEL_HUGGINGFACE_

Fields:
field device: int = -1

Device in which this pipeline will be loaded (e.g., “cpu”, “cuda:1”, “mps”, or a GPU ordinal rank like 1).
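For instance, since the field is declared as an integer, you could pin the pipeline to the first GPU by setting the device ordinal under parameters.extra. This is a minimal sketch; the model name and task are illustrative, and it assumes the usual Transformers convention where -1 means CPU and 0 is the first GPU:

{
  "name": "qa-gpu",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "question-answering",
      "device": 0
    }
  }
}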

field framework: str | None = None

The framework to use, either “pt” for PyTorch or “tf” for TensorFlow.
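For example, to force the PyTorch backend you could set framework to "pt" (a sketch; it assumes PyTorch is installed in the serving environment, and the task shown is illustrative):

{
  "name": "classifier",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "text-classification",
      "framework": "pt"
    }
  }
}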

field inter_op_threads: int | None = None

Threads used for parallelism between independent operations (see the combined example after the next field).
  • PyTorch: https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html
  • TensorFlow: https://www.tensorflow.org/api_docs/python/tf/config/threading/set_inter_op_parallelism_threads

field intra_op_threads: int | None = None

Threads used within an individual op for parallelism.
  • PyTorch: https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html
  • TensorFlow: https://www.tensorflow.org/api_docs/python/tf/config/threading/set_intra_op_parallelism_threads
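As a sketch, both thread pools could be sized explicitly for a CPU-bound deployment; the values below are illustrative and should be tuned to your hardware:

{
  "name": "classifier",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "text-classification",
      "inter_op_threads": 1,
      "intra_op_threads": 4
    }
  }
}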

field optimum_model: bool = False

Flag to decide whether the pipeline should use an Optimum-optimised model or the standard Transformers model. Under the hood, this will enable the model to use the optimised ONNX Runtime.

field pretrained_model: str | None = None

Name of the model that should be loaded in the pipeline.

field pretrained_tokenizer: str | None = None

Name of the tokenizer that should be loaded in the pipeline.
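For example, the model and tokenizer to load can be pinned explicitly. The identifiers below are ordinary Hugging Face Hub names, used here purely for illustration:

{
  "name": "qa",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "question-answering",
      "pretrained_model": "distilbert-base-cased-distilled-squad",
      "pretrained_tokenizer": "distilbert-base-cased-distilled-squad"
    }
  }
}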

field task: str = ''

Pipeline task to load. You can see the tasks available for Optimum and Transformers in the links below:

field task_suffix: str = ''

Suffix to append to the base task name. Useful for, e.g., translation tasks, which require a suffix on the task name to specify the source and target languages (see the example below).

property task_name
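Judging from the fields above, task_name is the base task with task_suffix appended, and it is the value ultimately used to build the pipeline. As a sketch, a translation deployment could then be configured as follows, where the suffix encodes the language pair and the resulting task name would be translation_en_to_fr:

{
  "name": "translator",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "translation",
      "task_suffix": "_en_to_fr"
    }
  }
}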