# HuggingFace runtime for MLServer
This package provides an MLServer runtime compatible with HuggingFace Transformers.
## Usage
You can install the runtime, alongside `mlserver`, as:

```bash
pip install mlserver mlserver-huggingface
```
For further information on how to use MLServer with HuggingFace, you can check out this worked-out example.
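As a quick illustration, the sketch below sends a question-answering request to a model served by this runtime over the V2 inference protocol. It assumes MLServer is running locally on the default HTTP port (8080), that the model is named `qa` as in the configuration shown in the next section, and that the pipeline's `question` and `context` arguments are passed as named inputs:

```python
import requests

# V2 inference protocol payload for a question-answering pipeline.
# The input names ("question", "context") map to the pipeline's
# keyword arguments; the model name "qa" matches the example config.
inference_request = {
    "inputs": [
        {
            "name": "question",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["What does MLServer do?"],
        },
        {
            "name": "context",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["MLServer is an open-source inference server for machine learning models."],
        },
    ]
}

response = requests.post(
    "http://localhost:8080/v2/models/qa/infer",
    json=inference_request,
)
print(response.json())
```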
## Settings
The HuggingFace runtime exposes a couple of extra parameters which can be used to customise how the runtime behaves. These settings can be added under the `parameters.extra` section of your `model-settings.json` file, e.g.
```json
{
  "name": "qa",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "question-answering",
      "optimum_model": true
    }
  }
}
```
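With that file in place, the model can then be served from the folder containing `model-settings.json` using MLServer's CLI:

```bash
mlserver start .
```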
> **Note:** These settings can also be injected through environment variables prefixed with `MLSERVER_MODEL_HUGGINGFACE_`, e.g.
>
> ```bash
> MLSERVER_MODEL_HUGGINGFACE_TASK="question-answering"
> MLSERVER_MODEL_HUGGINGFACE_OPTIMUM_MODEL=true
> ```
## Reference
You can find the full reference of the accepted extra settings for the HuggingFace runtime below:
### pydantic settings `mlserver_huggingface.settings.HuggingFaceSettings`

Parameters that apply only to HuggingFace models.

- Config:
  - `env_prefix: str = MLSERVER_MODEL_HUGGINGFACE_`
- Fields:
  - `device: int = -1`
    Device on which this pipeline will be loaded (e.g., "cpu", "cuda:1", "mps", or a GPU ordinal rank like 1).
  - `framework: str | None = None`
    The framework to use, either "pt" for PyTorch or "tf" for TensorFlow.
  - `inter_op_threads: int | None = None`
    Threads used for parallelism between independent operations. See the PyTorch (https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html) and TensorFlow (https://www.tensorflow.org/api_docs/python/tf/config/threading/set_inter_op_parallelism_threads) docs.
  - `intra_op_threads: int | None = None`
    Threads used within an individual op for parallelism. See the PyTorch (https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html) and TensorFlow (https://www.tensorflow.org/api_docs/python/tf/config/threading/set_intra_op_parallelism_threads) docs.
  - `optimum_model: bool = False`
    Flag to decide whether the pipeline should use an Optimum-optimised model or the standard Transformers model. Under the hood, this will enable the model to use the optimised ONNX runtime.
  - `pretrained_model: str | None = None`
    Name of the model that should be loaded in the pipeline.
  - `pretrained_tokenizer: str | None = None`
    Name of the tokenizer that should be loaded in the pipeline.
  - `task: str = ''`
    Pipeline task to load. See the Optimum and Transformers documentation for the tasks available in each.
  - `task_suffix: str = ''`
    Suffix to append to the base task name. Useful for, e.g., translation tasks, which require a suffix on the task name to specify the source and target languages (such as `translation_en_to_fr`).
  - property `task_name`
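As an illustrative (not exhaustive) combination of the settings above, the following `model-settings.json` sketch configures a translation model on CPU; the model name, task suffix and thread counts are assumptions chosen for the example:

```json
{
  "name": "translation",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "translation",
      "task_suffix": "_en_to_fr",
      "pretrained_model": "t5-small",
      "framework": "pt",
      "device": -1,
      "inter_op_threads": 2,
      "intra_op_threads": 4
    }
  }
}
```

Here `task` and `task_suffix` combine into the full pipeline task (`translation_en_to_fr`), and `device: -1` keeps the pipeline on CPU.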