OpenVINO IR Model Conversion Tool

Enter model information to generate an optimum-cli export command. Use the arguments below to configure the export process based on the OpenVINO exporter documentation. Then run the generated command in the terminal where your OpenArc environment is activated.

Model

Model ID on huggingface.co or path on disk to load model from.

Output Directory

Path to the directory where the generated OpenVINO (OV) model will be stored.
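As a rough sketch of how the Model and Output Directory fields might map onto the command line, the snippet below assumes the `optimum-cli export openvino` subcommand with a `--model` flag and a positional output directory. The function name, model ID, and path are placeholders chosen for illustration.

```python
import shlex


def basic_export_command(model: str, output_dir: str) -> str:
    """Build the smallest export command: a model source and an output directory."""
    # Assumption: the output directory is passed as the final positional argument.
    args = ["optimum-cli", "export", "openvino", "--model", model, output_dir]
    # shlex.join quotes any argument containing spaces or shell metacharacters.
    return shlex.join(args)


print(basic_export_command("org/model-name", "ov_models/model-name-ov"))
```

Quoting matters mostly for local paths: a directory name containing a space would otherwise be split into two arguments by the shell.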

Task

The task to export the model for. If not specified, the task will be auto-inferred based on metadata in the model repository.

Framework

The framework to use for the export. If not provided, will attempt to use the local checkpoint's original framework or what is available in the environment.

Trust Remote Code

Allows custom modeling code hosted in the model repository to be used. Set this option only for repositories you trust and whose code you have read, because it will execute arbitrary code from the model repository on your local machine.

Weight Format

The weight format of the exported model.

Quantization Mode

Quantization precision mode. This is used for applying full model quantization including activations.

Library

The library used to load the model before export. If not provided, the exporter will attempt to infer the local checkpoint's library.

Cache Directory

The path to a directory in which the downloaded model should be cached if the standard cache should not be used.

Pad Token ID

Required by some models for some tasks. If not provided, the exporter will attempt to infer it from the tokenizer.

Variant

If specified, weights are loaded from the variant filename.

Ratio

A parameter used when applying 4-bit quantization to control the ratio between 4-bit and 8-bit quantization. If set to 0.8, 80% of the layers will be quantized to int4 while 20% will be quantized to int8. A lower ratio helps achieve better accuracy at the cost of larger model size and higher inference latency. The default value is 1.0. Note: if a dataset is provided and the ratio is less than 1.0, data-aware mixed-precision assignment will be applied.
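The arithmetic behind the ratio can be sketched as follows. This is only an illustration of the 80%/20% split described above, using a hypothetical helper and a simple layer count; the exporter's actual assignment may differ, for example by weighing layers by size or sensitivity.

```python
def split_by_ratio(num_layers: int, ratio: float) -> tuple[int, int]:
    """Return (int4_layers, int8_layers) for a given ratio between 0.0 and 1.0."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Illustrative only: a plain layer count, rounded to the nearest layer.
    int4_layers = round(num_layers * ratio)
    return int4_layers, num_layers - int4_layers


print(split_by_ratio(40, 0.8))  # (32, 8)
```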

Symmetric Quantization

Whether to apply symmetric quantization.

Group Size

The group size to use for quantization. The recommended value is 128; -1 selects per-column quantization.
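To make the effect of the group size concrete, here is a minimal pure-Python sketch that computes one symmetric scale per group of weights. It assumes a signed 4-bit range whose largest positive value is 7 and treats -1 as "one group for the whole list"; the helper is hypothetical and is not how the exporter itself is implemented.

```python
def groupwise_scales(weights: list[float], group_size: int) -> list[float]:
    """Return one symmetric int4 scale per group of `group_size` weights."""
    if group_size == -1:
        # Assumption: -1 means a single scale shared by the whole list.
        group_size = len(weights)
    if group_size <= 0 or len(weights) % group_size != 0:
        raise ValueError("group_size must evenly divide the number of weights")
    scales = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group onto the int4 value 7.
        scales.append(max_abs / 7 if max_abs else 1.0)
    return scales


row = [3.5, -1.0, 0.5, 0.25, 14.0, -7.0, 2.0, 1.0]
print(groupwise_scales(row, 4))   # [0.5, 2.0]
print(groupwise_scales(row, -1))  # [2.0]
```

Smaller groups let each scale track its local range of values more closely, at the price of storing more scales.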

Backup Precision

Defines a backup precision for mixed-precision weight compression. Only valid for 4-bit weight formats. If not provided, the backup precision is int8_asym. 'none' keeps the weights in the model's original floating-point precision, without any quantization. 'int8_sym' stands for 8-bit integer symmetric quantization without a zero point. 'int8_asym' stands for 8-bit integer asymmetric quantization with a zero point for each quantization group.
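The constraint that a backup precision only applies to 4-bit weight formats could be checked before the command is generated. The sketch below assumes a `--backup-precision` flag and an illustrative set of 4-bit format names; both the set and the helper name are assumptions, so check them against the exporter's help output.

```python
# Assumption: names of 4-bit weight formats; adjust to what your exporter version lists.
FOUR_BIT_FORMATS = {"int4", "mxfp4", "nf4"}
BACKUP_PRECISIONS = {"none", "int8_sym", "int8_asym"}


def backup_precision_args(weight_format, backup_precision=None):
    """Return the command-line arguments for the backup precision, if any."""
    if backup_precision is None:
        # Nothing is emitted; the exporter then falls back to int8_asym.
        return []
    if backup_precision not in BACKUP_PRECISIONS:
        raise ValueError(f"unknown backup precision: {backup_precision}")
    if weight_format not in FOUR_BIT_FORMATS:
        raise ValueError("backup precision is only valid for 4-bit weight formats")
    return ["--backup-precision", backup_precision]


print(backup_precision_args("int4", "int8_sym"))
```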

Dataset

The dataset used for data-aware compression or quantization with NNCF. For language models, you can use one from the list ['auto','wikitext2','c4','c4-new']. With 'auto', the dataset will be collected from the model's generations. For diffusion models, it should be one of ['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit']. For visual language models, the dataset must be set to 'contextual'. Note: if none of the data-aware compression algorithms are selected and the ratio parameter is omitted or equals 1.0, the dataset argument will have no effect on the resulting model.
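The per-model-type dataset choices listed above can be captured in a small lookup. This is a sketch only: the dataset names come from the description, while the model-kind keys and the helper name are hypothetical labels introduced for illustration.

```python
# Dataset names are taken from the field description; the keys are hypothetical labels.
DATASETS_BY_MODEL_KIND = {
    "language": ["auto", "wikitext2", "c4", "c4-new"],
    "diffusion": [
        "conceptual_captions",
        "laion/220k-GPT4Vision-captions-from-LIVIS",
        "laion/filtered-wit",
    ],
    "visual-language": ["contextual"],
}


def is_valid_dataset(model_kind: str, dataset: str) -> bool:
    """Check whether `dataset` is one of the listed choices for `model_kind`."""
    return dataset in DATASETS_BY_MODEL_KIND.get(model_kind, [])


print(is_valid_dataset("language", "wikitext2"))   # True
print(is_valid_dataset("diffusion", "wikitext2"))  # False
```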

All Layers

Whether embeddings and last MatMul layers should be compressed to INT4. If not provided and weight compression is applied, they are compressed to INT8.

AWQ

Whether to apply the AWQ algorithm. AWQ improves the generation quality of INT4-compressed LLMs, but requires additional time for tuning weights on a calibration dataset. To run AWQ, also provide a dataset argument. Note: the model may contain no matching patterns for AWQ, in which case it will be skipped.

Scale Estimation

Indicates whether to apply a scale estimation algorithm that minimizes the L2 error between the original and compressed layers. A dataset is required to run scale estimation. Note that applying scale estimation takes additional memory and time.

GPTQ

Indicates whether to apply the GPTQ algorithm, which optimizes compressed weights layer by layer to minimize the difference between the activations of a compressed layer and the original layer. Note that applying GPTQ takes additional memory and time.

LoRA Correction

Indicates whether to apply the LoRA Correction algorithm. When enabled, this algorithm introduces low-rank adaptation layers in the model that can recover accuracy after weight compression, at some cost in inference latency. Note that applying the LoRA Correction algorithm takes additional memory and time.
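The four algorithm switches (AWQ, Scale Estimation, GPTQ, LoRA Correction) interact with the Dataset field, so a generator might validate them together. The sketch below assumes the flags `--awq`, `--scale-estimation`, `--gptq`, `--lora-correction`, and `--dataset`, and enforces the dataset requirement only for AWQ and scale estimation, since those are the two cases the descriptions state explicitly; the other algorithms may also need a dataset in practice. The helper name is hypothetical.

```python
def data_aware_args(awq=False, scale_estimation=False, gptq=False,
                    lora_correction=False, dataset=None):
    """Return flags for the selected algorithms, checking the dataset requirement."""
    # Only AWQ and scale estimation are described as requiring a dataset.
    needs_dataset = [
        name
        for name, enabled in (("AWQ", awq), ("scale estimation", scale_estimation))
        if enabled
    ]
    if needs_dataset and not dataset:
        raise ValueError(f"a dataset is required for: {', '.join(needs_dataset)}")

    args = []
    if dataset:
        args += ["--dataset", dataset]
    if awq:
        args.append("--awq")
    if scale_estimation:
        args.append("--scale-estimation")
    if gptq:
        args.append("--gptq")
    if lora_correction:
        args.append("--lora-correction")
    return args


print(data_aware_args(awq=True, scale_estimation=True, dataset="wikitext2"))
```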

Sensitivity Metric

The sensitivity metric for assigning quantization precision to layers. It can be one of the following: ['weight_quantization_error', 'hessian_input_activation', 'mean_activation_variance', 'max_activation_variance', 'mean_activation_magnitude'].

Number of Samples

The maximum number of samples to take from the dataset for quantization.

Smooth Quant Alpha

SmoothQuant alpha parameter that improves the distribution of activations before MatMul layers and reduces quantization error. Valid only when activation quantization is enabled.

Disable Stateful

Disables stateful conversion so that stateless models are generated instead. Stateful models are produced by default when this option is not used. In stateful models, all KV-cache inputs and outputs are hidden inside the model and are not exposed as model inputs and outputs. Using the --disable-stateful option may result in sub-optimal inference performance. Use it only when you intentionally want a stateless model, for example, to be compatible with existing OpenVINO native inference code that expects KV-cache inputs and outputs in the model.

Disable Convert Tokenizer

Do not add converted tokenizer and detokenizer OpenVINO models.

Generated Command
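A generated command can be thought of as the fixed `optimum-cli export openvino` prefix, followed by one `--name value` pair for each filled-in field, one bare flag for each ticked checkbox, and the output directory last. The following is a minimal sketch under those assumptions; the function name and example values are placeholders, and the flag spellings should be checked against the exporter's help output.

```python
import shlex


def build_command(model, output_dir, options=None, flags=None):
    """Assemble an export command from valued options and boolean flags."""
    args = ["optimum-cli", "export", "openvino", "--model", model]
    for name, value in (options or {}).items():
        # Fields left empty in the form are omitted so exporter defaults apply.
        if value is None or value == "":
            continue
        args += [f"--{name}", str(value)]
    for name, enabled in (flags or {}).items():
        # Checkboxes become bare flags with no value.
        if enabled:
            args.append(f"--{name}")
    # Assumption: the output directory is the final positional argument.
    args.append(output_dir)
    return shlex.join(args)


print(build_command(
    "org/model-name",
    "ov_out",
    options={"weight-format": "int4", "group-size": 128, "ratio": 0.8, "task": None},
    flags={"sym": True, "disable-stateful": False},
))
```

With these example values, the empty Task field and the unticked Disable Stateful checkbox are left out of the command entirely.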