# DeiT

## Overview

DeiT モデルは、Hugo Touvron、Matthieu Cord、Matthijs Douze、Francisco Massa、Alexandre
Sablayrolles, Hervé Jégou.によって [Training data-efficient image Transformers & distillation through attention](https://huggingface.co/papers/2012.12877) で提案されました。
サブレイロール、エルヴェ・ジェグー。 [Dosovitskiy et al., 2020](https://huggingface.co/papers/2010.11929) で紹介された [Vision Transformer (ViT)](vit) は、既存の畳み込みニューラルと同等、またはそれを上回るパフォーマンスを発揮できることを示しました。
Transformer エンコーダ (BERT のような) を使用したネットワーク。ただし、その論文で紹介された ViT モデルには、次のトレーニングが必要でした。
外部データを使用して、数週間にわたる高価なインフラストラクチャ。 DeiT (データ効率の高い画像変換器) はさらに優れています
画像分類用に効率的にトレーニングされたトランスフォーマーにより、必要なデータとコンピューティング リソースがはるかに少なくなります。
オリジナルの ViT モデルとの比較。

論文の要約は次のとおりです。

*最近、純粋に注意に基づくニューラル ネットワークが、画像などの画像理解タスクに対処できることが示されました。
分類。ただし、これらのビジュアル トランスフォーマーは、
インフラストラクチャが高価であるため、その採用が制限されています。この作業では、コンボリューションフリーの競争力のあるゲームを作成します。
Imagenet のみでトレーニングしてトランスフォーマーを作成します。 1 台のコンピューターで 3 日以内にトレーニングを行います。私たちの基準となるビジョン
トランス (86M パラメータ) は、外部なしで ImageNet 上で 83.1% (単一クロップ評価) のトップ 1 の精度を達成します。
データ。さらに重要なのは、トランスフォーマーに特有の教師と生徒の戦略を導入することです。蒸留に依存している
学生が注意を払って教師から学ぶことを保証するトークン。私たちはこのトークンベースに興味を示します
特に convnet を教師として使用する場合。これにより、convnet と競合する結果を報告できるようになります。
Imagenet (最大 85.2% の精度が得られます) と他のタスクに転送するときの両方で。私たちはコードを共有し、
モデル。*

このモデルは、[nielsr](https://huggingface.co/nielsr) によって提供されました。このモデルの TensorFlow バージョンは、[amyeroberts](https://huggingface.co/amyeroberts) によって追加されました。

## Usage tips

- ViT と比較して、DeiT モデルはいわゆる蒸留トークンを使用して教師から効果的に学習します (これは、
  DeiT 論文は、ResNet のようなモデルです)。蒸留トークンは、バックプロパゲーションを通じて、と対話することによって学習されます。
  セルフアテンション層を介したクラス ([CLS]) とパッチ トークン。
- 抽出されたモデルを微調整するには 2 つの方法があります。(1) 上部に予測ヘッドを配置するだけの古典的な方法。
  クラス トークンの最終的な非表示状態を抽出し、蒸留シグナルを使用しない、または (2) 両方の
  予測ヘッドはクラス トークンの上と蒸留トークンの上にあります。その場合、[CLS] 予測は
  head は、head の予測とグラウンド トゥルース ラベル間の通常のクロスエントロピーを使用してトレーニングされます。
  蒸留予測ヘッドは、硬蒸留 (予測と予測の間のクロスエントロピー) を使用してトレーニングされます。
  蒸留ヘッドと教師が予測したラベル）。推論時に、平均予測を取得します。
  最終的な予測として両頭の間で。 (2) は「蒸留による微調整」とも呼ばれます。
  下流のデータセットですでに微調整されている教師。モデル的には (1) に相当します。
  [DeiTForImageClassification](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTForImageClassification) と (2) に対応します。
  [DeiTForImageClassificationWithTeacher](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTForImageClassificationWithTeacher)。
- 著者らは (2) についてもソフト蒸留を試みたことに注意してください (この場合、蒸留予測ヘッドは
  教師のソフトマックス出力に一致するように KL ダイバージェンスを使用してトレーニングしました）が、ハード蒸留が最良の結果をもたらしました。
- リリースされたすべてのチェックポイントは、ImageNet-1k のみで事前トレーニングおよび微調整されました。外部データは使用されませんでした。これは
  JFT-300M データセット/Imagenet-21k などの外部データを使用した元の ViT モデルとは対照的です。
  事前トレーニング。
- DeiT の作者は、より効率的にトレーニングされた ViT モデルもリリースしました。これは、直接プラグインできます。
  `ViTModel` または `ViTForImageClassification`。データなどのテクニック
  はるかに大規模なデータセットでのトレーニングをシミュレートするために、拡張、最適化、正則化が使用されました。
  (ただし、事前トレーニングには ImageNet-1k のみを使用します)。 4 つのバリエーション (3 つの異なるサイズ) が利用可能です。
  *facebook/deit-tiny-patch16-224*、*facebook/deit-small-patch16-224*、*facebook/deit-base-patch16-224* および
  *facebook/deit-base-patch16-384*。以下を行うには [DeiTImageProcessor](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTImageProcessor) を使用する必要があることに注意してください。
  モデル用の画像を準備します。

## Resources

DeiT を始めるのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示されている) リソースのリスト。

- [DeiTForImageClassification](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTForImageClassification) は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb)。
- 参照: [画像分類タスク ガイド](../tasks/image_classification)

それに加えて:

- [DeiTForMaskedImageModeling](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTForMaskedImageModeling) は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining) でサポートされています。

ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。

## DeiTConfig[[transformers.DeiTConfig]]

#### transformers.DeiTConfig[[transformers.DeiTConfig]]

[Source](https://github.com/huggingface/transformers/blob/v5.5.1/src/transformers/models/deit/configuration_deit.py#L24)

This is the configuration class to store the configuration of a DeiTModel. It is used to instantiate a Deit
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the [facebook/deit-base-distilled-patch16-224](https://huggingface.co/facebook/deit-base-distilled-patch16-224)

Configuration objects inherit from [PreTrainedConfig](/docs/transformers/v5.5.1/ja/main_classes/configuration#transformers.PreTrainedConfig) and can be used to control the model outputs. Read the
documentation from [PreTrainedConfig](/docs/transformers/v5.5.1/ja/main_classes/configuration#transformers.PreTrainedConfig) for more information.

Example:

```python
>>> from transformers import DeiTConfig, DeiTModel

>>> # Initializing a DeiT deit-base-distilled-patch16-224 style configuration
>>> configuration = DeiTConfig()

>>> # Initializing a model (with random weights) from the deit-base-distilled-patch16-224 style configuration
>>> model = DeiTModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```

**Parameters:**

hidden_size (`int`, *optional*, defaults to `768`) : Dimension of the hidden representations.

num_hidden_layers (`int`, *optional*, defaults to `12`) : Number of hidden layers in the Transformer decoder.

num_attention_heads (`int`, *optional*, defaults to `12`) : Number of attention heads for each attention layer in the Transformer decoder.

intermediate_size (`int`, *optional*, defaults to `3072`) : Dimension of the MLP representations.

hidden_act (`str`, *optional*, defaults to `gelu`) : The non-linear activation function (function or string) in the decoder. For example, `"gelu"`, `"relu"`, `"silu"`, etc.

hidden_dropout_prob (`Union[float, int]`, *optional*, defaults to `0.0`) : The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (`Union[float, int]`, *optional*, defaults to `0.0`) : The dropout ratio for the attention probabilities.

initializer_range (`float`, *optional*, defaults to `0.02`) : The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (`float`, *optional*, defaults to `1e-12`) : The epsilon used by the layer normalization layers.

image_size (`Union[int, list[int], tuple[int, int]]`, *optional*, defaults to `224`) : The size (resolution) of each image.

patch_size (`Union[int, list[int], tuple[int, int]]`, *optional*, defaults to `16`) : The size (resolution) of each patch.

num_channels (`int`, *optional*, defaults to `3`) : The number of input channels.

qkv_bias (`bool`, *optional*, defaults to `True`) : Whether to add a bias to the queries, keys and values.

encoder_stride (`int`, *optional*, defaults to 16) : Factor to increase the spatial resolution by in the decoder head for masked image modeling.

pooler_output_size (`int`, *optional*) : Dimensionality of the pooler layer. If None, defaults to `hidden_size`.

pooler_act (`str`, *optional*, defaults to `"tanh"`) : The activation function to be used by the pooler.

## DeiTImageProcessor[[transformers.DeiTImageProcessor]]

#### transformers.DeiTImageProcessor[[transformers.DeiTImageProcessor]]

[Source](https://github.com/huggingface/transformers/blob/v5.5.1/src/transformers/models/deit/image_processing_deit.py#L22)

Constructs a DeiTImageProcessor image processor.

preprocesstransformers.DeiTImageProcessor.preprocesshttps://github.com/huggingface/transformers/blob/v5.5.1/src/transformers/image_processing_utils.py#L382[{"name": "images", "val": ": typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]"}, {"name": "*args", "val": ""}, {"name": "**kwargs", "val": ": typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs]"}]- **images** (`Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list[PIL.Image.Image], list[numpy.ndarray], list[torch.Tensor]]`) --
  Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
  passing in images with pixel values between 0 and 1, set `do_rescale=False`.
- **return_tensors** (`str` or [TensorType](/docs/transformers/v5.5.1/ja/internal/file_utils#transformers.TensorType), *optional*) --
  Returns stacked tensors if set to `'pt'`, otherwise returns a list of tensors.
- ****kwargs** (`ImagesKwargs`, *optional*) --
  Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class
  for the complete list of supported arguments.0`~image_processing_base.BatchFeature`- **data** (`dict`) -- Dictionary of lists/arrays/tensors returned by the __call__ method ('pixel_values', etc.).
- **tensor_type** (`Union[None, str, TensorType]`, *optional*) -- You can give a tensor_type here to convert the lists of integers in PyTorch/Numpy Tensors at
  initialization.

**Parameters:**

- ****kwargs** (`ImagesKwargs`, *optional*) : Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.

**Returns:**

``~image_processing_base.BatchFeature``

- **data** (`dict`) -- Dictionary of lists/arrays/tensors returned by the __call__ method ('pixel_values', etc.).
- **tensor_type** (`Union[None, str, TensorType]`, *optional*) -- You can give a tensor_type here to convert the lists of integers in PyTorch/Numpy Tensors at
  initialization.

## DeiTImageProcessorFast[[transformers.DeiTImageProcessor]]

#### transformers.DeiTImageProcessor[[transformers.DeiTImageProcessor]]

[Source](https://github.com/huggingface/transformers/blob/v5.5.1/src/transformers/models/deit/image_processing_deit.py#L22)

Constructs a DeiTImageProcessor image processor.

preprocesstransformers.DeiTImageProcessor.preprocesshttps://github.com/huggingface/transformers/blob/v5.5.1/src/transformers/image_processing_utils.py#L382[{"name": "images", "val": ": typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]"}, {"name": "*args", "val": ""}, {"name": "**kwargs", "val": ": typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs]"}]- **images** (`Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list[PIL.Image.Image], list[numpy.ndarray], list[torch.Tensor]]`) --
  Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
  passing in images with pixel values between 0 and 1, set `do_rescale=False`.
- **return_tensors** (`str` or [TensorType](/docs/transformers/v5.5.1/ja/internal/file_utils#transformers.TensorType), *optional*) --
  Returns stacked tensors if set to `'pt'`, otherwise returns a list of tensors.
- ****kwargs** (`ImagesKwargs`, *optional*) --
  Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class
  for the complete list of supported arguments.0`~image_processing_base.BatchFeature`- **data** (`dict`) -- Dictionary of lists/arrays/tensors returned by the __call__ method ('pixel_values', etc.).
- **tensor_type** (`Union[None, str, TensorType]`, *optional*) -- You can give a tensor_type here to convert the lists of integers in PyTorch/Numpy Tensors at
  initialization.

**Parameters:**

- ****kwargs** (`ImagesKwargs`, *optional*) : Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.

**Returns:**

``~image_processing_base.BatchFeature``

- **data** (`dict`) -- Dictionary of lists/arrays/tensors returned by the __call__ method ('pixel_values', etc.).
- **tensor_type** (`Union[None, str, TensorType]`, *optional*) -- You can give a tensor_type here to convert the lists of integers in PyTorch/Numpy Tensors at
  initialization.

## DeiTModel[[transformers.DeiTModel]]

#### transformers.DeiTModel[[transformers.DeiTModel]]

[Source](https://github.com/huggingface/transformers/blob/v5.5.1/src/transformers/models/deit/modeling_deit.py#L404)

The bare Deit Model outputting raw hidden-states without any specific head on top.

This model inherits from [PreTrainedModel](/docs/transformers/v5.5.1/ja/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the
library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
and behavior.

forwardtransformers.DeiTModel.forwardhttps://github.com/huggingface/transformers/blob/v5.5.1/src/transformers/models/deit/modeling_deit.py#L427[{"name": "pixel_values", "val": ": torch.Tensor | None = None"}, {"name": "bool_masked_pos", "val": ": torch.BoolTensor | None = None"}, {"name": "interpolate_pos_encoding", "val": ": bool = False"}, {"name": "**kwargs", "val": ": typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]"}]- **pixel_values** (`torch.Tensor` of shape `(batch_size, num_channels, image_size, image_size)`, *optional*) --
  The tensors corresponding to the input images. Pixel values can be obtained using
  [DeiTImageProcessor](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTImageProcessor). See `DeiTImageProcessor.__call__()` for details (`processor_class` uses
  [DeiTImageProcessor](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTImageProcessor) for processing images).
- **bool_masked_pos** (`torch.BoolTensor` of shape `(batch_size, num_patches)`, *optional*) --
  Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).
- **interpolate_pos_encoding** (`bool`, *optional*, defaults to `False`) --
  Whether to interpolate the pre-trained position encodings.0[BaseModelOutputWithPooling](/docs/transformers/v5.5.1/ja/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or `tuple(torch.FloatTensor)`A [BaseModelOutputWithPooling](/docs/transformers/v5.5.1/ja/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([DeiTConfig](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTConfig)) and inputs.
The [DeiTModel](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTModel) forward method, overrides the `__call__` special method.

Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.

- **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) -- Sequence of hidden-states at the output of the last layer of the model.
- **pooler_output** (`torch.FloatTensor` of shape `(batch_size, hidden_size)`) -- Last layer hidden-state of the first token of the sequence (classification token) after further processing
  through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns
  the classification token after processing through a linear layer and a tanh activation function. The linear
  layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
  one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

  Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
  sequence_length)`.

  Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
  heads.

Example:

```python
```

**Parameters:**

config ([DeiTConfig](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTConfig)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/v5.5.1/ja/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.

add_pooling_layer (`bool`, *optional*, defaults to `True`) : Whether to add a pooling layer

use_mask_token (`bool`, *optional*, defaults to `False`) : Whether to use a mask token for masked image modeling.

**Returns:**

`[BaseModelOutputWithPooling](/docs/transformers/v5.5.1/ja/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or `tuple(torch.FloatTensor)``

A [BaseModelOutputWithPooling](/docs/transformers/v5.5.1/ja/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([DeiTConfig](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTConfig)) and inputs.

## DeiTForMaskedImageModeling[[transformers.DeiTForMaskedImageModeling]]

#### transformers.DeiTForMaskedImageModeling[[transformers.DeiTForMaskedImageModeling]]

[Source](https://github.com/huggingface/transformers/blob/v5.5.1/src/transformers/models/deit/modeling_deit.py#L493)

DeiT Model with a decoder on top for masked image modeling, as proposed in [SimMIM](https://huggingface.co/papers/2111.09886).

Note that we provide a script to pre-train this model on custom data in our [examples
directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).

This model inherits from [PreTrainedModel](/docs/transformers/v5.5.1/ja/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the
library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
and behavior.

forwardtransformers.DeiTForMaskedImageModeling.forwardhttps://github.com/huggingface/transformers/blob/v5.5.1/src/transformers/models/deit/modeling_deit.py#L511[{"name": "pixel_values", "val": ": torch.Tensor | None = None"}, {"name": "bool_masked_pos", "val": ": torch.BoolTensor | None = None"}, {"name": "interpolate_pos_encoding", "val": ": bool = False"}, {"name": "**kwargs", "val": ": typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]"}]- **pixel_values** (`torch.Tensor` of shape `(batch_size, num_channels, image_size, image_size)`, *optional*) --
  The tensors corresponding to the input images. Pixel values can be obtained using
  [DeiTImageProcessor](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTImageProcessor). See `DeiTImageProcessor.__call__()` for details (`processor_class` uses
  [DeiTImageProcessor](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTImageProcessor) for processing images).
- **bool_masked_pos** (`torch.BoolTensor` of shape `(batch_size, num_patches)`) --
  Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).
- **interpolate_pos_encoding** (`bool`, *optional*, defaults to `False`) --
  Whether to interpolate the pre-trained position encodings.0`MaskedImageModelingOutput` or `tuple(torch.FloatTensor)`A `MaskedImageModelingOutput` or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([DeiTConfig](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTConfig)) and inputs.
The [DeiTForMaskedImageModeling](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTForMaskedImageModeling) forward method, overrides the `__call__` special method.

Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.

- **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `bool_masked_pos` is provided) -- Reconstruction loss.
- **reconstruction** (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`) -- Reconstructed / completed images.
- **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or
  when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
  one for the output of each stage) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states
  (also called feature maps) of the model at the output of each stage.
- **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when
  `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, patch_size,
  sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
  the self-attention heads.

Examples:
```python
>>> from transformers import AutoImageProcessor, DeiTForMaskedImageModeling
>>> import torch
>>> from PIL import Image
>>> import httpx
>>> from io import BytesIO

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> with httpx.stream("GET", url) as response:
...     image = Image.open(BytesIO(response.read()))

>>> image_processor = AutoImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224")
>>> model = DeiTForMaskedImageModeling.from_pretrained("facebook/deit-base-distilled-patch16-224")

>>> num_patches = (model.config.image_size // model.config.patch_size) ** 2
>>> pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
>>> # create random boolean mask of shape (batch_size, num_patches)
>>> bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()

>>> outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
>>> loss, reconstructed_pixel_values = outputs.loss, outputs.reconstruction
>>> list(reconstructed_pixel_values.shape)
[1, 3, 224, 224]
```

**Parameters:**

config ([DeiTConfig](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTConfig)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/v5.5.1/ja/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.

**Returns:**

``MaskedImageModelingOutput` or `tuple(torch.FloatTensor)``

A `MaskedImageModelingOutput` or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([DeiTConfig](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTConfig)) and inputs.

## DeiTForImageClassification[[transformers.DeiTForImageClassification]]

#### transformers.DeiTForImageClassification[[transformers.DeiTForImageClassification]]

[Source](https://github.com/huggingface/transformers/blob/v5.5.1/src/transformers/models/deit/modeling_deit.py#L595)

DeiT Model transformer with an image classification head on top (a linear layer on top of the final hidden state of
the [CLS] token) e.g. for ImageNet.

This model inherits from [PreTrainedModel](/docs/transformers/v5.5.1/ja/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the
library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
and behavior.

forwardtransformers.DeiTForImageClassification.forwardhttps://github.com/huggingface/transformers/blob/v5.5.1/src/transformers/models/deit/modeling_deit.py#L608[{"name": "pixel_values", "val": ": torch.Tensor | None = None"}, {"name": "labels", "val": ": torch.Tensor | None = None"}, {"name": "interpolate_pos_encoding", "val": ": bool = False"}, {"name": "**kwargs", "val": ": typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]"}]- **pixel_values** (`torch.Tensor` of shape `(batch_size, num_channels, image_size, image_size)`, *optional*) --
  The tensors corresponding to the input images. Pixel values can be obtained using
  [DeiTImageProcessor](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTImageProcessor). See `DeiTImageProcessor.__call__()` for details (`processor_class` uses
  [DeiTImageProcessor](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTImageProcessor) for processing images).
- **labels** (`torch.LongTensor` of shape `(batch_size,)`, *optional*) --
  Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
  config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
  `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
- **interpolate_pos_encoding** (`bool`, *optional*, defaults to `False`) --
  Whether to interpolate the pre-trained position encodings.0[ImageClassifierOutput](/docs/transformers/v5.5.1/ja/main_classes/output#transformers.modeling_outputs.ImageClassifierOutput) or `tuple(torch.FloatTensor)`A [ImageClassifierOutput](/docs/transformers/v5.5.1/ja/main_classes/output#transformers.modeling_outputs.ImageClassifierOutput) or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([DeiTConfig](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTConfig)) and inputs.
The [DeiTForImageClassification](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTForImageClassification) forward method, overrides the `__call__` special method.

Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.

- **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) -- Classification (or regression if config.num_labels==1) loss.
- **logits** (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`) -- Classification (or regression if config.num_labels==1) scores (before SoftMax).
- **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
  one for the output of each stage) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states
  (also called feature maps) of the model at the output of each stage.
- **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, patch_size,
  sequence_length)`.

  Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
  heads.

Examples:

```python
>>> from transformers import AutoImageProcessor, DeiTForImageClassification
>>> import torch
>>> from PIL import Image
>>> import httpx
>>> from io import BytesIO

>>> torch.manual_seed(3)
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> with httpx.stream("GET", url) as response:
...     image = Image.open(BytesIO(response.read()))

>>> # note: we are loading a DeiTForImageClassificationWithTeacher from the hub here,
>>> # so the head will be randomly initialized, hence the predictions will be random
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224")
>>> model = DeiTForImageClassification.from_pretrained("facebook/deit-base-distilled-patch16-224")

>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_class_idx = logits.argmax(-1).item()
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])
Predicted class: Polaroid camera, Polaroid Land camera
```

**Parameters:**

config ([DeiTConfig](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTConfig)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/v5.5.1/ja/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.

**Returns:**

`[ImageClassifierOutput](/docs/transformers/v5.5.1/ja/main_classes/output#transformers.modeling_outputs.ImageClassifierOutput) or `tuple(torch.FloatTensor)``

A [ImageClassifierOutput](/docs/transformers/v5.5.1/ja/main_classes/output#transformers.modeling_outputs.ImageClassifierOutput) or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([DeiTConfig](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTConfig)) and inputs.

## DeiTForImageClassificationWithTeacher[[transformers.DeiTForImageClassificationWithTeacher]]

#### transformers.DeiTForImageClassificationWithTeacher[[transformers.DeiTForImageClassificationWithTeacher]]

[Source](https://github.com/huggingface/transformers/blob/v5.5.1/src/transformers/models/deit/modeling_deit.py#L710)

DeiT Model transformer with image classification heads on top (a linear layer on top of the final hidden state of
the [CLS] token and a linear layer on top of the final hidden state of the distillation token) e.g. for ImageNet.

.. warning::

This model supports inference-only. Fine-tuning with distillation (i.e. with a teacher) is not yet
supported.

This model inherits from [PreTrainedModel](/docs/transformers/v5.5.1/ja/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the
library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
and behavior.

forwardtransformers.DeiTForImageClassificationWithTeacher.forwardhttps://github.com/huggingface/transformers/blob/v5.5.1/src/transformers/models/deit/modeling_deit.py#L728[{"name": "pixel_values", "val": ": torch.Tensor | None = None"}, {"name": "interpolate_pos_encoding", "val": ": bool = False"}, {"name": "**kwargs", "val": ": typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]"}]- **pixel_values** (`torch.Tensor` of shape `(batch_size, num_channels, image_size, image_size)`, *optional*) --
  The tensors corresponding to the input images. Pixel values can be obtained using
  [DeiTImageProcessor](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTImageProcessor). See `DeiTImageProcessor.__call__()` for details (`processor_class` uses
  [DeiTImageProcessor](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTImageProcessor) for processing images).
- **interpolate_pos_encoding** (`bool`, *optional*, defaults to `False`) --
  Whether to interpolate the pre-trained position encodings.0`DeiTForImageClassificationWithTeacherOutput` or `tuple(torch.FloatTensor)`A `DeiTForImageClassificationWithTeacherOutput` or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([DeiTConfig](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTConfig)) and inputs.
The [DeiTForImageClassificationWithTeacher](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTForImageClassificationWithTeacher) forward method, overrides the `__call__` special method.

Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.

- **logits** (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`) -- Prediction scores as the average of the cls_logits and distillation logits.
- **cls_logits** (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`) -- Prediction scores of the classification head (i.e. the linear layer on top of the final hidden state of the
  class token).
- **distillation_logits** (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`) -- Prediction scores of the distillation head (i.e. the linear layer on top of the final hidden state of the
  distillation token).
- **hidden_states** (`tuple[torch.FloatTensor]`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
  one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

  Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- **attentions** (`tuple[torch.FloatTensor]`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
  sequence_length)`.

  Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
  heads.

Example:

```python
>>> from transformers import AutoImageProcessor, DeiTForImageClassificationWithTeacher
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224")
>>> model = DeiTForImageClassificationWithTeacher.from_pretrained("facebook/deit-base-distilled-patch16-224")

>>> inputs = image_processor(image, return_tensors="pt")

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
...
```

**Parameters:**

config ([DeiTConfig](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTConfig)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/v5.5.1/ja/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.

**Returns:**

``DeiTForImageClassificationWithTeacherOutput` or `tuple(torch.FloatTensor)``

A `DeiTForImageClassificationWithTeacherOutput` or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([DeiTConfig](/docs/transformers/v5.5.1/ja/model_doc/deit#transformers.DeiTConfig)) and inputs.

