Transformers

Join the Hugging Face community

and get access to the augmented documentation experience

Collaborate on models, datasets and Spaces

Faster examples with accelerated inference

Switch between documentation themes

to get started

BertJapanese

Overview

BERT モデルは日本語テキストでトレーニングされました。

2 つの異なるトークン化方法を備えたモデルがあります。

MeCab と WordPiece を使用してトークン化します。これには、MeCab のラッパーである fugashi という追加の依存関係が必要です。
文字にトークン化します。

MecabTokenizer を使用するには、pip installTransformers["ja"] (または、インストールする場合は pip install -e .["ja"]) する必要があります。ソースから）依存関係をインストールします。

cl-tohakuリポジトリの詳細を参照してください。

MeCab および WordPiece トークン化でモデルを使用する例:

>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

>>> ## Input Japanese Text
>>> line = "吾輩は猫である。"

>>> inputs = tokenizer(line, return_tensors="pt")

>>> print(tokenizer.decode(inputs["input_ids"][0]))
[CLS] 吾輩 は 猫 で ある 。 [SEP]

>>> outputs = bertjapanese(**inputs)

文字トークン化を使用したモデルの使用例:

>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-char")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")

>>> ## Input Japanese Text
>>> line = "吾輩は猫である。"

>>> inputs = tokenizer(line, return_tensors="pt")

>>> print(tokenizer.decode(inputs["input_ids"][0]))
[CLS] 吾 輩 は 猫 で あ る 。 [SEP]

>>> outputs = bertjapanese(**inputs)

この実装はトークン化方法を除いて BERT と同じです。その他の使用例については、BERT のドキュメントを参照してください。

このモデルはcl-tohakuから提供されました。

BertJapaneseTokenizer

class transformers.BertJapaneseTokenizer

< source >

( vocab_file spm_file = None do_lower_case = False do_word_tokenize = True do_subword_tokenize = True word_tokenizer_type = 'basic' subword_tokenizer_type = 'wordpiece' never_split = None unk_token = '[UNK]' sep_token = '[SEP]' pad_token = '[PAD]' cls_token = '[CLS]' mask_token = '[MASK]' mecab_kwargs = None sudachi_kwargs = None jumanpp_kwargs = None **kwargs )

Parameters

vocab_file (str) — Path to a one-wordpiece-per-line vocabulary file.
spm_file (str, optional) — Path to SentencePiece file (generally has a .spm or .model extension) that contains the vocabulary.
do_lower_case (bool, optional, defaults to True) — Whether to lower case the input. Only has an effect when do_basic_tokenize=True.
do_word_tokenize (bool, optional, defaults to True) — Whether to do word tokenization.
do_subword_tokenize (bool, optional, defaults to True) — Whether to do subword tokenization.
word_tokenizer_type (str, optional, defaults to "basic") — Type of word tokenizer. Choose from [“basic”, “mecab”, “sudachi”, “jumanpp”].
subword_tokenizer_type (str, optional, defaults to "wordpiece") — Type of subword tokenizer. Choose from [“wordpiece”, “character”, “sentencepiece”,].
mecab_kwargs (dict, optional) — Dictionary passed to the MecabTokenizer constructor.
sudachi_kwargs (dict, optional) — Dictionary passed to the SudachiTokenizer constructor.
jumanpp_kwargs (dict, optional) — Dictionary passed to the JumanppTokenizer constructor.

Construct a BERT tokenizer for Japanese text.

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to: this superclass for more information regarding those methods.

convert_tokens_to_string

< source >

( tokens )

Converts a sequence of tokens (string) in a single string.

Update on GitHub

←BertGeneration Bertweet→