Turkish Subwords Research
Collection
Collection of models, tokenizers and test sets for the research work "Optimal Turkish Subword Strategies at Scale". The models are experimental. 35 items.
How to use turkish-nlp-suite/bert-2K-minimal with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("feature-extraction", model="turkish-nlp-suite/bert-2K-minimal")

# Load model directly
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("turkish-nlp-suite/bert-2K-minimal")
model = AutoModel.from_pretrained("turkish-nlp-suite/bert-2K-minimal")

This is a BERT model from the Turkish transformer collection of research work Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay.
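The feature-extraction pipeline returns one vector per token, not one vector per sentence. If a single sentence vector is needed, one common option is to average the token vectors (mean pooling). The sketch below is illustrative and does not come from the research work: `mean_pool` and `embed_sentence` are hypothetical helper names, and it assumes the pipeline returns nested Python lists shaped [1][num_tokens][hidden_size] for a single input string.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors into a single fixed-size vector."""
    if not token_vectors:
        raise ValueError("expected at least one token vector")
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    for vec in token_vectors:
        if len(vec) != dim:
            raise ValueError("all token vectors must have the same length")
        for i, value in enumerate(vec):
            sums[i] += value
    return [s / len(token_vectors) for s in sums]


def embed_sentence(pipe, text: str) -> List[float]:
    """Hypothetical helper: run a feature-extraction pipeline and mean-pool.

    Assumption: `pipe(text)` returns nested lists shaped
    [1][num_tokens][hidden_size] for a single input string.
    """
    features = pipe(text)
    return mean_pool(features[0])


# Usage (downloads the model, so it is left commented out):
# from transformers import pipeline
# pipe = pipeline("feature-extraction", model="turkish-nlp-suite/bert-2K-minimal")
# vector = embed_sentence(pipe, "Bugün hava çok güzel.")
```

This simple average also includes the vectors of any special tokens the tokenizer adds, so it is only a starting point for experiments.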
The collection Turkish Subwords Research contains BERT models; the name of this model indicates that it was trained with the wordpiece-2K-minimal tokenizer. The tokenizers come in several vocabulary sizes and are trained on three sizes of corpora: minimal, medium and alldata. The collection contains all the tokenizers, named wordpiece_{vocab-size}k_{corpus size}. For more information, please refer to the research paper.
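The naming pattern above can be spelled out in a few lines. This is a minimal sketch that only builds names from the stated pattern wordpiece_{vocab-size}k_{corpus size}. The helper `tokenizer_name` is hypothetical, the available vocabulary sizes are not listed here, and the exact repository ids on the Hub may differ in casing or separators, so check the collection before loading anything.

```python
# Corpus sizes named in the description above.
CORPUS_SIZES = ("minimal", "medium", "alldata")


def tokenizer_name(vocab_size_k: int, corpus_size: str) -> str:
    """Build a tokenizer name following wordpiece_{vocab-size}k_{corpus size}.

    Hypothetical helper: it formats the name only and does not verify
    that such a tokenizer exists in the collection.
    """
    if vocab_size_k <= 0:
        raise ValueError("vocab_size_k must be a positive number of thousands")
    if corpus_size not in CORPUS_SIZES:
        raise ValueError(f"corpus_size must be one of {CORPUS_SIZES}")
    return f"wordpiece_{vocab_size_k}k_{corpus_size}"


# Example: the tokenizer this model card refers to (2K vocabulary, minimal corpus).
name = tokenizer_name(2, "minimal")
```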
This is not a production model; it is intended for RESEARCH PURPOSES only.