IndicBERT_WOR

Model Description

IndicBERT_WOR is a Telugu sentiment classification model built on IndicBERT (ai4bharat/IndicBERTv2-MLM-only), a multilingual BERT-style Transformer developed by AI4Bharat for Indian languages.

IndicBERT is pretrained on OSCAR and AI4Bharat-curated corpora covering 12 languages, including Telugu and English. It is trained exclusively with the Masked Language Modeling (MLM) objective, focusing on learning high-quality language-specific representations rather than cross-lingual alignment.

The suffix WOR denotes "Without Rationale" supervision: the model is fine-tuned using only sentiment labels and serves as a label-only baseline for Telugu sentiment classification.
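
The sketch below shows minimal inference with this checkpoint. It assumes the model is published under the DSL-13-SRMAP/IndicBERT_WOR repository id and that the fine-tuned classification head stores an id2label mapping; the Telugu example sentence is illustrative only.

# Minimal sketch: Telugu sentiment inference with IndicBERT_WOR.
# Assumptions: the checkpoint is available as DSL-13-SRMAP/IndicBERT_WOR and
# saves an id2label mapping; the example sentence is illustrative only.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "DSL-13-SRMAP/IndicBERT_WOR"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "ఈ సినిమా చాలా బాగుంది"  # "This movie is very good"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
pred_id = int(logits.argmax(dim=-1))
print(model.config.id2label.get(pred_id, str(pred_id)))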


Pretraining Details

  • Pretraining corpora:
    • OSCAR
    • AI4Bharat-curated Indian language corpora
  • Training objective:
    • Masked Language Modeling (MLM); see the fill-mask sketch after this list
  • Language coverage: 12 languages, including Telugu and English
  • Code-mixed text: Not supported
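
As a brief illustration of the MLM objective, the following sketch runs fill-mask with the base checkpoint named above; the Telugu sentence and the exact predictions are illustrative.

# Minimal sketch of the MLM pretraining objective via fill-mask, using the
# base checkpoint named above. The Telugu sentence is illustrative; the top
# predictions for the masked token are printed with their scores.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="ai4bharat/IndicBERTv2-MLM-only")
mask = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT-style tokenizers
for pred in fill_mask(f"హైదరాబాద్ తెలంగాణ రాష్ట్రానికి {mask}."):
    print(pred["token_str"], round(pred["score"], 3))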

Training Data

  • Fine-tuning dataset: Telugu-Dataset
  • Task: Sentiment classification
  • Supervision type: Label-only (no rationale supervision)
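
A minimal label-only fine-tuning sketch is given below. It assumes the Telugu sentiment data is available as CSV files with text and label columns; the number of labels, file paths, and hyperparameters are illustrative, not the exact settings used for this model.

# Minimal label-only fine-tuning sketch (no rationale supervision). Assumptions:
# CSV files with "text" and integer "label" columns stand in for the Telugu
# sentiment dataset; label count, paths, and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base_model = "ai4bharat/IndicBERTv2-MLM-only"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

raw = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)

args = TrainingArguments(
    output_dir="indicbert_wor",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
)
trainer.train()
trainer.save_model("indicbert_wor")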

Intended Use

This model is intended for:

  • Telugu sentiment classification
  • Monolingual Telugu NLP tasks
  • Benchmarking Indian-language-focused models
  • Baseline comparisons in explainability and rationale-supervision studies

IndicBERT_WOR is better suited to monolingual Telugu tasks than to cross-lingual or code-mixed scenarios.


Performance Characteristics

IndicBERT provides tokenization and embeddings tailored to Indian-language text, making it well suited to Telugu sentiment analysis while remaining efficient to fine-tune.

Strengths

  • Strong Telugu-specific representations
  • Faster training compared to large multilingual models
  • Effective for monolingual Telugu sentiment classification

Limitations

  • Not designed for cross-lingual transfer learning
  • Does not support code-mixed data
  • Lacks rationale supervision

Use as a Baseline

IndicBERT_WOR serves as a strong Indian-language baseline for:

  • Comparing general multilingual vs. Indian-language-focused models
  • Evaluating the effect of rationale supervision (WOR vs. WR); see the evaluation sketch after this list
  • Telugu sentiment classification in low-resource settings
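
A minimal evaluation sketch for such comparisons is shown below; the WR model identifier and the test file are hypothetical placeholders, and accuracy and macro-F1 are computed with scikit-learn.

# Minimal evaluation sketch for baseline comparisons, e.g. IndicBERT_WOR against
# a rationale-supervised (WR) counterpart. The WR model id and the test file are
# hypothetical placeholders; labels are assumed to be integer class ids and the
# fine-tuned checkpoints are assumed to emit "LABEL_<id>" strings.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score
from transformers import pipeline

test = pd.read_csv("test.csv")  # expects "text" and "label" columns (assumption)

def predict(model_id, texts):
    clf = pipeline("text-classification", model=model_id)
    preds = clf(texts, truncation=True, batch_size=32)
    return [int(p["label"].split("_")[-1]) for p in preds]

for model_id in ["DSL-13-SRMAP/IndicBERT_WOR", "path/to/IndicBERT_WR"]:
    preds = predict(model_id, test["text"].tolist())
    print(model_id,
          "accuracy:", round(accuracy_score(test["label"], preds), 3),
          "macro-F1:", round(f1_score(test["label"], preds, average="macro"), 3))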
