---
license: mit
datasets:
- Darsala/english_georgian_corpora
language:
- ka
- en
metrics:
- comet
- bleu
- chrf
pipeline_tag: translation
tags:
- translation
- Georgian
- NMT
- MT
- encoder-decoder
model-index:
- name: Georgian-Translation
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: FLORES Test Set
      type: flores
    metrics:
    - type: comet
      value: 0.79
      name: COMET Score
base_model: bert-base-uncased
---

# Georgian Translation Model

## Model Description

This is an English-to-Georgian neural machine translation model developed as part of a bachelor's thesis project. It uses an encoder-decoder architecture that pairs a pretrained BERT encoder (`bert-base-uncased`) with a randomly initialized decoder.

## Architecture

- **Model Type**: Encoder-Decoder
- **Encoder**: Pretrained BERT model (`bert-base-uncased`)
- **Decoder**: Randomly initialized with custom configuration
- **Decoder Tokenizer**: `RichNachos/georgian-corpus-tokenizer-test`
- **Parameters**: 266M total parameters
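
The pairing described above can be sketched with the `transformers` `EncoderDecoderModel` class. This is a minimal illustration and not the thesis training code: the decoder class (`BertLMHeadModel`), the tiny layer sizes, and the `vocab_size` value are all assumptions made so the snippet runs offline, since the actual decoder configuration is not documented here.

```python
from transformers import BertConfig, BertLMHeadModel, BertModel, EncoderDecoderModel

# Encoder: a tiny randomly initialized stand-in so the sketch runs offline.
# For the setup described above you would instead load pretrained weights,
# e.g. BertModel.from_pretrained("bert-base-uncased").
encoder_config = BertConfig(
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
)
encoder = BertModel(encoder_config)

# Decoder: built from a config only, so its weights are randomly initialized.
# vocab_size is a placeholder for the Georgian tokenizer's vocabulary size.
decoder_config = BertConfig(
    vocab_size=1000,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    is_decoder=True,           # causal self-attention
    add_cross_attention=True,  # attend to the encoder outputs
)
decoder = BertLMHeadModel(decoder_config)

# Combine both halves into one sequence-to-sequence model.
model = EncoderDecoderModel(encoder=encoder, decoder=decoder)

num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.2f}M parameters")
```

Before training a model assembled this way, `model.config.decoder_start_token_id` and `model.config.pad_token_id` also need to be set from the decoder tokenizer.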

## Training Details

- **Training Data**: English-Georgian parallel corpus (see [Darsala/english_georgian_corpora](https://huggingface.co/datasets/Darsala/english_georgian_corpora))
- **Training Duration**: 16 epochs
- **Hardware**: Nvidia A100 80GB
- **Batch Size**: 128 with 2 gradient accumulation steps
- **Scheduler**: Cosine learning rate scheduler
- **Training Pipeline**: Complete data cleaning, preprocessing, and augmentation pipeline
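
To make the batch size and scheduler settings concrete, here is a small pure-Python sketch of a cosine learning rate schedule. The peak learning rate, warmup length, and step count are hypothetical values (none are stated above), and the effective batch size assumes that 128 is the per-step batch size before accumulation.

```python
import math


def cosine_lr(step, total_steps, peak_lr, warmup_steps=0, min_lr=0.0):
    """Learning rate at `step` with optional linear warmup, then cosine decay."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear warmup from peak_lr / warmup_steps up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Assumption: 128 samples per step, accumulated over 2 steps per optimizer update.
PER_STEP_BATCH = 128
GRAD_ACCUM_STEPS = 2
effective_batch = PER_STEP_BATCH * GRAD_ACCUM_STEPS

# Hypothetical values for illustration only.
PEAK_LR = 5e-5
TOTAL_STEPS = 1000

for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, TOTAL_STEPS, PEAK_LR))
```

The schedule starts at the peak rate, reaches half of it at the midpoint, and decays to `min_lr` at the final step.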

## Performance

- **COMET Score**: 0.79 (on FLORES test set)
- **Comparison**: Google Translate (0.83) and Kona (0.84) score higher on the same dataset
- **Translation Style**: More literary and natural Georgian than Google Translate

## Usage

**Important**: This model uses a custom `EncoderDecoderTokenizer` that ships with the repository rather than with `transformers`. Download the repository locally so the class can be imported.

```python
import sys
from transformers import EncoderDecoderModel
import torch
import re
from huggingface_hub import snapshot_download

# Download the repo to a local folder
# (local_dir_use_symlinks is deprecated in recent huggingface_hub releases, so it is omitted)
path_to_downloaded = snapshot_download(
    repo_id="Darsala/Georgian-Translation",
    local_dir="./Georgian-Translation",
)

# Add the downloaded folder to Python path so we can import the custom tokenizer
sys.path.append(path_to_downloaded)
from encoder_decoder_tokenizer import EncoderDecoderTokenizer

# Load the model and tokenizer from the downloaded folder
model = EncoderDecoderModel.from_pretrained(path_to_downloaded)
tokenizer = EncoderDecoderTokenizer.from_pretrained(path_to_downloaded)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def translate(
    text: str,
    num_beams: int = 5,
    max_length: int = 256,
) -> str:
    """
    Translate a single English string to Georgian and return the translation.
    """
    # Lowercase (the base model is bert-base-uncased) and collapse runs of whitespace
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)
    
    # tokenize & move to device
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding="longest"
    ).to(device)
    
    # generation
    generated_ids = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        num_beams=num_beams,
        max_length=max_length,
        early_stopping=True,
    )
    
    output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"English: {text}")
    print(f"Translated: {output}")
    
    return output

# Example usage
translation = translate("Hello, how are you?")
```
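
The input normalization used inside `translate` can also be tried on its own, without downloading the model. The sketch below mirrors the two preprocessing steps from the snippet above; the helper name `normalize_source` is hypothetical and not part of the repository.

```python
import re


def normalize_source(text: str) -> str:
    """Lowercase the input and collapse every run of whitespace to one space."""
    text = text.lower()
    return re.sub(r"\s+", " ", text)


print(normalize_source("Hello,   HOW are\nyou?"))  # hello, how are you?
```

Because the text is lowercased before tokenization, the model never sees the casing of the English input.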

## Strengths and Limitations

### Strengths
- Produces more literary and natural Georgian translations than Google Translate
- Good performance on general text translation
- Specialized for Georgian language characteristics

### Limitations
- Struggles with proper names and company names
- Has trouble with terms that should be copied verbatim from the English source
- Limited by the tokenizer's coverage of certain English terms

## Demo

Try the model in the interactive demo: [Georgian Translation Space](https://huggingface.co/spaces/Darsala/Georgian-Translation)

## Citation

```bibtex
@mastersthesis{darsalia2025georgian,
  title={English Translation Quality Assessment and Computer Translation},
  author={Luka Darsalia},
  year={2025},
  school={Tbilisi University},
  type={Bachelor's Thesis},
  note={Computer Science}
}
```

## Related Resources

- **Training Data**: [english_georgian_corpora](https://huggingface.co/datasets/Darsala/english_georgian_corpora)
- **Georgian COMET Model**: [georgian_comet](https://huggingface.co/Darsala/georgian_comet)
- **Evaluation Data**: [georgian_metric_evaluation](https://huggingface.co/datasets/Darsala/georgian_metric_evaluation)