---
license: mit
datasets:
- Darsala/english_georgian_corpora
language:
- ka
- en
metrics:
- comet
- bleu
- chrf
pipeline_tag: translation
tags:
- translation
- Georgian
- NMT
- MT
- encoder-decoder
base_model: bert-base-uncased
model-index:
- name: Georgian-Translation
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: FLORES Test Set
      type: flores
    metrics:
    - type: comet
      value: 0.79
      name: COMET Score
---
# Georgian Translation Model
## Model Description
This is an English-to-Georgian neural machine translation model developed as part of a bachelor's thesis project. The model uses an encoder-decoder architecture with a pretrained BERT encoder and a randomly initialized decoder.
## Architecture
- **Model Type**: Encoder-Decoder
- **Encoder**: Pretrained BERT model (`bert-base-uncased`)
- **Decoder**: Randomly initialized with a custom configuration (see the sketch below)
- **Decoder Tokenizer**: `RichNachos/georgian-corpus-tokenizer-test`
- **Parameters**: 266M total parameters
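
For orientation, the snippet below is a minimal sketch of how such an encoder-decoder can be assembled with `transformers`. The decoder configuration values (e.g. `vocab_size`) are illustrative placeholders, not the exact training configuration.

```python
# Minimal sketch: pretrained BERT encoder + randomly initialized decoder.
# The decoder config values below are illustrative, not the exact setup.
from transformers import BertConfig, BertLMHeadModel, BertModel, EncoderDecoderModel

encoder = BertModel.from_pretrained("bert-base-uncased")

decoder_config = BertConfig(
    vocab_size=50_000,         # assumption: vocab size of the Georgian decoder tokenizer
    is_decoder=True,           # causal decoder
    add_cross_attention=True,  # attend to the encoder's hidden states
)
decoder = BertLMHeadModel(decoder_config)  # weights are randomly initialized

model = EncoderDecoderModel(encoder=encoder, decoder=decoder)

# Generation additionally needs special-token ids from the decoder tokenizer, e.g.:
# model.config.decoder_start_token_id = decoder_tokenizer.cls_token_id
# model.config.pad_token_id = decoder_tokenizer.pad_token_id
```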
## Training Details
- **Training Data**: English-Georgian parallel corpus (see [Darsala/english_georgian_corpora](https://huggingface.co/datasets/Darsala/english_georgian_corpora))
- **Training Duration**: 16 epochs
- **Hardware**: Nvidia A100 80GB
- **Batch Size**: 128, with 2 gradient accumulation steps (effective batch size of 256)
- **Scheduler**: Cosine learning rate scheduler
- **Data Pipeline**: Full data cleaning, preprocessing, and augmentation applied before training (key hyperparameters are sketched below)
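
As a rough guide, the reported settings map onto `Seq2SeqTrainingArguments` as follows; the learning rate and mixed-precision flag are placeholders, since they are not reported in this card.

```python
# Illustrative training arguments matching the reported setup; values
# not stated in this card (learning rate, fp16) are assumptions.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./georgian-translation",
    num_train_epochs=16,
    per_device_train_batch_size=128,
    gradient_accumulation_steps=2,  # effective batch size of 256
    lr_scheduler_type="cosine",
    learning_rate=5e-5,             # placeholder, not reported
    fp16=True,                      # assumption: mixed precision on the A100
    predict_with_generate=True,
)
```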
## Performance
- **COMET Score**: 0.79 (on FLORES test set)
- **Comparison**: Google Translate scores 0.83 and Kona 0.84 on the same test set (see the scoring sketch below)
- **Translation Style**: Produces more literary and natural-sounding Georgian than Google Translate
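
To reproduce this kind of scoring, a COMET checkpoint can be run over (source, hypothesis, reference) triples with the `unbabel-comet` package. The exact checkpoint behind the 0.79 figure is not stated here; the sketch below assumes the Georgian COMET model listed under Related Resources.

```python
# Sketch: scoring translations with a COMET checkpoint (unbabel-comet).
# Assumption: Darsala/georgian_comet is the checkpoint used; swap in
# another (e.g. Unbabel/wmt22-comet-da) if needed.
from comet import download_model, load_from_checkpoint

comet_model = load_from_checkpoint(download_model("Darsala/georgian_comet"))

data = [{
    "src": "Hello, how are you?",    # English source
    "mt":  "გამარჯობა, როგორ ხარ?",   # model output
    "ref": "გამარჯობა, როგორ ხართ?",  # human reference
}]
print(comet_model.predict(data, batch_size=8).system_score)
```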
## Usage
**Important**: This model uses a custom `EncoderDecoderTokenizer` that is included in the repository. You need to download the repo locally to access it.
```python
import re
import sys

import torch
from huggingface_hub import snapshot_download
from transformers import EncoderDecoderModel

# Download the repo to a local folder
path_to_downloaded = snapshot_download(
    repo_id="Darsala/Georgian-Translation",
    local_dir="./Georgian-Translation",
    local_dir_use_symlinks=False,
)

# Add the downloaded folder to the Python path so the custom tokenizer can be imported
sys.path.append(path_to_downloaded)
from encoder_decoder_tokenizer import EncoderDecoderTokenizer

# Load the model and tokenizer from the downloaded folder
model = EncoderDecoderModel.from_pretrained(path_to_downloaded)
tokenizer = EncoderDecoderTokenizer.from_pretrained(path_to_downloaded)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


def translate(
    text: str,
    num_beams: int = 5,
    max_length: int = 256,
) -> str:
    """Translate a single English string with the loaded EncoderDecoderModel."""
    # Normalize input: lowercase (the encoder is uncased) and collapse whitespace
    text = text.lower()
    text = re.sub(r"\s+", " ", text)

    # Tokenize and move tensors to the model's device
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding="longest",
    ).to(device)

    # Beam-search generation
    generated_ids = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        num_beams=num_beams,
        max_length=max_length,
        early_stopping=True,
    )

    output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"English: {text}")
    print(f"Translated: {output}")
    return output


# Example usage
translation = translate("Hello, how are you?")
```
**Note**: Inputs are lowercased and whitespace-normalized before tokenization, matching the uncased BERT encoder.
## Strengths and Limitations
### Strengths
- Produces more literary and natural Georgian translations
- Good performance on general text translation
- Specialized for Georgian language characteristics
### Limitations
- Struggles with proper names and company names
- Has trouble with terms that should be copied verbatim from the English source
- Limited by tokenizer coverage of certain English terms (illustrated below)
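
To see the coverage issue concretely, the encoder's `bert-base-uncased` tokenizer fragments rare English terms into many subword pieces, which makes verbatim copy-through harder. The sample terms below are arbitrary.

```python
# Illustration only: rare terms split into several '##'-prefixed pieces
# under the uncased BERT encoder tokenizer, while common words stay whole.
from transformers import AutoTokenizer

enc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
for term in ["hello", "Kubernetes", "Tbilisi"]:
    print(term, "->", enc_tok.tokenize(term.lower()))
```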
## Demo
Try the model in the interactive demo: [Georgian Translation Space](https://huggingface.co/spaces/Darsala/Georgian-Translation)
## Citation
```bibtex
@mastersthesis{darsalia2025georgian,
  title  = {English Translation Quality Assessment and Computer Translation},
  author = {Luka Darsalia},
  year   = {2025},
  school = {Tbilisi University},
  type   = {Bachelor's Thesis},
  note   = {Computer Science}
}
```
## Related Resources
- **Training Data**: [english_georgian_corpora](https://huggingface.co/datasets/Darsala/english_georgian_corpora)
- **Georgian COMET Model**: [georgian_comet](https://huggingface.co/Darsala/georgian_comet)
- **Evaluation Data**: [georgian_metric_evaluation](https://huggingface.co/datasets/Darsala/georgian_metric_evaluation)