Attention Is Not All You Need - Model Checkpoints

This repository contains the best model checkpoints from the reproduction of "Attention Is Not What You Need" paper.

Models Included

Wikitext-2 Language Modeling

Transformer Models: L128/L256 with N6/N12 layers
Grassmann Models: L128/L256 with N6/N12 layers

SNLI Natural Language Inference

Transformer Model: Classification head trained from scratch
Grassmann Model: Classification head trained from scratch

Results Summary

Wikitext-2 (Best Validation PPL)

Best Transformer: L=256, N=12 → 168.68 PPL
Best Grassmann: L=128, N=12 → 244.61 PPL
Gap: 45.0% (Grassmann underperforms)

SNLI (Test Accuracy)

Grassmann: 71.25% accuracy
Transformer: 66.71% accuracy
Gap: +4.54% (Grassmann outperforms!)

Repository Structure

├── grassmann_snli/
│   ├── checkpoints/best.pt
│   ├── snli_test_results.json
│   └── snli_validation_results.json
├── grassmann_wikitext_L128_N6/
│   ├── checkpoints/best.pt
│   ├── results.json
│   └── wikitext_validation_results.json
├── grassmann_wikitext_L128_N12/
│   ├── checkpoints/best.pt
│   ├── results.json
│   └── wikitext_validation_results.json
├── grassmann_wikitext_L256_N6/
│   ├── checkpoints/best.pt
│   ├── results.json
│   └── wikitext_validation_results.json
├── grassmann_wikitext_L256_N12/
│   ├── checkpoints/best.pt
│   ├── results.json
│   └── wikitext_validation_results.json
├── transformer_snli/
│   ├── checkpoints/best.pt
│   ├── snli_test_results.json
│   └── snli_validation_results.json
├── transformer_wikitext_L128_N6/
│   ├── checkpoints/best.pt
│   ├── results.json
│   └── wikitext_validation_results.json
├── transformer_wikitext_L128_N12/
│   ├── checkpoints/best.pt
│   ├── results.json
│   └── wikitext_validation_results.json
├── transformer_wikitext_L256_N6/
│   ├── checkpoints/best.pt
│   ├── results.json
│   └── wikitext_validation_results.json
└── transformer_wikitext_L256_N12/
    ├── checkpoints/best.pt
    ├── results.json
    └── wikitext_validation_results.json

Loading Models

import torch

# Load a checkpoint
checkpoint = torch.load("grassmann_wikitext_L256_N12/checkpoints/best.pt")

# Access model state
model_state = checkpoint['model_state_dict']
epoch = checkpoint['epoch']
val_loss = checkpoint['val_loss']

print(f"Epoch: {epoch}, Val Loss: {val_loss}")

Citation

If you use these models, please cite the original paper reproduction:

@misc{attn-is-not-all-you-need-reproduction,
  title={Reproduction of "Attention Is Not What You Need"},
  author={alphaXiv},
  year={2026},
  url={https://github.com/alphaXiv/paper-implementations}
}

Hardware

All models trained on:

GPU: NVIDIA H100 SXM5 80GB
Platform: Lambda Labs, Lambda Stack 22.04

License

MIT License

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including alphaXiv/attention-is-not-all-you-need-models

attention-is-not-all-you-need

Collection

3 items • Updated 22 days ago