Attention Is Not All You Need - Model Checkpoints

This repository contains the best model checkpoints from the reproduction of the "Attention Is Not What You Need" paper.

Models Included

Wikitext-2 Language Modeling

  • Transformer Models: L=128/256, N=6/12 layers
  • Grassmann Models: L=128/256, N=6/12 layers

SNLI Natural Language Inference

  • Transformer Model: Classification head trained from scratch
  • Grassmann Model: Classification head trained from scratch

Results Summary

Wikitext-2 (Best Validation PPL)

  • Best Transformer: L=256, N=12 → 168.68 PPL
  • Best Grassmann: L=128, N=12 → 244.61 PPL
  • Gap: Grassmann PPL is 45.0% higher (Grassmann underperforms)

SNLI (Test Accuracy)

  • Grassmann: 71.25% accuracy
  • Transformer: 66.71% accuracy
  • Gap: +4.54 percentage points (Grassmann outperforms; see the sanity check below)
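
Both gap figures follow directly from the numbers above; a quick sanity check in plain Python (values copied from this section, no project code needed):

# Wikitext-2: relative PPL gap between the best Grassmann and best Transformer runs
best_transformer_ppl = 168.68
best_grassmann_ppl = 244.61
ppl_gap = (best_grassmann_ppl / best_transformer_ppl - 1) * 100
print(f"Wikitext-2 PPL gap: {ppl_gap:.1f}%")  # ~45.0% higher for Grassmann

# SNLI: absolute accuracy gap in percentage points
grassmann_acc = 71.25
transformer_acc = 66.71
print(f"SNLI accuracy gap: {grassmann_acc - transformer_acc:+.2f} pp")  # +4.54 pp for Grassmann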

Repository Structure

├── grassmann_snli/
│   ├── checkpoints/best.pt
│   ├── snli_test_results.json
│   └── snli_validation_results.json
├── grassmann_wikitext_L128_N6/
│   ├── checkpoints/best.pt
│   ├── results.json
│   └── wikitext_validation_results.json
├── grassmann_wikitext_L128_N12/
│   ├── checkpoints/best.pt
│   ├── results.json
│   └── wikitext_validation_results.json
├── grassmann_wikitext_L256_N6/
│   ├── checkpoints/best.pt
│   ├── results.json
│   └── wikitext_validation_results.json
├── grassmann_wikitext_L256_N12/
│   ├── checkpoints/best.pt
│   ├── results.json
│   └── wikitext_validation_results.json
├── transformer_snli/
│   ├── checkpoints/best.pt
│   ├── snli_test_results.json
│   └── snli_validation_results.json
├── transformer_wikitext_L128_N6/
│   ├── checkpoints/best.pt
│   ├── results.json
│   └── wikitext_validation_results.json
├── transformer_wikitext_L128_N12/
│   ├── checkpoints/best.pt
│   ├── results.json
│   └── wikitext_validation_results.json
├── transformer_wikitext_L256_N6/
│   ├── checkpoints/best.pt
│   ├── results.json
│   └── wikitext_validation_results.json
└── transformer_wikitext_L256_N12/
    ├── checkpoints/best.pt
    ├── results.json
    └── wikitext_validation_results.json
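
Each run directory also ships its metrics as JSON files (results.json, snli_test_results.json, wikitext_validation_results.json, ...). Their exact schema is not documented here, so the safest way to inspect them is to load and print; a minimal sketch using only the paths shown in the tree above:

import json
from pathlib import Path

# Dump the recorded metrics for every run directory (keys vary per file)
for results_path in sorted(Path(".").glob("*/*.json")):
    with open(results_path) as f:
        print(results_path, json.load(f))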

Loading Models

import torch

# Load a checkpoint (map_location="cpu" lets this run without a GPU)
checkpoint = torch.load("grassmann_wikitext_L256_N12/checkpoints/best.pt", map_location="cpu")

# Access model state
model_state = checkpoint['model_state_dict']
epoch = checkpoint['epoch']
val_loss = checkpoint['val_loss']

print(f"Epoch: {epoch}, Val Loss: {val_loss}")

Citation

If you use these models, please cite this reproduction of the original paper:

@misc{attn-is-not-all-you-need-reproduction,
  title={Reproduction of "Attention Is Not What You Need"},
  author={alphaXiv},
  year={2026},
  url={https://github.com/alphaXiv/paper-implementations}
}

Hardware

All models trained on:

  • GPU: NVIDIA H100 SXM5 80GB
  • Platform: Lambda Labs, Lambda Stack 22.04

License

MIT License
