CPU-1 Ablation Study: Complete Checkpoint Archive

Repo: Cukinator/cpu1-ablation-checkpoints
Unpacked weights: Cukinator/cpu1-ablations-final
Source code: github.com/Cukinator/1.58bits


This repository is the complete checkpoint archive for the CPU-1 ablation study: a systematic, one-component-at-a-time dissection of the design choices behind a 1.58-bit ternary language model optimised for CPU inference.

1346 checkpoint files · 200.4 GB total · 34 run folders

Two checkpoint flavours are stored per run:

| File pattern | Format | Size | Purpose |
| --- | --- | --- | --- |
| checkpoint_<run>_final.pt | compact_2bit (2-bit packed ternary + bf16 scales) | ~3–104 MB | Final inference weights; load directly with load_ablation_checkpoint() |
| checkpoint_<run>_step<N>.pt | bf16 model + bf16 optimizer state | ~20–313 MB | Mid-training resume point |
| checkpoint_<run>_phase2_step<N>.pt | bf16 model + bf16 optimizer state | same | Phase-2 (DeleteGate fine-tune) resume point |
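
To make the compact_2bit format concrete, here is a minimal sketch of how ternary weights can be packed four to a byte. The code mapping (-1 to 0, 0 to 1, +1 to 2), the bit order, and the function names are illustrative assumptions; the layout actually read by load_ablation_checkpoint() may differ, and the bf16 scales are not modelled.

```python
def pack_ternary(values):
    """Pack ternary weights (-1, 0, +1) into bytes, four 2-bit codes per byte."""
    packed = bytearray()
    for start in range(0, len(values), 4):
        byte = 0
        for slot, v in enumerate(values[start:start + 4]):
            if v not in (-1, 0, 1):
                raise ValueError(f"not ternary: {v!r}")
            byte |= (v + 1) << (2 * slot)  # assumed code: -1 -> 0, 0 -> 1, +1 -> 2
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed, count):
    """Inverse of pack_ternary; `count` trims the padding in the last byte."""
    values = []
    for byte in packed:
        for slot in range(4):
            values.append(((byte >> (2 * slot)) & 0b11) - 1)
    return values[:count]


weights = [1, 0, -1, -1, 0, 1]
blob = pack_ternary(weights)
assert unpack_ternary(blob, len(weights)) == weights
print(len(weights), "weights ->", len(blob), "bytes")
```

Two bits per weight instead of sixteen is an 8x reduction, which is consistent with the gap between the _final sizes of the FP16 run_03 (74.1 MB) and the ternary run_04 (9.4 MB).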

Just want to run inference? Use Cukinator/cpu1-ablations-final: plain float32 .pt files, no unpacking needed.


Architecture overview

| Scale | d_model | n_layers | d_ff | n_heads | Params |
| --- | --- | --- | --- | --- | --- |
| 50M (runs 01–10) | 512 | 12 | 1376 | 8 | ~50M |
| 10M (runs 13–16) | 320 | 8 | 853 | 8 | ~10M |

All runs use:

  • 1.58-bit ternary weights (BitLinear / BitEmbedding) by default; FP16 runs are explicit baselines
  • Byte-level patch tokenisation (patch_size=4), except run_01 and run_13, which use BPE
  • Chinchilla-scaled training budget: 2 tok/param (r1), 15 tok/param (r2), dense step checkpoints (r3)
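
For readers new to 1.58-bit weights (log2(3) ≈ 1.58 bits per three-valued weight), the sketch below shows absmean ternary quantisation as described for BitNet b1.58: divide a weight matrix by its mean absolute value, round, and clip to {-1, 0, +1}. It assumes the BitLinear layers here follow that recipe; the function name and the pure-Python list representation are illustrative and not taken from the repository.

```python
def absmean_ternary(weights, eps=1e-8):
    """Quantise a 2-D list of floats to {-1, 0, +1} plus one float scale."""
    flat = [w for row in weights for w in row]
    scale = sum(abs(w) for w in flat) / len(flat)  # mean absolute value
    quantised = [
        [max(-1, min(1, round(w / (scale + eps)))) for w in row]
        for row in weights
    ]
    return quantised, scale


w = [[0.9, -0.05], [-1.2, 0.4]]
q, s = absmean_ternary(w)
print(q, round(s, 4))
```

Multiplying each ternary entry by the scale gives the dequantised approximation of the original matrix.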

50M-parameter ablation chain (d_model=512, n_layers=12)

Each run changes exactly one component relative to its parent (an addition, a removal, or a swap):

run_01  Transformer + BPE 16K + FP16          ← absolute baseline
  │
  ├─ run_02a  + byte patches (no LocalByteDecoder)
  │     └─ run_02  + LocalByteDecoder (MegaByte intra-patch)
  │           └─ run_03  swap Transformer → MLGRU  (FP16)
  │                 └─ run_04  + ternary quantisation (1.58-bit)
  │                       └─ run_05  + FPResidual (CPU-1 core)
  │                             ├─ run_05b  − W_o  (production kernel layout)
  │                             ├─ run_08   swap MLGRU → folded Transformer
  │                             └─ run_06  + BolmoPatchEmbedding
  │                                   └─ run_07  + DeleteGate  ⭐ CPU-1 COMPLETE
  │                                         └─ run_09  + PFNet (pfnet_hidden=32)
  │                                               └─ run_10  + per-channel decay
  │
  ├─ Round 2: run_04_r2, run_07_r2  (same architecture, 15 tok/param)
  └─ Round 3: run_XX_v3/            (dense step-by-step checkpoints)

Round 1: 2 tok/param

| Run | Internal name | Steps saved | Max step | _final size | Step ckpt size | Total | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| run_01 | transformer_bpe_fp16 | 21 | 900 | 104.4 MB | 313.3 MB | 6684 MB | Transformer + BPE 16K vocab + FP16. Absolute baseline. |
| run_02a_byte_only_heads | transformer_byte_fp16_no_lbd | 21 | 640 | 73.5 MB | 220.5 MB | 4703 MB | +Byte patches, 4 independent byte heads (no LocalByteDecoder). Tokenisation isolation. |
| run_02 | transformer_byte_fp16 | 21 | 640 | 74.1 MB | 222.3 MB | 4743 MB | +LocalByteDecoder: MegaByte autoregressive intra-patch chain over run_02a. |
| run_03 | mlgru_byte_fp16 | 21 | 640 | 74.1 MB | 222.4 MB | 4744 MB | Swap Transformer → MLGRU. FP16, byte patches + LocalByteDecoder. |
| run_04 | mlgru_byte_ternary | 21 | 640 | 9.4 MB | 74.2 MB | 1567 MB | +Ternary quantisation (1.58-bit). Isolates BitNet cost on MLGRU+byte. |
| run_05 | mlgru_byte_ternary_fpres | 21 | 640 | 9.7 MB | 74.5 MB | 1574 MB | +FPResidual (low-rank FP16 correction). CPU-1 core architecture. |
| run_05b_kernel_strict | mlgru_kernel_strict | 21 | 600 | 8.9 MB | 68.4 MB | 1446 MB | Branch from run_05: remove W_o to match production C++ OMP kernel layout. |
| run_06 | mlgru_byte_ternary_fpres_bolmo | 21 | 640 | 9.7 MB | 74.5 MB | 1574 MB | +BolmoPatchEmbedding: boundary-aware patch encoding via cross-attention. |
| run_07 | cpu1_complete | 21 | 640 | 9.8 MB | 74.5 MB | 1575 MB | +DeleteGate (real MrT5 gather, ~40% elimination at layer N//2). CPU-1 COMPLETE. ⭐ |
| run_08 | folded_transformer_byte_ternary | 21 | 640 | 9.4 MB | 74.2 MB | 1567 MB | Branch from run_05: swap MLGRU → ternary sliding-window Transformer (window=128). |
| run_09 | cpu1_pfnet | 21 | 640 | 9.9 MB | 75.3 MB | 1591 MB | +PFNet (pfnet_hidden=32): cache-resident nonlinear residual per block. |
| run_10 | cpu1_decay_learned | 21 | 640 | 9.9 MB | 75.3 MB | 1592 MB | +Per-channel learnable decay in MLGRU (RWKV/HGRN2-style). Chain terminus. |
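
The Total column can be sanity-checked from the other columns. The sketch below assumes each Round 1 folder holds the listed number of step checkpoints plus one _final file; that reading is inferred from the table rather than stated by the authors, and the function name is illustrative.

```python
def expected_total_mb(step_files, step_mb, final_mb=0.0):
    """Approximate folder size: N step checkpoints plus an optional _final file."""
    return step_files * step_mb + final_mb


# (step files, step ckpt MB, _final MB, listed total MB) copied from the table above
rows = {
    "run_01": (21, 313.3, 104.4, 6684),
    "run_04": (21, 74.2, 9.4, 1567),
    "run_07": (21, 74.5, 9.8, 1575),
}
for name, (n, step_mb, final_mb, listed) in rows.items():
    estimate = expected_total_mb(n, step_mb, final_mb)
    print(f"{name}: estimated {estimate:.1f} MB, listed {listed} MB")
```

The estimates land within about 1 MB of the listed totals, the difference being rounding in the per-file sizes.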

Round 2: 15 tok/param

| Run | Steps saved | Max step | _final size | Step ckpt size | Total | Description |
| --- | --- | --- | --- | --- | --- | --- |
| run_04_r2 | 21 | 4,840 | 9.4 MB | 74.2 MB | 1567 MB | run_04 architecture at 15 tok/param: ternary MLGRU baseline with real budget. |
| run_07_r2 | 21 | 4,860 | none | 74.5 MB | 1565 MB | run_07 (CPU-1 COMPLETE) at 15 tok/param. No _final: training stopped early. ⭐ |
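
The tok/param budgets translate into token counts by simple multiplication. The sketch below uses the nominal parameter counts from the architecture table (50M and 10M) rather than exact counts, so the figures are approximate; the names are illustrative.

```python
NOMINAL_PARAMS = {"50M": 50_000_000, "10M": 10_000_000}  # from the architecture table
TOKENS_PER_PARAM = {"r1": 2, "r2": 15}                   # Round 1 and Round 2 budgets


def token_budget(scale, round_name):
    """Approximate training tokens for a model scale and training round."""
    return NOMINAL_PARAMS[scale] * TOKENS_PER_PARAM[round_name]


for scale in NOMINAL_PARAMS:
    for round_name in TOKENS_PER_PARAM:
        tokens = token_budget(scale, round_name)
        print(f"{scale} {round_name}: ~{tokens / 1e6:.0f}M tokens")
```

Round 2 is 7.5x the Round 1 budget, in line with the max step for run_04 rising from 640 to 4,840 (about 7.6x). The 10M runs do not show the same ratio (600 versus 1,320 to 1,560 steps), so step counts alone should not be read as token counts.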

Round 3: step-by-step checkpoints (50M)

Dense step-by-step checkpoints from the v3 training run. Each step stores the full bf16 model + bf16 optimizer state; there is no _final.pt.

| Folder | Step checkpoints | Max step | Per-file size | Total size |
| --- | --- | --- | --- | --- |
| run_01_v3/ | 72 | 12,173 | 313.3 MB | 22560 MB |
| run_02_v3/ | 73 | 10,963 | 222.3 MB | 16231 MB |
| run_02a_byte_only_heads_v3/ | 75 | 16,327 | 220.5 MB | 16534 MB |
| run_03_v3/ | 71 | 6,187 | 222.4 MB | 15788 MB |
| run_04_v3/ | 71 | 6,074 | 222.5 MB | 15800 MB |
| run_05_v3/ | 71 | 4,390 | 223.4 MB | 15861 MB |
| run_05b_kernel_strict_v3/ | 71 | 2,573 | 205.3 MB | 14579 MB |
| run_06_v3/ | 71 | 4,237 | 223.4 MB | 15862 MB |
| run_08_v3/ | 71 | 3,858 | 222.5 MB | 15799 MB |

Round 3 total (9 folders): 145.5 GB
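
The round total appears to use 1024 MB per GB. Summing the Total size column above under that assumption reproduces the stated figure, which is useful to know when budgeting disk space.

```python
# Total size per folder in MB, copied from the Round 3 (50M) table above
round3_50m_mb = {
    "run_01_v3": 22560,
    "run_02_v3": 16231,
    "run_02a_byte_only_heads_v3": 16534,
    "run_03_v3": 15788,
    "run_04_v3": 15800,
    "run_05_v3": 15861,
    "run_05b_kernel_strict_v3": 14579,
    "run_06_v3": 15862,
    "run_08_v3": 15799,
}

total_mb = sum(round3_50m_mb.values())
total_gb = total_mb / 1024  # assumption: 1 GB = 1024 MB
print(f"{total_mb} MB = {total_gb:.1f} GB")
```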


10M-parameter runs (d_model=320, n_layers=8)

All use the CPU-1 COMPLETE architecture. Training strategy is the main variable; run_13 additionally swaps byte patches for BPE.

Datasets:

  • run_13, run_14, run_15: Cukinator/cpu1-ablation-dataset (~37.5M tokens, Qwen2.5-3B teacher logprobs + hidden states)
  • run_16: HuggingFaceFW/fineweb directly (CE-only, no teacher; the control)

Round 1: 2 tok/param

| Run | Internal name | Steps saved | Max step | _final size | Step ckpt size | Total | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| run_13 | small_cpu1_bpe | 21 | 600 | 3.2 MB | 71.9 MB | 1513 MB | CPU-1 @ 10M, BPE 4K vocab, distillation from Qwen2.5-3B logprobs. Does BPE beat bytes at 10M? |
| run_14 | small_cpu1_byte | 21 | 600 | 2.8 MB | 20.5 MB | 432 MB | CPU-1 @ 10M, byte-level, distillation (logprobs only). Pure byte baseline. ⭐ |
| run_15 | small_cpu1_byte_hidden | 21 | 600 | 2.9 MB | 20.5 MB | 434 MB | run_14 + EmbeddingAligner (hidden-state distillation). Does aligning hidden reps help? |
| run_16 | small_cpu1_raw_bytes | 21 | 600 | 2.8 MB | 20.5 MB | 432 MB | CPU-1 @ 10M, byte-level, zero teacher (CE-only on FineWeb). Do Qwen logprobs in run_14 actually help? |

Round 2: 15 tok/param

| Run | Steps saved | Max step | _final size | Step ckpt size | Total | Description |
| --- | --- | --- | --- | --- | --- | --- |
| run_13_r2 | 21 | 1,560 | 3.2 MB | 71.9 MB | 1513 MB | run_13 at 15 tok/param: BPE + ternary with sufficient budget. |
| run_14_r2 | 21 | 1,320 | 2.8 MB | 20.5 MB | 432 MB | run_14 at 15 tok/param: byte baseline with sufficient budget. ⭐ |
| run_15_r2 | 21 | 1,340 | 2.9 MB | 20.5 MB | 434 MB | run_15 at 15 tok/param: byte + hidden distillation with sufficient budget. |
| run_16_r2 | 21 | 1,320 | 2.8 MB | 20.5 MB | 432 MB | run_16 at 15 tok/param: zero-teacher bytes with sufficient budget. |

Round 3: step-by-step checkpoints (10M)

Dense step-by-step checkpoints from the v3 training run. Each step stores the full bf16 model + bf16 optimizer state; there is no _final.pt.

| Folder | Step checkpoints | Max step | Per-file size | Total size |
| --- | --- | --- | --- | --- |
| run_13_v3/ | 72 | 9,290 | 71.9 MB | 5175 MB |
| run_14_v3/ | 75 | 9,775 | 61.3 MB | 4597 MB |
| run_15_v3/ | 70 | 2,350 | 61.5 MB | 4307 MB |

Round 3 total (3 folders): 13.7 GB


Quick start

Load a compact_2bit final checkpoint

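import sys

import torch

# Make the training script from github.com/Cukinator/1.58bits importable.
sys.path.insert(0, "/path/to/1.58bits")
from train_ablation_amd import load_ablation_checkpoint, build_ablation_model, generate

# Returns the model state dict and the config needed to rebuild the architecture.
state, config = load_ablation_checkpoint("run_07/checkpoint_run_07_final.pt")
model = build_ablation_model(config)
model.load_state_dict(state, strict=False)  # strict=False tolerates missing or unexpected keys
model.eval()

# Generate 128 new bytes on CPU from a text prompt.
output = generate(model, "Once upon a time", max_new_bytes=128, config=config, device=torch.device("cpu"))
print(output)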

Resume training from a step checkpoint

python train_ablation_amd.py --run run_07 --resume_from run_07/checkpoint_run_07_step640.pt
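
When a folder holds many step files, picking the most recent one by hand is error-prone. The helper below is a hypothetical convenience (not part of train_ablation_amd.py) that parses the checkpoint_<run>_step<N>.pt and checkpoint_<run>_phase2_step<N>.pt file patterns and returns the highest-step file. It assumes phase-2 checkpoints are always later than phase-1 checkpoints.

```python
import re
from pathlib import Path

# Matches checkpoint_<run>_step<N>.pt and checkpoint_<run>_phase2_step<N>.pt
STEP_RE = re.compile(
    r"^checkpoint_(?P<run>.+?)_(?P<phase2>phase2_)?step(?P<step>\d+)\.pt$"
)


def latest_step_checkpoint(folder):
    """Return the Path of the highest-step checkpoint in `folder`, or None."""
    best_key, best_path = None, None
    for path in Path(folder).glob("checkpoint_*.pt"):
        match = STEP_RE.match(path.name)
        if match is None:  # skips *_final.pt and unrelated files
            continue
        # Assumption: phase-2 files sort after every phase-1 file.
        key = (match["phase2"] is not None, int(match["step"]))
        if best_key is None or key > best_key:
            best_key, best_path = key, path
    return best_path
```

Calling latest_step_checkpoint("run_07") on a downloaded folder gives a path that can be passed to --resume_from.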

Download with huggingface_hub

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Cukinator/cpu1-ablation-checkpoints",
    filename="run_07/checkpoint_run_07_final.pt",
    repo_type="model",
)
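
At 200.4 GB, downloading the whole archive is rarely necessary. The sketch below uses snapshot_download with allow_patterns to fetch a single run folder, or only its _final.pt. The helper names are illustrative, and huggingface_hub is imported lazily so the pattern logic can be checked without the package or network access.

```python
from fnmatch import fnmatch

REPO_ID = "Cukinator/cpu1-ablation-checkpoints"


def run_patterns(run, finals_only=False):
    """Glob patterns selecting one run folder, optionally just its _final.pt."""
    return [f"{run}/*_final.pt"] if finals_only else [f"{run}/*"]


def download_run(run, finals_only=False, local_dir=None):
    """Download the matching files and return the local snapshot directory."""
    from huggingface_hub import snapshot_download  # needs `pip install huggingface_hub`

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="model",
        allow_patterns=run_patterns(run, finals_only),
        local_dir=local_dir,
    )


# Offline check of what the finals-only pattern would select.
print(fnmatch("run_07/checkpoint_run_07_final.pt", run_patterns("run_07", True)[0]))
```

Because each pattern includes the trailing slash after the folder name, run_07 does not accidentally pull in run_07_r2.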

Related repositories

| Repo | Contents |
| --- | --- |
| Cukinator/cpu1-ablation-checkpoints | This repo: raw training checkpoints (compact_2bit finals + bf16 step files) |
| Cukinator/cpu1-ablations-final | Unpacked float32 weights, ready for model.load_state_dict() without any helper |
| Cukinator/cpu1-ablation-dataset | Pre-processed training dataset with Qwen2.5-3B teacher logprobs + hidden states |

License

Apache-2.0. See github.com/Cukinator/1.58bits.
