An experiment with attention.

Community Article Published May 23, 2026

Enderchef (Enderchefcoder)

At first I asked myself:

is it possible to replace full attention with something cheaper, while still keeping enough context to generate the right next token?
can a model preserve weak, parallel instructions without explicitly classifying them?
if we compress context into a smaller state, what do we actually lose?

Those questions sound simple, but they hide a subtle problem.

A context window is not just a sequence of tokens. In practice, it often contains several things at once: task instructions, style hints, formatting rules, and the actual content of the request. These are not always “sequential” in the way next-token prediction suggests.

For example:

system: use emojis
user: make me an app

A model might build the app correctly and quietly forget the emoji rule. That is interesting, because the emoji instruction is still part of the context, but its local signal is weak. It applies to the whole response, yet it is not necessarily relevant at each immediate token.

That led me to a different framing.

Instead of asking whether a model can keep all tokens equally well, I wanted to ask this:

can a compressed context state preserve weak early rules while the sequence gets long?
and how does that compare to ordinary attention?

The setup

I built a small reproducible benchmark in Python.

The experiment compares two small language models:

attention: a standard causal Transformer-style attention baseline
compressed: a model that replaces token-to-token attention with a learned compressed memory state made of a few implicit slots

Important detail: the compressed model does not explicitly classify tokens into “rules”, “items”, or “instructions”. The structure stays implicit. It just reads tokens, updates a compact state, and uses that state for future prediction.
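To make "reads tokens, updates a compact state" concrete, here is a minimal sketch of one way such a state could work. It is an illustration, not the model from the benchmark: the class name `SlotMemory`, the sizes `n_slots` and `d_model`, the soft-routing update, and the random matrices standing in for learned weights are all assumptions.

```python
import numpy as np

class SlotMemory:
    """Hypothetical compressed context state: a few slots updated one token at a time."""

    def __init__(self, n_slots=4, d_model=16, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_model)
        # Random matrices stand in for learned parameters (assumption).
        self.w_key = rng.normal(0.0, scale, (d_model, d_model))
        self.w_val = rng.normal(0.0, scale, (d_model, d_model))
        self.init_slots = rng.normal(0.0, scale, (n_slots, d_model))

    def step(self, slots, x):
        """Blend one token embedding x into the slots; no explicit token classes."""
        key = x @ self.w_key                      # shape (d_model,)
        scores = slots @ key                      # shape (n_slots,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                  # soft routing over slots
        value = np.tanh(x @ self.w_val)           # shape (d_model,)
        # Each slot moves toward the new value in proportion to its routing weight.
        return slots + weights[:, None] * (value - slots)

    def encode(self, xs):
        """Read a (seq_len, d_model) sequence serially and return the final state."""
        slots = self.init_slots.copy()
        for x in xs:
            slots = self.step(slots, x)
        return slots

memory = SlotMemory()
short_state = memory.encode(np.random.default_rng(1).normal(size=(64, 16)))
long_state = memory.encode(np.random.default_rng(2).normal(size=(256, 16)))
print(short_state.shape, long_state.shape)
```

The state has the same shape for 64 tokens and for 256 tokens. That fixed size is the whole point of the compression, and also the bottleneck: an early rule only survives if later updates do not overwrite it.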

The dataset is synthetic and hardcoded, which keeps the test clean and reproducible. Each example contains:

two early rules
one item
a long distractor-filled prefix
a target sequence that requires the model to recover the early rules later
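The layout of one example can be sketched as below. This is only an illustration of the structure described above: the token names, the `<recall>` marker, the vocabulary sizes, and the seeded generator are assumptions, not the benchmark's actual hardcoded data.

```python
import random

RULES = ["RULE_A", "RULE_B", "RULE_C", "RULE_D"]    # hypothetical rule tokens
ITEMS = ["ITEM_X", "ITEM_Y", "ITEM_Z"]              # hypothetical item tokens
DISTRACTORS = [f"noise_{i}" for i in range(20)]     # hypothetical filler tokens

def make_example(ctx_len, rng):
    """Build one example: two early rules, one item, distractors, then the target."""
    rule_1, rule_2 = rng.sample(RULES, 2)
    item = rng.choice(ITEMS)
    head = [rule_1, rule_2, item]
    # The target asks the model to reproduce the early rules and the item.
    target = ["<recall>", rule_1, rule_2, item]
    n_fill = ctx_len - len(head) - len(target)
    if n_fill < 0:
        raise ValueError("ctx_len is too short for the rules, item, and target")
    filler = [rng.choice(DISTRACTORS) for _ in range(n_fill)]
    return {"tokens": head + filler + target, "rules": (rule_1, rule_2), "item": item}

rng = random.Random(0)  # a fixed seed keeps the data reproducible
example = make_example(64, rng)
print(example["tokens"][:5], "...", example["tokens"][-4:])
```

Because the rules sit at the very start and the target at the very end, a longer `ctx_len` only adds distractors between them, which is what makes distance the variable under test.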

That means the benchmark is not just measuring general token prediction. It is specifically stressing rule retention over distance.

I ran the same setup across three context lengths:

ctx64
ctx256
ctx1028

The run used CUDA when available.

What I measured

I tracked four things:

validation loss
validation token accuracy
rule retention accuracy
wall-clock training time

Rule retention accuracy is the key metric here. It measures whether the model can still reproduce the early weak constraints later in the sequence.
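A minimal sketch of how such a metric could be computed, next to plain token accuracy for contrast. The function names and the per-token scoring are assumptions; the benchmark may score rules differently (for example, per example instead of per token).

```python
def token_accuracy(predictions, targets):
    """Fraction of all target tokens predicted exactly."""
    pairs = [(p, t) for pred, tgt in zip(predictions, targets) for p, t in zip(pred, tgt)]
    return sum(p == t for p, t in pairs) / len(pairs) if pairs else 0.0

def rule_retention_accuracy(predictions, targets, rule_positions):
    """Fraction of rule tokens predicted exactly, ignoring every other position.

    rule_positions holds, for each example, the target indices where rules appear.
    """
    correct = 0
    total = 0
    for pred, tgt, positions in zip(predictions, targets, rule_positions):
        for pos in positions:
            total += 1
            if pos < len(pred) and pred[pos] == tgt[pos]:
                correct += 1
    return correct / total if total else 0.0

# One hypothetical example where the model forgets the second rule.
targets = [["<recall>", "RULE_A", "RULE_B", "ITEM_X"]]
predictions = [["<recall>", "RULE_A", "RULE_C", "ITEM_X"]]
rule_positions = [[1, 2]]

tok_acc = token_accuracy(predictions, targets)
rule_acc = rule_retention_accuracy(predictions, targets, rule_positions)
print(tok_acc, rule_acc)
```

In this toy case token accuracy is 0.75 while rule retention is 0.5, which shows why the rule metric is tracked separately: a forgotten rule costs little in the average but is exactly the failure the benchmark is looking for.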

[Figure: combined benchmark plot]

Results

The short version:

attention won clearly on quality
attention also won clearly on speed
the compressed model did not show a rule-retention advantage in this setup
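both models lost accuracy as context length increased, and the compressed model stayed well behind at every length (the rule_acc gap was 0.414 at ctx64, 0.223 at ctx256, and 0.229 at ctx1028)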

At ctx64:

attention reached val_acc=0.938 and rule_acc=0.906
compressed reached val_acc=0.699 and rule_acc=0.492

At ctx256:

attention reached val_acc=0.757 and rule_acc=0.581
compressed reached val_acc=0.633 and rule_acc=0.358

At ctx1028:

attention reached val_acc=0.701 and rule_acc=0.492
compressed reached val_acc=0.577 and rule_acc=0.263

The timing result was even more striking.

At ctx1028:

attention finished in about 9.9s
compressed took about 229.4s

So the compressed model was not only less accurate, it was also about 23 times slower at ctx1028 in this implementation.
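The gaps are easier to compare when they are computed directly from the reported numbers. The snippet below only does arithmetic on the figures quoted above; the variable names are mine.

```python
# Numbers copied from the results above: (val_acc, rule_acc) per model.
results = {
    "ctx64":   {"attention": (0.938, 0.906), "compressed": (0.699, 0.492)},
    "ctx256":  {"attention": (0.757, 0.581), "compressed": (0.633, 0.358)},
    "ctx1028": {"attention": (0.701, 0.492), "compressed": (0.577, 0.263)},
}
train_seconds_ctx1028 = {"attention": 9.9, "compressed": 229.4}

rule_gaps = {}
for ctx, row in results.items():
    val_gap = row["attention"][0] - row["compressed"][0]
    rule_gap = row["attention"][1] - row["compressed"][1]
    rule_gaps[ctx] = round(rule_gap, 3)
    print(f"{ctx}: val_acc gap={val_gap:.3f}, rule_acc gap={rule_gap:.3f}")

# Wall-clock ratio at the longest context length.
slowdown = train_seconds_ctx1028["compressed"] / train_seconds_ctx1028["attention"]
print(f"ctx1028 slowdown: {slowdown:.1f}x")
```

Laying the numbers out this way keeps the accuracy comparison and the timing comparison separate, which matters because only the ctx1028 timings are reported here.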

What this means

The main lesson is not “compression is bad”.

The lesson is narrower and more useful:

a naive compressed recurrent context state does not automatically beat attention
preserving weak parallel instructions is harder than just keeping a rolling summary
full attention is still very strong, even when the task is specifically designed to make early rules matter

There is also an implementation lesson.

The compressed model updates memory step by step in sequence order, and each step has to wait for the previous one. That creates a long chain of serial work. The attention baseline does more arithmetic in total, but it benefits from highly optimized parallel kernels. So some of the speed gap is architectural, and some of it is just implementation reality.
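The structural difference can be sketched in a few lines of NumPy. Both functions are toy stand-ins, not the benchmark's models: `causal_attention` is a plain single-head version without learned projections, and `recurrent_summary` with its `decay` parameter is a hypothetical rolling state chosen only to show the dependency between steps.

```python
import numpy as np

def causal_attention(queries, keys, values):
    """Single-head causal attention for a whole sequence in a few matrix ops."""
    seq_len, d = queries.shape
    scores = (queries @ keys.T) / np.sqrt(d)            # (seq_len, seq_len) work
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)          # hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values                             # every position at once

def recurrent_summary(x, decay=0.9):
    """Toy rolling state: step t needs the state from step t - 1."""
    state = np.zeros(x.shape[1])
    states = []
    for token in x:                                     # seq_len dependent steps
        state = decay * state + (1.0 - decay) * token
        states.append(state)
    return np.stack(states)

x = np.random.default_rng(0).normal(size=(8, 4))
attn_out = causal_attention(x, x, x)
recur_out = recurrent_summary(x)
print(attn_out.shape, recur_out.shape)
```

The attention function has no Python-level loop over positions, so a GPU can compute all rows together. The recurrent function cannot start step t until step t - 1 is done, so at ctx1028 it issues on the order of a thousand small dependent updates per sequence unless the recurrence is rewritten in a parallel form.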

Still, even ignoring speed, the quality gap matters. In this benchmark, compression did not preserve the weak early rules better than attention. It preserved them worse.

Why this experiment still matters

I still think the motivating idea is interesting.

The experiment started from a real intuition:

context is not just a flat sequence
some instructions are globally relevant but locally weak
a good alternative to attention would need to preserve those signals without keeping every token exactly

That intuition remains valid.

What changed is this: my first compressed block was too simple to do that well.

It formed a bottleneck, but not a smart enough one.

So this experiment did not disprove the broader idea. It only showed that this particular compressed-memory replacement is not yet competitive.

My takeaway

If we want to replace or relax attention, the replacement probably needs all of these:

a better way to preserve weak long-range constraints
a more parallel implementation
a more selective memory update rule
a benchmark where exact rule retention is measured directly, not hidden inside average loss

In other words, the next step is not “compress harder”.

The next step is:

compress more carefully

Closing thought

This was a small experiment, not a claim that attention is solved or obsolete. But it was useful because it made the question concrete.

Instead of talking abstractly about “efficient context”, I now have a clearer picture:

attention is expensive, but extremely effective
compressed context is not enough by itself
weak parallel instructions are a real stress test
if we want cheaper context mechanisms, they need to preserve global obligations without collapsing into a vague summary

That is a much better place to continue from.