OpenRA-Bench

Rank	Agent	Type	Status	Opponent	Games	Win Rate (%)	Score	K/D Ratio	Avg Kills	Avg Deaths	Avg Economy	Avg Game Length	Date	Replay
1	qwen/qwen3-coder-next	LLM	Verified	Beginner	1	0	18.3	0.34	1000	2900	9050	27349	2026-02-25
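
As a quick sanity check on the row above, the K/D Ratio column appears to be Avg Kills divided by Avg Deaths. That definition is an assumption (the page does not state it), and `kd_ratio` is a hypothetical helper name; the numbers are the ones from the leaderboard row.

```python
def kd_ratio(avg_kills: float, avg_deaths: float) -> float:
    """Kills per death. ASSUMPTION: the leaderboard computes K/D this way."""
    if avg_deaths == 0:
        # No deaths recorded: treat the ratio as unbounded rather than dividing by zero.
        return float("inf")
    return avg_kills / avg_deaths


# Values taken from the qwen/qwen3-coder-next row: Avg Kills 1000, Avg Deaths 2900.
ratio = kd_ratio(1000, 2900)
print(f"{ratio:.2f}")  # 0.34, matching the K/D Ratio column
```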

ID	Title	Capability	Map	Real-World Meaning	Robotics Analogue	Benchmark Anchor
lh-build-army-coordinate-multifront-attack	Lure the Tiger Off the Mountain — Pull the Strong Defender Off the Lane, Then Run the Main Force Past It	adversarial	mfb-base-1-defend-base-2-build-arena	A treasury-management decision under a deadline. The agent has a productive operation (two harvesters on two near patches feeding a refinery — steady income), starting cash 1500, and a hard win requirement to grow operations (build MORE buildings) WHILE also finishing the episode with a positive operating reserve in the bank (cash ≥ 300). The classic anti-patterns: 1) STALL — preserve every credit, build nothing. The reserve stays safe but operations never grow → the "build more buildings" bar is never met → LOSS. 2) OVER-SPEND — spend everything (and then some, by chaining a 1400cr refinery and any other expensive build) on construction and dip the reserve to zero. Income eventually refills but the window between buildings-met and reserve-refilled may not close before the deadline → LOSS. 3) PURE-HOLD — harvest and sit on the cash forever, never building. Same as STALL on the building-growth axis → LOSS. The intended capability is the SC2 cash-overflow / corporate treasury discipline: spend ENOUGH to keep growing (buildings) while keeping ENOUGH liquid (cash) to weather the next shock — never let the operating reserve hit zero, never sit on idle capital. With a pre-placed 2× harv income engine the model gets ~190 cr/turn of inflow, so any modest spend (1-2 powr at 300cr each) is recovered in a handful of turns and the reserve closes the episode well above 300.	Commit a forward processor to a contested deposit, with convoy security. An autonomous extraction operation is mining a small home deposit at sub-quota throughput; a richer deposit sits across a contested corridor, patrolled by hostile platforms. The forward processor commission ships with a co-deployed extractor on-site, but the extractor is unarmed and the patrol will destroy it within seconds unless the operator dispatches the available defensive platforms ahead of the commission. Stalling at the home throughput rate never clears the daily revenue target; commissioning the processor at home doubles processors but doesn't add a new deposit; commissioning the processor on the contested deposit without convoy security loses the extractor on first contact.	SC2 hard-counter / scout-and-pivot doctrine, game-theoretic best response under composition uncertainty, military RPS (tank-vs-infantry / infantry-vs-rocket / rocket-vs-tank), capability-based defense procurement under uncertainty, PlanBench replanning on new observation
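
The treasury trade-off described in the Real-World Meaning cell above can be sketched as a small cash-flow simulation. The figures (starting cash 1500, roughly 190 cr/turn of income, 300 cr per `powr`, 1400 cr per refinery, reserve floor 300) come from that cell. Everything else is an assumption made for illustration: the function names, the rule that income lands before a purchase within a turn, the rule that an unaffordable build is skipped, and the `min_new_buildings` threshold. This is not the benchmark's actual win predicate.

```python
START_CASH = 1500        # starting cash quoted in the scenario text
INCOME_PER_TURN = 190    # "~190 cr/turn" quoted in the scenario text
RESERVE_FLOOR = 300      # episode must end with cash >= 300
COSTS = {"powr": 300, "proc": 1400}  # the two prices quoted in the scenario text


def simulate(build_plan: dict[int, str], turns: int) -> tuple[int, int]:
    """Return (final_cash, buildings_built) for a plan mapping turn -> building.

    ASSUMPTIONS: income is credited before the purchase on each turn, a build
    is paid in full at once, and a build the agent cannot afford is skipped.
    """
    cash = START_CASH
    built = 0
    for turn in range(turns):
        cash += INCOME_PER_TURN
        item = build_plan.get(turn)
        if item is not None and cash >= COSTS[item]:
            cash -= COSTS[item]
            built += 1
    return cash, built


def wins(build_plan: dict[int, str], turns: int, min_new_buildings: int = 1) -> bool:
    """Hypothetical win check: enough new buildings AND the reserve floor held at the end."""
    cash, built = simulate(build_plan, turns)
    return built >= min_new_buildings and cash >= RESERVE_FLOOR


# STALL / PURE-HOLD: cash piles up but nothing is built.
print(simulate({}, 10), wins({}, 10))                      # (3400, 0) False
# Modest spend: two power plants early, reserve recovers well above the floor.
modest = {0: "powr", 1: "powr"}
print(simulate(modest, 10), wins(modest, 10))              # (2800, 2) True
# OVER-SPEND against a short deadline: buildings are met, reserve is not.
overspend = {0: "proc", 1: "powr"}
print(simulate(overspend, 2), wins(overspend, 2))          # (180, 2) False
```

Under these assumptions the same over-spend plan passes if the deadline is one turn later (`wins(overspend, 3)` is `True`, ending on 370), which mirrors the text's point that the loss comes from the reserve not refilling before the deadline rather than from the spend itself.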

ID	Title	Capability	Map	Real-World Meaning	Robotics Analogue	Benchmark Anchor
action-multiunit-coordination	Multi-Squad Coordination — Drive Several Groups to Different Objectives at Once	action	rush-hour-arena	Multi-region task allocation under a shared deadline is not a planning problem here — the split is obvious. It tests whether the controller can actually drive several effector groups in parallel instead of completing one objective then the next, which serialized control fails to do in time.	Coordinated multi-robot fleet dispatch: a logistics swarm must place distinct sub-teams at separate depots within one delivery window; one-at-a-time control blows the schedule.
action-sequenced-execution	Non-Stop Ordered Route Execution — Run Given Waypoints In Order Without Stalling	action	rush-hour-arena	The route is already planned; the open problem is faithful, non-stalling execution of an ORDERED objective sequence — reaching each waypoint in turn and issuing the next move the instant a leg completes, rather than idling, re-deliberating, skipping a waypoint, or rushing straight to the end.	Manipulator / mobile-base task sequencing: a robot must hit a fixed ORDERED set of stations (pick → transit → place) within a cycle-time budget; pausing, reordering, or skipping a station misses the takt.
adv-asymmetric-weaker-must-win	Asymmetric Underdog — 2 Mediums Must Beat a Stronger Garrison via Flank-Pick	reasoning	generated	Two medium tanks must inflict enough kills on a stronger enemy garrison (a wall of rifle infantry plus a heavy tank on medium and hard) to satisfy a kill bar — without losing the pair. Head-on combat is a decisive LOSS: the heavy out-trades the medium column at close range and the infantry wall holds the pair in place long enough for the heavy to close. The only winning play is asymmetric: stage off-axis, approach the infantry on a flank that keeps the mediums outside the heavy's aggro envelope, pick off the infantry one at a time from the flank, and retreat past the heavy's leash any time it lunges. The skill measured is weaker-side reasoning: which sub-target to commit to (the soft infantry, not the hard heavy), which approach vector keeps the column outside the strongest threat's fire envelope, and when to disengage before being fixed.	Asymmetric / guerrilla doctrine for a force-on-force multi-robot encounter: a numerically and qualitatively weaker team must refuse decisive engagement with the strongest opposing asset, exploit terrain and mobility advantages, and pick off weaker support assets piecemeal where the local force ratio is favourable. The decision under test is approach-vector planning (stay outside the strongest threat's effective radius) and sub-target prioritisation (engage soft targets, not hard ones).	SC2 asymmetric, asymmetric warfare, guerrilla tactics
adv-rps-counter-pick	Scout-then-Counter — Hard Counter Selection Across Seeded Enemy Compositions	reasoning	generated	A FIXED budget funds EITHER an armoured fist (medium tanks 2tnk), OR a rocket squad (e3 anti-tank), OR a rifle company (e1 mass). The enemy archetype ROTATES per seed across four compositions (rifle swarm / heavy tank column / rocket cluster / second rifle swarm), each demanding a DIFFERENT counter. Pre-committing to one composition LOSES on at least one seed. The capability under test is the observation-driven counter SELECTION loop: scout the enemy with the starter jeeps, infer the threat profile, then commit the whole budget to the matching counter. There is no dominant single build — the right answer depends on what the scout reveals.	Threat-driven procurement under uncertainty: a fixed capital budget funds anti-armour platforms OR anti-infantry platforms OR foot soldiers; telemetry from forward scouts must confirm the threat profile BEFORE the commit. Pre-committing to an anti-tank loadout against an infantry threat (or vice-versa) is a cost-per-effect waste that loses the engagement. The decision under test is the read-then-commit loop, not a single memorised build.	SC2 hard-counter / scout-and-pivot doctrine, game-theoretic best response under composition uncertainty, military RPS (tank-vs-infantry / infantry-vs-rocket / rocket-vs-tank), capability-based defense procurement under uncertainty, PlanBench replanning on new observation
adversarial-duel	1v1 Combat Duel — Beat a Reactive Enemy at Close Range	adversarial	generated	A close-quarters force-on-force fight against an enemy that shoots back: the decision is combat micro — concentrate fire on one target, retreat damaged units, trade favourably. The enemy starts just to the east, in or near sight, so this is a DUEL, not a search (finding scattered enemies on a big map is tested separately). Difficulty escalates the opponent, not the search.	Adversarial multi-agent engagement at close range: prevail over a reactive opponent team by target selection and damage trading.
artofwar-indirect-approach	The Long Way Round — Take a Costly Detour Because the Direct Path Is Lethal	reasoning	generated	迂直之计 — "make the devious route the most direct." A LEASHED GUARD line (engine scripted bot 'guard') walls the short west→east lane to the objective: every guard auto-fires in range, lunges at any unit that comes within its aggro radius, and snaps back to post. Driving the column straight down that lane gets the whole force shot apart in a turn. The only winning policy abandons the short axis entirely — climb to the open flank, run the long way around the END of the wall, then turn in — a route whose progress toward the objective stays flat or dips for ~20 turns before it pays off. Tests temporal credit assignment over a long horizon against a greedy shortest-path / "just charge it" bias; idling never commits and loses on the clock.	Hazard-aware long-horizon routing: reject the locally optimal short path through a lethal interdiction zone for a circuitous survivable one (around the obstacle's flank) whose payoff is far delayed.
artofwar-lure-the-tiger	Lure the Tiger Off the Mountain — Pull the Strong Defender Off the Lane, Then Run the Main Force Past It	reasoning	generated	调虎离山 — "lure the tiger off the mountain." A STRONG leashed GUARD line (engine scripted bot 'guard') walls the only lane to the objective: every guard holds its post, auto-fires anything in range, lunges at the nearest foe within its aggro radius, and snaps straight back to its post the instant nothing threatens it. Driving the durable main body straight down the lane runs it into the full line (anti-tank rockets) and it is shot apart before it gets through. The only winning policy commits the fast bait on a DIVERGENT vector that comes just close enough to the line to make a segment of it lunge OUT after the bait — off the lane, toward its leash — then, in that transient window while the tiger is off the mountain, runs the main body through the briefly-open slot to the far objective before the line snaps back. Phase 1 (pulling the tiger off) yields ZERO objective progress and looks locally negative (the bait draws all the fire); only phase 2, through the vacated slot, scores. The distinction from decoy-sacrifice: here the credit problem is DISPLACE-AND-EXPLOIT — the bait must lure the defender off and the main force must exploit the reversible vacancy in time; preserving the force is what is rewarded, not spending a sacrificial decoy. Greedy "just charge the lane" and "do nothing" both fail; a bait that never gets close enough never displaces the tiger.	Two-phase manipulation against a reactive guarding obstacle: a no-reward enabling action (entice a leashed interceptor off the only corridor with a decoy on a divergent vector) must precede and temporally overlap the rewarded objective action (run the payload through the briefly-cleared corridor before the interceptor snaps back).
artofwar-sequenced-citadel	Staged Assault — Reach the Waypoints In Order, Then Seize the Citadel	reasoning	generated	攻其無備 — strike where unprepared, but only after the prerequisite moves. The mission is a STRICT ordered sub-goal chain: stage at A, then transit B, then seize the citadel C — and only after a deliberate hold (the strike is timed, not rushed). Reward lands only at C; A and B are unrewarded prerequisites whose ORDER is enforced (reaching C without having passed A then B counts for nothing). Long-horizon sub-goal sequencing with delayed terminal credit; a greedy beeline to C, or idling, fails.	Ordered multi-waypoint mission (stage → transit → objective) where only terminal success is rewarded, step order is hard-enforced (not merely graded), and the terminal action must wait out a hold — the Blocksworld/GAIA-style long-horizon plan.
build-defensive-skirt-corners	Defense Topology — Skirt the Building with a Pillbox in Every Corner	reasoning	generated	A single high-value building (the fact — your construction yard) sits at the centre of the map and is rushed CONCURRENTLY from all four diagonal corners (NE, NW, SE, SW). Right doctrine: a SKIRT of pillboxes — one planted in EACH of the four corner approaches — so every diagonal axis has its own overlapping field of fire. Wrong doctrine: massing every pillbox on a single corner; that satisfies a naive "we built four defences" count and holds one corner, but the three uncovered corner waves stride untouched into the fact and raze it. The win predicate makes the topology load-bearing: total pbox count is not enough — one pbox must sit inside a radius-4 disc in EVERY corner region, the pillboxes must KILL the rush, AND the fact must survive.	Distributed defense / quadrant coverage: when one central asset is threatened from all bearings, finite defensive capacity must be matched to the cardinality of the threat axes — a strongpoint in every quadrant — not massed on the loudest single approach. The same logic as air-defence sector coverage or a multi-front perimeter: an uncovered sector is an open envelopment lane no matter how heavily the others are defended.	MicroRTS pillbox placement, distributed defense, military quadrant doctrine
build-defensive-tower-cluster	Defense Topology — Cluster Pillboxes Around the High-Value Building	reasoning	generated	A single high-value building (the fact — your construction yard) is the protected asset, and a concentrated enemy band funnelled through a single narrow corridor charges directly at it. Right doctrine: a TIGHT CLUSTER of pillboxes WRAPPING the fact, overlapping fields of fire on the single point that matters. Wrong doctrine: a thin LINE of pillboxes strung between the fact and the corridor mouth — the line satisfies a naive "we built N defences" count but cannot stop a focused rush from reaching the fact, because the firepower is too dispersed to mass on the threat. The win predicate makes the topology decision load-bearing: total pbox count alone is not enough; ≥3 of the pillboxes must sit INSIDE the radius-4 disc around the fact, AND the fact must survive.	Critical-asset protection / defense-in-depth: when ONE node is the irreplaceable principal (the fortress keep, the SAM hub, the data-centre hardened core, the ambassador's residence), the right architecture is dense overlapping enforcement AT that asset, not a thin uniform perimeter that is everywhere but never massed where the threat actually comes. The same logic that gives you a single hardened crypto module behind layered firewalls — concentrate coverage AT the asset, not a uniform spread.	ERQA, military strongpoint, asset protection, defense-in-depth around critical infrastructure
build-defensive-tower-line	Build a Defensive Tower LINE Across a WIDE Front (Not a Cluster, Not a Scatter)	reasoning	rush-hour-arena	Where do you commit your defensive towers when the threat is a rush spread across the FULL WIDTH of the map — not pinched through a single corridor cell, but advancing on every row of a wide front? Military perimeter doctrine and firewall rule design both say: cover EVERY lane across the front, one post per row, so no enemy unit can slip past on an unguarded row. A single dense cluster on one row wastes overlapping fire on one cell while every other row stays open; a scatter near the base never engages the rush at all. The win predicate makes the LINE topology load-bearing — total pillbox count alone is not enough; ≥1 pillbox must sit on EACH of the front's rung rows (cell-exact via radius 0.5), AND those pillboxes must actually KILL the rush spread across the front.	Network firewall / Web Application Firewall rule placement: when every protocol/port could be the path of compromise, the right architecture is one rule per port across the full inspection surface, not three duplicated rules on one port while the rest stay open. Likewise a physical perimeter patrol covers EVERY approach lane across the front — a cluster at one waypoint or a scatter across unrelated nodes both leave the actual front lanes traversable. Defense in depth across a wide approach demands one responder per lane, not many responders at one waypoint.	ERQA, MicroRTS defense, military perimeter
build-engineer-rebuild-after-loss	Build-Engineer: Rebuild Power Plant After Mid-Episode Loss	reasoning	rush-hour-arena	Reactive replan after exogenous loss of a critical-infrastructure building (the Power Plant) mid-episode. An ongoing operation has its power-grid building unexpectedly destroyed by an enemy strike in the opening; the agent inherits a complete production chain (Construction Yard + Refinery + Power Plant + War Factory + Ore Truck + ore patch) and a reserve cash budget; the deadline is real. The model must notice the destruction (the live building count for the Power Plant drops to zero), commit the reserve to rebuilding the Power Plant (not to army units, not to a different structure, not to nothing) AND place it adjacent to the surviving Construction Yard so production stays online — fast enough that the rebuild closes the happened-before latch before the deadline bites.	Disaster recovery after a critical-infrastructure node is destroyed by an external event mid-mission. The autonomous operator inherits a working processing site, a reserve allowance sufficient for exactly one replacement substation, and a strict operational deadline. Recovery requires identifying the lost asset, committing the reserve to rebuilding the same kind of asset (not to defenders, not to expansion, not to nothing), and siting it safely — close enough to the surviving site that the powered loads stay online, far enough from the lingering threat that the new asset survives.	PlanBench replanning under exogenous loss, disaster recovery, exogenous loss, SC2 rebuild-after-trade
build-power-down-defensive	Power Down Non-Essentials Under Load — Reversible Load-Shedding	reasoning	rush-hour-arena	Grid load-shedding under a generator-down state: the agent starts with insufficient generation to cover its installed load (one Power Plant supplies 100 power; the installed drains total ~140), so the grid runs at NEGATIVE surplus from tick 0 (the engine slows production 50% in this state). The agent must REVERSIBLY shed non-essential loads via `power_down` (toggle the building off) so that provided ≥ drained — WITHOUT selling structures it will need later. Selling fails the `has_building` clauses on the load-bearing structures it ostensibly needs to keep around; powering-down the lone Power Plant collapses provided power and fails the `power_provided_gte:100` floor; stalling never restores surplus and times out.	Datacentre load-shedding when one generator trips: the operator must turn OFF non-critical lines (cosmetic lighting, secondary HVAC) so the surviving generator can carry the essential ones (life safety, core compute), WITHOUT decommissioning the shed equipment (it will be re-enabled the moment the second generator comes back). The correct primitive is a reversible disable, not a destructive one — this is exactly the `power_down` toggle vs `sell`.	PlanBench reversible-action planning, operations runbook / load-shedding, SC2 power management, lmgame-Bench resource-constraint reasoning
build-power-online-first	Build Power Online First — Grid Bring-Up Sequencing	reasoning	rush-hour-arena	Standard-operating-procedure (SOP) compliance under a strict happened-before constraint, exemplified by electrical-grid bring-up: the grid (the Power Plant) must be ONLINE before any powered load (the Refinery, Barracks, War Factory) can come up. The opening decision is the entire test — from a pre-placed Construction Yard and a tight budget, the agent must queue the Power Plant FIRST and only THEN the first production / economy building. A model that "just builds the goal building" (the refinery) first finds the order silently blocked (the engine enforces powr as a prerequisite of proc), wastes the budget retrying, and times out. A model that stalls or builds an army instead also misses the bar.	Standard operating procedure with a hard happened-before constraint: a robotic operator commissioning a new processing site must energise the substation BEFORE bringing up the refinery (or any other powered load). Attempting the refinery first is rejected by the interlock; reasoning about the order of operations from the spec is the entire decision.	PlanBench, SOP compliance, electrical-grid bring-up
build-production-throughput-multibuilding	Parallel Production Throughput — Build a Second War Factory to Hit the Quota	reasoning	rush-hour-arena	A delivery deadline forces a manufacturing-throughput decision. A fixed quota of vehicles must ship before a tight clock, and the agent starts with exactly ONE production line (war factory). A single serial line cannot clear the quota in time. The agent has cash for a SECOND identical line; building it and feeding both queues roughly doubles the per-tick output rate (parallel manufacturing — a second assembly line does not split the job, it doubles the service rate). The pack frames the classic queueing-theory call: when one server cannot meet the deadline, add a second server in parallel.	Fleet dispatch throughput under a delivery SLA: a warehouse must ship N orders before a deadline, but a single loading dock (one serial server) clears jobs too slowly to meet it. The operator has budget for a second identical loading dock; commissioning it and routing jobs to both docks doubles the dispatch rate. The decision is a parallelism call — recognise that the single server is the bottleneck and add a second server, rather than pushing harder on the one that is already saturated.	queueing theory, SC2 multi-factory throughput, manufacturing parallelism
build-rally-point-management	Set the Rally Point Forward — Production Logistics Under a Deadline	action	rush-hour-arena	Production logistics / warehouse SLA: a factory whose default ship-to is the loading dock will pile inventory at the dock — pieces never reach the field. The decision is a single ACTION — set the production building's RALLY POINT (its "ship-to" address) to a FORWARD staging area at the front, BEFORE the production clock starts burning. Once the rally is set, every subsequent freshly-built unit auto-routes to the front; the operator does not have to micromanage each batch. Without that one-shot reset, the default rally is right next to the building and the units pile at the base ~38 cells from the action, never engage in time, and the SLA (the tick deadline) is missed.	Production-pipeline standing-order routing: a manufacturing or fulfilment system whose default routing rule sends finished goods to the dock will starve the downstream customer; a single standing-order change ("ship every unit produced at this site to dock B at the forward distribution centre") re-routes the entire pipeline. This is also the SC2 rally-point primitive — one click on the production building, every subsequent unit walks to the rally cell — and the analogue is exactly what real-world logistics calls "set the default ship-to".	SC2 rally management, production logistics, warehouse SLA
build-repair-priority-under-fire	Repair Priority Under Fire — Triage the Refinery, Not the Decoy	reasoning	rush-hour-arena	Three of your structures are under simultaneous attrition and you have one repair organ to commit. The damage picture is deliberately misleading: the pillbox (pbox) is pre-damaged to ~30% HP and LOOKS the most damaged, but it is low-value, heavily armoured, and the grenadiers barely scratch it — it survives on its own. The refinery (proc) starts intact but is on a LETHAL trajectory: its grenadier band kills it within a few turns unless you toggle `repair` on it immediately. The intended decision is criticality-weighted triage: rank by value x lethal-trajectory, not by raw damage percent. Repair the proc first (and on the harder tiers the war factory too); skipping the refinery to "fix the worst-looking building" loses it, and stalling loses it.	SRE / disaster-recovery incident triage: several services are degraded at once and the operator has one repair lever. The correct move is to rank by criticality x blast-radius, not by whichever dashboard is reddest — restore the income-bearing / load-bearing subsystem that is about to fail hard before touching a loud-but-low-value one. The same criticality-weighted prioritisation governs SC2 SCV-repair target choice and aircraft / plant preventive maintenance.	disaster recovery triage, criticality-weighted maintenance, SC2 SCV repair
build-sell-and-rebuild-elsewhere	Sell and Rebuild Elsewhere — Recoup Capital, Relocate Production	reasoning	rush-hour-arena	A forward refinery (proc) sits in the path of an incoming enemy hunt band that will raze it within ~25-30 turns; starting cash alone does not cover building a new proc at the safe target region. The only path to a fresh proc inside the tick budget is to SELL the exposed proc (recouping 50% of its build cost) and use the refund plus starting cash to BUILD a new proc and PLACE it at the safe target region far from the rush. Stalling, building without selling (cash gated), and placing the new proc in the wrong region all lose; only sell-then-rebuild-at-safe-region wins.	Liquidate a deteriorating asset to fund a relocation: a forward production node is about to be lost to environmental damage, and the capital reserve alone is insufficient to commission a replacement node elsewhere. The right move is a deliberate salvage of the at-risk node (recovering ~half the build capital in liquid form) which, combined with the on-hand reserve, funds a new node at a safer site BEFORE the original is lost for zero recovery. Letting the asset be destroyed loses 100% of its capital; salvage-and-redeploy preserves 50% + funds the new site.	capital reallocation, SC2 sell mechanic, financial reallocation
build-sequence-tech-cheapest	Cheapest War Factory — Cost-Minimal powr → proc → weap Build Order	reasoning	rush-hour-arena	Cost-minimal build-order planning under a fixed, non-replenishing budget: the agent must reach the war factory (`weap`) by spending the LEAST cash on the ONLY affordable prerequisite chain (powr → proc → weap). There is no ore and no income — the starting cash is the entire budget, tuned to exactly the cost of the minimal path. Any detour through a non-load-bearing structure (a barracks, a pillbox, an early infantry unit) bloats the bill of materials and exhausts the budget; the war factory can then never be funded. Tests that the model can plan the minimum-COST prerequisite chain — not merely SOME plan that arrives — under a budget that only the cost-minimal plan can afford. Sibling of build-sequence-tech-fastest (the time-optimal axis); here money, not the clock, is the constraint.	Capital-minimal commissioning of an autonomous manufacturing cell under a fixed procurement budget: the cell must bring a target machine online by buying only the strictly required upstream stations (power → feedstock → assembly). The budget is fixed and non-replenishing; procuring a non-required station (an extra quality bench, a spare buffer) spends capital that the assembly station then cannot be funded with. Only the minimum-cost precedence chain stays within budget.	PlanBench cost-optimal, BOM cost minimization, budget-constrained planning
build-sequence-tech-fastest	Fastest War Factory — Cost-Optimal powr → proc → weap Build Order	reasoning	rush-hour-arena	Cost-optimal build-order planning under a tight deadline: the agent must reach the war factory (`weap`) on the shortest prerequisite path (powr → proc → weap). Any detour through unneeded structures (a barracks, a second power plant, an early infantry training queue) bloats the bill-of-materials and overruns the budget. Tests that the model can plan the minimum-cost prerequisite chain — not merely SOME plan that eventually arrives — under a deadline that only the optimal plan satisfies.	Critical-path planning in autonomous manufacturing: a cell must bring a target machine online by a fixed cycle-time, choosing the minimum set of upstream stations to commission first (power → feedstock → assembly). Adding non-load-bearing stations to the ramp-up plan (a non-required quality station before assembly) blows the deadline; only the cost-optimal precedence chain meets spec.	PlanBench cost-optimal, BOM manufacturing
build-sequence-tech-most-resilient	Resilient War Factory — Redundant Power Survives a Strike (N+1 Build Order)	reasoning	rush-hour-arena	Robust build-order planning: reach a tech capability AND keep it through a foreseeable disturbance. The agent must bring a war factory online and field an armoured force, but a mid-episode enemy strike razes one power plant. A build order that provisions only a single power plant is a single point of failure — when the strike lands the factory drops to low power and the army never completes in time. The resilient build order pre-builds a second, redundant power plant before the strike, so the grid stays in surplus and production never slows. Tests whether the model plans for the disturbance (N+1 redundancy on the critical prerequisite) rather than merely planning the shortest path to the goal.	N+1 redundancy on a critical utility. An autonomous production cell depends on a power feed to run its assembly machine; a known hazard will knock out one feed mid-shift. Resilient planning commissions a second, independent feed BEFORE the outage, so the assembly machine never drops below rated throughput. Provisioning only one feed — the shortest plan to first article — halts the line the moment the hazard strikes and blows the delivery deadline.	PlanBench robust planning, N+1 resilient design, redundancy
build-tech-skip-decision	Skip the Unneeded Tech Tier — Clear a Light Garrison with a Basic e1 Swarm	reasoning	rush-hour-arena	The objective only requires basic units. The agent starts with a Construction Yard and an Allied barracks already standing, so cheap rifle infantry (e1) are trainable from turn 1 with no prior tech step, and a light enemy garrison — also basic infantry — is incoming. An e1 swarm clears a light infantry garrison comfortably. The trap is to climb the full tech chain (power plant → refinery → war factory → service depot → medium tanks) to bring a higher tech tier online that the objective never asked for: that whole tier costs budget and, critically, clock — by the deadline no tank has fielded at all. The pack tests whether the model recognises that the goal does not require high tech and skips straight to the cheap path.	Right-sizing the plan to the task. A delivery only needs a basic pick-and-place cell that is already commissioned; provisioning a full high-precision assembly line (extra power, feedstock, an assembly station, a calibration depot) before starting work is a whole capability tier the job never required — it burns budget and blows the cycle-time deadline. The competent planner prunes every step the goal does not consume and executes the minimal plan with the capability already on hand.	PlanBench unnecessary-step pruning, lean process, YAGNI
building-and-planning	Base Building — Construct Structures Respecting the Tech Tree	reasoning	rush-hour-arena	Construction planning under dependency and spatial constraints: decide a build order that respects engine-enforced tech-tree prerequisites (a barracks needs power; a pillbox needs the barracks), place structures where the objective requires (a defended direction), and when needed creep the base across the map to found the defensive line in a designated far region. The decision is the plan — order, placement, and relocation — not the motor control of any single build. The prerequisite is genuinely enforced by the engine: a power-less barracks NEVER completes, so a greedy "build the goal building first" policy cannot win.	Autonomous construction / facility-layout planning: a task-graph with prerequisites (B needs A, C needs B) plus spatial goals (assemble in zone Z; relocate and found the depot near region R) under a time budget — out-of-order or mis-placed construction never satisfies the goal.
combat-attack-from-behind-fog	Reasoning — Attack from Behind via a Fog-Bypass of the Frontal Line	reasoning	rush-hour-arena	Four medium tanks (2tnk) at the west edge face a tight vertical wall of pillboxes (pbox) and anti-tank rocket soldiers (e3) at x=50, spanning y=15..25 and facing west. The objective is an UNDEFENDED enemy construction yard (fact) at (100,20) — behind the line. Charging head-on along y=18..22 puts the lead tank inside the kill envelope of 4+ defenders simultaneously and the column never clears the line in time to raze the fact. The winning play is the FOG FLANK: route the strike force to the far north (y=2) or far south (y=38) — well outside the line's range AND vision — drive east past the line's longitude, then turn inward to descend on the fact from BEHIND. Same enemy, same forces, but the route of approach controls whether the line's prepared fields of fire can engage at all.	Spatial-reasoning under prepared defenses: an intelligent attacker refuses the defender's chosen geometry (the frontal kill envelope) and routes via a corridor outside the defender's sensor + weapon coverage to reach an undefended high-value target. The decision under test is route planning — recognise that a static defensive line is fixed and bypass-able, not that it must be reduced face-to-face.	SC2 hidden assault, military surprise attack, fog warfare
combat-bait-counter-attack	Bait + Counter-Attack — Pull the Guards Off Post, Strike the Undefended Yard	reasoning	rush-hour-arena	A reactive defending cluster (scripted `guard` bot: holds its post, auto-fires in range, lunges at the nearest foe within an aggro radius, snaps back past a leash) is bunched on the approach side of the enemy construction yard. Driving the strike force straight in eats anti-tank fire from the whole cluster and runs the attrition cap before the objective falls. The winning policy commits a cheap fast bait (jeep) on a DIVERGENT vector that comes just close enough to make a SIDE of the cluster lunge off post after the bait, then runs the strike tanks around through the now-vacated flank to destroy the construction yard while it is briefly undefended. Phase 1 (the bait pull) yields zero objective progress and looks locally negative; phase 2, through the slot the bait opened, scores. The credit problem is FEINT-AND-FLANK: the bait must displace a segment of the cluster and the strike must exploit the reversible vacancy in time; bait-only never razes the yard, brute frontal trades the strike force against the full cluster, stalling loses the clock.	Sacrificial-decoy planning against a reactive guarding obstacle: a no-reward enabling action (entice a leashed interceptor off the objective with a decoy on a divergent vector) must precede and temporally overlap the rewarded objective action (commit the payload force on the briefly-cleared flank to destroy the objective before the interceptor snaps back).	SC2 bait micro / sacrificial pull, military feint-and-flank doctrine, CICERO / Diplomacy deception
combat-divide-and-conquer	Combat Reasoning — Divide and Conquer Two Mutually-Supporting Enemy Clusters	reasoning	rush-hour-arena	Four medium tanks (2tnk) at the west edge face TWO enemy clusters at x=60 — a NORTH cluster at y=15 and a SOUTH cluster at y=25 — each composed of 3 anti-tank rocket soldiers (e3, Dragon, range ~5) plus 1 light tank (1tnk). The clusters are 10 cells apart; pushing east on the midpoint axis (y=20) puts the lead tank inside weapon range of BOTH clusters at once, and under AttackAnything stance the 8-unit mass converges and concentrates fire on the lead, busting the survival bar before either cluster is cleared. The winning play is divide-and-conquer (defeat in detail): flank well NORTH (e.g. y=5) so only Cluster A's units are in range and line-of-sight; eliminate A while Cluster B is still ≥20 cells away and closing; then pivot SOUTH (y=35) and repeat against B in isolation. Each engagement is a clean 4-vs-4 trade instead of a 4-vs-8 mass.	Two-front engagement with mutually-supporting defenders: hitting the midpoint between coupled adversaries lets BOTH fire on the leading agent simultaneously, doubling the incoming DPS share. Spatial positioning that breaks line-of-sight or stays outside the second adversary's aggro / weapon envelope sequences the encounter into two 1-vs-1 trades (favourable force ratio per trade), at the cost of route length and clock.	SMAC squad-isolation, CICERO splitting, military divide-and-conquer
combat-flanking-attack	Combat Micro — Flank a Frontal Anti-Tank Line Instead of Charging Head-On	action	rush-hour-arena	Four medium tanks (2tnk) at the west edge face a tight vertical line of anti-tank rocket soldiers (e3) at x=60, fronted by a shield of rifle infantry (e1). The line "faces west" — every defender's weapon envelope covers a tank approaching head-on along the engagement axis (y=20). Charging head-on puts the lead tank inside Dragon range of EVERY rocket soldier in the stack simultaneously and inside small-arms range of the entire shield; concentrated fire destroys the lead before the column clears the line, busting the survival bar. The winning play is the flank: move the strike force off-axis (well north of y=18 or well south of y=22), then approach the line END-ON so only one or two defenders are in range at any moment. Same enemy, same forces, but the angle of attack controls how many enemy barrels can bear on the leading tank.	Multi-agent strike-package geometry: the bearing of contact against a static defender controls how much of the defender's firepower can engage simultaneously. The decision under test is spatial routing — refuse the high-attrition frontal trade and approach via an axis that sequences the engagement 1-vs-1 rather than 1-vs-N.	SC2 flank micro, military flank maneuver doctrine, tactical: avoid frontal trade, force-multiplier through angle of attack
combat-focus-fire-priority	Focus-Fire Priority — Kill the Anti-Tank Threat FIRST	action	rush-hour-arena	Four medium tanks face a small mixed enemy squad at close range: a single high-DPS rocket soldier (anti-vehicle) escorted by 2-3 rifle infantry, all visible in the centre. The decision is target prioritization: concentrate ALL tanks' fire on the rocket soldier FIRST (4-vs-1 kills it in 1-2 decision turns before it can fire more than a couple of rockets), then mop up the infantry. Spreading fire across the squad — via attack_move auto-targeting or by starting on the closer rifle infantry — leaves the rocket soldier alive long enough to kill ≥2 tanks, busting the attrition cap. A brute attack-anything play LOSES; the focus-fire play WINS.	Military strike-package doctrine — when engaging a mixed enemy package, the high-value / high-DPS asset (AA battery, anti-armour weapon, command vehicle) is neutralised FIRST, even when it is not the nearest target. The lower-DPS escort is mopped up after. Target prioritization under fire is the controlled capability: nearest-first or auto-target spreads damage and lets the priority threat keep firing.	SC2 focus-fire micro, MicroRTS target prioritization, military strike-package: hit AA first, RPS-counter unit prioritization
combat-formation-tank-wedge	Combat Micro — Tank Wedge Through a Bracketing Fire Corridor	action	rush-hour-arena	Five medium tanks (2tnk) at the west edge must drive east through a bracketing fire corridor (anti-tank rocket soldiers nested above AND below the engagement axis, plus a single light tank blocker on-axis) and reach an objective region in the east with most of the force intact. Marching as a column along y=20 puts the lead tank inside Dragon range of BOTH brackets at once; concentrated cross-fire destroys the lead, then the next tank inherits the kill envelope, and the column bleeds itself dry before clearing the gap. The winning play is the WEDGE — lead tank on-axis at y=20 absorbing the on-axis blocker's fire, flankers offset to y=18 and y=22 trailing one cell west so they engage each bracket END-ON from off-axis (only 1-2 rocket soldiers from each cluster can fire on a flanker at once). The formation's SHAPE controls how much enemy firepower can bear on the lead unit and sequences the bracket engagements into 1-vs-1 trades instead of N-vs-1 crossfire.	Multi-agent strike-package formation: the geometric SHAPE of the moving formation controls weapon-bearing concurrency on both sides. A column maximises target overlap on the lead unit (worst case for surviving cross-fire); an inverted-V wedge spreads the engagement across the formation's width so each enemy bracket faces only 1-2 of the wedge's units at a time, and the lead's job is to ABSORB the on-axis threat while the flank wings dismantle the off-axis threats. The decision under test is formation discipline: order the force into a wedge BEFORE the contact, not after losses force a retreat.	military tank-wedge doctrine, SC2 formation micro, combined-arms, strike-package geometry: formation shape vs cross-fire
combat-harass-aggro-commit	Combat Harass — AGGRO Commit: Fight Through the Defender	action	rush-hour-arena	A small raider force is staged west; a cluster of enemy harvesters works an ore patch around their refinery in the centre-east, with a single heavy tank standing on-post as the defender. The high-score doctrine is to COMMIT — concentrate fire on the defender first (3-vs-1 tank trade favours the attacker), then mop up the undefended harvesters. Stalling or pure retreat misses the kill bar; attacking the harvesters while ignoring the defender lets the heavier 3tnk pick off the raiders one by one. The intended play accepts some attrition to score both targets: aggro raid doctrine, not skirmish-and-pull-back.	Military forward-attrition / aggressive raid doctrine — a small expeditionary force commits to a higher-impact engagement (defender + payload) rather than disengaging at first contact, accepting losses to achieve the larger objective. The decision is threat-priority and commitment, not micromanagement of retreats.	SC2 aggressive worker harass with commit, military forward attrition warfare, RTS aggro doctrine: accept losses for higher kill ratio, guerrilla raids with kill-everything objective
combat-harass-balanced-hit-and-run	Balanced Hit-and-Run — Pulsed Worker Harass under a Leashed Defender	reasoning	rush-hour-arena	BALANCED pulsed harassment under a reactive defender. Two fast raiders (jeeps) must score N kills against a cluster of soft enemy workers overwatched by a leashed HEAVY TANK (anti-vehicle, immune to the jeeps' anti-infantry MG and lethal to them in one volley). The defender holds its post a few cells off the cluster but lunges at any foe within AGGRO=16 of the post and snaps back past LEASH=18. Committing inside the guard's cannon envelope until the workers die loses jeeps (cap is 0 on medium — the BALANCED bar is that retreat ACTUALLY worked). Sitting out of aggro never scores. The only winning policy is the pulse: approach to MG range of the workers → strike one → retreat past leash → guard snaps back → re-engage → repeat. Retreat is load-bearing — the first pack in the bench where it is required, not merely permitted.	Pulsed manipulation under a reactive guarding agent: the policy must complete a sequence of small reward-bearing actions inside the guard's intervention radius, then disengage past the leash before damage is taken, then re-enter — without taking any hit at all. A commit-until-done policy gets damaged; a stand-off policy never scores. The temporal structure of the reward (cycle of engage / disengage) is what is being tested.	SC2 muta-harass / sentry-harass (pulsed engagement), military guerrilla warfare (hit-and-run doctrine), pulsed-load attack with retaliation avoidance, RTS balanced harass: kill workers without losing raiders
combat-heli-flank	Heli Flank — Air Mobility Over a Ground Wall	action	rush-hour-arena	A two-helicopter strike must engage an enemy infantry cluster pinned behind a contiguous wall of pillboxes — a barrier that denies any ground push inside the tick budget. The decision under test is air-mobility recognition: only the helicopter `move_units` / `attack_unit` flight path crosses the wall; sending a ground unit at the same target stalls against the impassable footprint and never reaches the cluster. Stall (`observe` only) never engages; a ground brute push (no ground units exist in the agent force, so this degenerates to "no valid plan") fails the kill bar.	Vertical envelopment / airmobile assault doctrine — the helicopter exists to bypass terrain that denies the ground arm. The same logic as a UAV swarm routing over a denied surface corridor: pick the right modality for the obstacle.	AH-64 deep-attack flight profile, airmobile vertical envelopment
combat-hold-chokepoint	Combat Micro — Hold a Narrow Pass Against a Larger Force	action	generated	A small medium-tank squad must defeat a numerically larger enemy light-tank force by HOLDING the chokepoint — a per-tier-wide corridor that is the only path across the map. The terrain caps how many attackers can bring weapons to bear at once: anchored at the corridor mouth, the squad faces only the few enemies that fit the lane abreast and grinds the larger force down piecemeal. The same squad fighting in the OPEN — having charged east through the corridor, or having pulled west and let the enemy spill into the open — is surrounded by the whole force at once and focus-fired down. The decision under test is positional: anchor the squad IN the chokepoint where the geometry does the work.	Defensive positioning at a terrain bottleneck: a small team holds a corridor / doorway / bridge so an adversary's numerical advantage is neutralised by frontage — only as many opponents as the bottleneck is wide can engage simultaneously. Choosing to fight where the geometry caps the enemy's effective force (rather than on open ground where the full force concentrates) is the load-bearing spatial-reasoning decision. Anchors: military chokepoint / defile defense, Thermopylae (480 BC), the StarCraft 2 ramp hold.	military chokepoint defense, Thermopylae, SC2 ramp hold
combat-kite-and-pull	Combat Micro — Kite and Pull a Slow Heavy Force	action	generated	A fast light strike force must destroy a slower, heavier enemy that out-trades it head-on. The only winning play is the hit-and-PULL cycle: each turn, strike the heavy at weapon range, then RETREAT the strike force out of the heavy's lethal close-range window before it can fire back — and repeat. Standing and fighting LOSES: the heavy cannon collapses the light force's HP before its own runs out. The skill being measured is combat micro under a mobility asymmetry — exploit the speed edge by stringing together move-away + attack cycles instead of issuing one beeline charge.	A fast/light agent team defeating a slow/heavy adversary by exploiting a mobility asymmetry: a closed-loop evade-then-engage policy rather than a one-shot commit. The per-turn decision is proximity control — stay outside the adversary's lethal radius while delivering effect at standoff range.	SC2 kiting micro, cavalry skirmish doctrine, military fire-and-maneuver doctrine, economy-of-force
combat-kite-jeep-vs-tank	Combat Micro — Kite a Slow Heavy Tank with Fast Raiders	action	rush-hour-arena	Three fast tank raiders must kill ONE enemy heavy tank that is actively HUNTING the raiders' centroid. The only winning play is kiting: each turn, if the heavy tank is closing into one-shot range, MOVE the raiders AWAY (using their speed advantage) and attack_unit the heavy from outside its lethal close-range window. Repeat until the heavy falls. Stand-and-fight LOSES — the heavy tank's cannon out-trades raider weapons at close range and a static engagement collapses raider HP before the heavy's. The skill being measured is combat micro: target a moving threat, exploit the unit-speed asymmetry, and string together a sequence of move-away + attack_unit cycles instead of issuing one beeline order.	Fast/light agents defeating a slow/heavy adversary by exploiting a mobility asymmetry: a closed-loop policy of evade-then-engage, rather than a one-shot beeline. The decision under test is per-turn proximity control (stay outside lethal radius, fire at range).	SC2 kiting micro (vulture/muta-vs-marines), cavalry-vs-pikeman maneuver, military fire-and-maneuver doctrine, skirmisher tactics
combat-naval-shore-strike	Naval Shore Strike — DD shells a coastal garrison	action	generated	A destroyer on water faces a small infantry garrison on the adjacent shore. The decision: target the garrison with the destroyer's primary armament from across the water (the destroyer cannot move onto land). A stall policy never engages; a policy that tries to drive the destroyer ashore stalls at the shoreline.	Naval gunfire support — the asset operates in a domain (water) that constrains its movement and weapon employment, but its weapon range crosses the domain boundary onto land.	naval-mvp, amphibious gunfire support
combat-pincer-coordination	Pincer Attack — Two Squads Strike Simultaneously From Two Sides	action	rush-hour-arena	Two armoured squads start at OPPOSING latitudes on the west edge (one to the north, one to the south) and must converge on a central enemy cluster simultaneously, hitting the defender from two sides at once. Sending a single squad alone fails on two counts: only 3 tanks (not the required 4) can occupy the objective region, and the lone squad is shredded by the cluster's mass anti-armour before clearing it. Sending both squads but not synchronised (one held back, the other commits first) lets the cluster focus on the lead squad and destroy it before the trailing squad arrives, busting the attrition cap. Only a true simultaneous two-prong commit clears the cluster cleanly and leaves the joint force standing on the objective.	Synchronous two-team pincer attack on a contested objective: each team approaches on a different bearing and both must arrive within a common window so the defending agents cannot focus on either team in detail. Serialising one team behind the other lets the defender concentrate fire and destroy the lead team before reinforcement; only the joint simultaneous commit overwhelms the defence and preserves the strike force.	SC2 multi-prong / two-prong attack timing, military pincer movement doctrine, envelopment from multiple angles, synchronisation of dispersed forces
combat-prevent-retreat	Combat Encirclement — Cut Off the Enemy's Retreat Before You Strike	action	rush-hour-arena	A cluster of enemy rifle + rocket infantry sits at the centre of the map. A head-on column charge brings every tank within rocket range of the cluster and bleeds the strike force below the survival cap. The winning play is the Cannae / encirclement idiom: detach ONE tank on a flank route (around the enemy cluster via y=5..10 or y=30..35, out of rocket range) to take the eastern anvil position at (85,20), then commit the main body of THREE tanks from the west. The win predicate explicitly requires the agent to have ESTABLISHED an eastern cut-off (≥1 own unit at (85,20,r=8)) at the moment the kill bar is met AND to have ≥3 tanks alive. Brute charging without the cut-off bleeds the force under the survival cap — a column attack-moving into the compact stationary e3 wall loses ≥2 tanks whether it heads east or for the centre. Stalling never opens fire, also a LOSS on the clock.	Multi-agent strike geometry: the placement of a "blocker" / "anvil" unit on the opposite side of the target BEFORE the main engagement opens is what converts a shove into a kill. The decision under test is spatial sequencing — establish the cut-off via a path that avoids the enemy's effective weapon envelope, THEN engage; reversing the order forfeits the capability that the cut-off enables.	military encirclement, Cannae doctrine, SC2 encirclement, the hammer needs an anvil
combat-protect-vip-escort	Protect the VIP — Escort a Fragile Unit Through the Hazard	action	rush-hour-arena	A single fragile high-value unit (the VIP — an unarmed harvester, the only one of its kind on the map) must reach the east extraction point. Four medium tanks at the same west staging zone are the protective detail. The VIP cannot fight (no weapon); an enemy interception force — a Soviet heavy tank backed by rifle infantry — patrols the route and runs the lone VIP down. The escort must move AHEAD, destroy the interceptors, and only then let the VIP cross. Sending the VIP alone or at the head of the column gets it caught and killed; the escort's offence preserves the VIP.	Protected-asset transit / VIP detail / diplomatic convoy: a high-value principal (unarmed, fragile) must traverse a contested route. The protective detail moves in bounding formation ahead and to the flanks, neutralising threats along the route before the principal traverses. The principal does not engage directly — its survival depends on the detail's offence, not its own fight.	SC2 VIP-survival missions, military VIP protection doctrine, diplomatic security / convoy escort, fragile-target preservation tactics
combat-retreat-after-engagement	Combat Retreat — Disengage to Preserve the Force	reasoning	rush-hour-arena	Four medium tanks face a numerically superior enemy squad (rocket infantry + heavy tanks) at close range. The fight is unwinnable head-on: the rocket infantry and heavy cannons together out-trade the column inside the loss cap. The only winning play is the SC2 retreat-micro idiom: commit briefly to score the kill bar on the soft anti-tank infantry (e3), then break contact and pull the surviving tanks BACK to the safe rally point in the west before attrition busts the force-preservation cap. The decision under test is disengage TIMING — stalling never engages (kill bar unmet), brute attack-until-death loses the force, and only the engage-then-retreat play wins.	Military tactical-withdrawal doctrine and the SC2 retreat-micro pattern — when a battle goes poorly, the policy must IDENTIFY the losing trade and pull back to a safe rally before the unit is destroyed. The skill is reading the engagement state and issuing a withdrawal order at the right moment; over- or under-commitment both fail. Preservation of force is its own objective.	SC2 retreat micro / disengage timing, military tactical withdrawal doctrine, preservation of force / live-to-fight-another-day, skirmish-and-pullback tactics
combat-rocket-soldier-anti-vehicle	Rocket Soldier vs Heavy Armour — Pick the Hard-Counter Unit Type	reasoning	rush-hour-arena	A starting cash budget that funds exactly one coherent composition must be allocated against a pre-placed enemy band of HEAVY TANKS (3tnk Soviet heavy on easy/medium, 4tnk Mammoth on hard). The agent must train ROCKET SOLDIERS (e3, the anti-vehicle Dragon launcher), not light tanks (1tnk — the budget buys only ~2, which lose attrition to heavy armour) and not rifle infantry (e1 — no anti-armour weapon, racks up zero kills against tank armour). The decision is the RPS counter CHOICE (matchup-winning unit type) given the visible threat composition.	Anti-armour procurement against an armoured concentration: a fielded inventory of TOW / Javelin / RPG launchers defeats a main-battle-tank column, while an equivalent budget spent on light AFVs or rifle squads is force-on-force inferior. Asset CLASS — not cost or count — determines the engagement outcome.	SC2 hard-counter, anti-armor procurement, military RPS
combat-skirmish-then-disengage	Combat Skirmish — Strike, Score the Kills, Pull Back to Recovery	action	rush-hour-arena	SKIRMISHER doctrine in the single-engagement frame: four fast raiders (jeeps) must drive east into a slow infantry cluster, score AT LEAST 3 kills, and then PULL BACK to the recovery zone around the western start before the clock expires AND while keeping at least 3 raiders alive. The skill under test is the decision to STOP FIGHTING and disengage — committing until the enemy is wiped or until the strike force is destroyed both LOSE (commit leaves the raiders at the kill site instead of the recovery zone; over-commit on hard loses raiders to the hunt-bot spawn waves). Distinct from the BALANCED pulsed harass-retreat cycle (combat-harass-balanced-hit-and-run, which is many small pulses with zero attrition): this pack is ONE big engagement done well, with a positional recovery bar.	Mission-with-egress: a mobile manipulator must complete a threshold of reward-bearing actions in a contested workspace, then return to a safe staging region before a time or attrition budget expires. Knowing WHEN to stop the productive sub-task and start the egress is the decision under test — a productivity-only policy (greedy accumulation) leaves the agent far from the staging region at deadline and fails the egress clause.	SC2 skirmisher tactics, military reconnaissance-by-fire, harass-and-disengage doctrine, armoured cavalry doctrine
combat-stance-mgmt-attack	Hunt Authorisation — Lift Stance to Pursue Scattered Enemies	action	rush-hour-arena	An engagement-authority escalation drill: four medium tanks are pre-staged at the west edge of the arena on RETURN-FIRE (stance:1) — they will not open fire unless attacked. A scattered enemy force (riflemen + a light tank) is spread across the eastern half of the map at positions that don't bring the fight to the agents. The mission is to call `set_stance(units, 3)` to escalate the formation to AttackAnything so the engine's stance:3 hunt path advances each tank to the nearest scattered enemy and wipes them out. Without the escalation the formation idles (return-fire never triggered, hunt never licensed) and the deadline bites as a real LOSS.	A patrol-vs-pursuit authority switch: a perimeter defence team is on weapons-tight standing orders (return fire only). The operator detects scattered hostiles deep in the perimeter that aren't engaging the defenders. The operator must issue the "weapons free, hunt and clear" command to escalate the team from defensive return-fire to active pursuit. The skill is the operator's decision to ESCALATE the engagement-authority at all — not the precise moment within the engagement window.	military ROE escalation, SC2 stance micro
combat-suicide-charge-mission	Suicide Charge — Sacrifice the Force to Raze a High-Value Objective	reasoning	rush-hour-arena	Forlorn hope / military sacrifice doctrine. A high-value enemy construction yard (`fact`) sits at the far east of the map, defended by an anti-armor picket; the agent's small strike package at the west cannot punch through without heavy losses. The OBJECTIVE is the building's destruction — keeping the force intact is NOT required and is not achievable. A "preserve the force" policy that engages carefully and tries to save units loses on the clock (the picket grinds the lead element down and the rest never arrive); a half-commit that advances and halts short of the objective also times out. Only an all-in commit that drives the WHOLE force decisively at the objective and focus-fires through the picket — trading most of the attackers for the kill — actually razes the fact in time. This is delayed-terminal credit assignment with an explicit cost-vs-objective trade-off: every unit lost looks locally negative, but the only successful plan accepts that the bulk of the force is spent and only a remnant survives to land the killing blow.	Expendable strike package / single-use intervention: a fleet of cheap drones must destroy a critical adversary asset under a hard deadline; loiter or stand-off attacks miss the window, so the planner must commit every platform on a one-way attack trajectory, accepting total platform loss as the operating cost of mission success.	military sacrifice doctrine / forlorn hope (West & East canonical), SC2 expendable strike package (all-in cost-objective trade), MicroRTS / SC2LE attack-the-base under deadline with attrition-OK
combat-tank-vs-tank-engagement	Tank-vs-Tank Mirror — Focus-Fire, Lanchester Square Law	action	rush-hour-arena	A three-tank strike force engages a stationary enemy tank line. The decision under test is combat micro: close to cannon range, HOLD the engagement at range, and concentrate `attack_unit` fire on one target at a time — eliminate the nearest enemy, then the next, working down the line. Per the "concentration of force" doctrine and the Lanchester square law, a force that holds and focus-fires removes enemy OUTPUT DPS one whole tank per kill and clears the line keeping its strength; a force that brute `attack_move`s straight INTO the enemy position bunches itself in the enemy's midst, absorbs the whole line's crossfire at once, and is wiped before it can clear the engagement. On medium the agent is numerically out-gunned 4-vs-3, so the controlled engagement is load-bearing: only a held, concentrated focus-fire push clears ≥3 of the 4 enemy tanks while keeping ≥2 of its own. Stalling never engages and loses on the kill bar; the brute drive-in loses on the survival cap / kill bar; only the controlled focus-fire engagement wins.	Military "concentration of force" doctrine (one of the Principles of War): a smaller or equal force concentrated at the decisive point can defeat a numerically equivalent dispersed enemy. The per-kill removal of enemy OUTPUT DPS is the closed-form Lanchester square-law advantage of concentrated fire; the test mirrors the SC2 mirror-tank micro / marine-vs-marine engagement where the side that focus-fires one target at a time wins the trade against a numerically equal foe spreading fire across the whole line.	SC2 mirror micro, Lanchester square law, concentration of force
combat-tanya-vs-rush	Tanya vs e1 Rush — Hero Engagement	action	rush-hour-arena	A single Allied commando (Tanya) holds the centre against a small rush of conscript rifle infantry. Tanya is a hero unit: 3x the HP of a basic rifleman, fast, and her sidearm out-DPSes a small e1 pack. But she spawns on hold-fire stance — a do-nothing policy leaves her standing still under fire and she dies. The decision is "actively engage the hero asset" vs "observe / stall". The intended policy issues attack-move (or flips her stance to defend / attack-anything) and wins; stall loses.	Hero / commando doctrine — a single high-DPS asset is force-multiplying only when active. A door-kicker robot held in standby while threats approach is wasted. The capability is "commit the hero asset to the engagement" — operators who fail to commit lose both the asset and the engagement.	RA hero unit (Tanya commando), RTS micro: engage the hero, asymmetric high-value asset doctrine
combat-target-priority-highvalue	Target Priority — Kill the High-Threat Units FIRST	action	rush-hour-arena	Four medium tanks face a mixed enemy cluster at close range: a screen of cheap rifle infantry (e1 chaff) backed by THREE high-threat anti-armour rocket soldiers (e3). The decision is threat-weighted target prioritization: concentrate ALL FOUR tanks' fire on the rocket soldiers FIRST (each dies in 1-2 decision turns to 4-vs-1 fire, before it can land more than a couple of anti-tank rockets), THEN mop up the chaff. Engaging the chaff first — via attack_move auto-targeting the nearer rifle screen, or by explicitly attacking the cheap infantry — leaves the three rocket soldiers firing through the entire mop-up and they whittle the squad below the survival floor. A brute attack-move play and a kill-chaff-first play both LOSE; the threat-first focus play WINS.	Military target prioritization under fire — when engaging a mixed force, the high-threat asset (anti-armour weapon, AA battery, command vehicle) is neutralised FIRST, even when it is not the nearest target, because it out-damages the cheap escort. Killing the low-threat chaff first lets the priority threat keep firing and attrits the friendly force. Threat-weighted engagement — not nearest-first — preserves the squad.	SC2 focus-fire target priority, threat-weighted engagement, military target prioritization
combat-vehicle-vs-infantry-counter	Hard-Counter Selection — Tanks vs Rockets vs an Infantry Threat	reasoning	rush-hour-arena	A fixed cash budget funds EITHER a small armoured fist (3× 2tnk medium tanks @ $850) OR a mass anti-tank rocket squad (8× e3 rocket soldiers @ $300) OR a rifle company (up to 25× e1 @ $100). The enemy is PURE rifle infantry (e1 mass entrenched at centre). The correct rock-paper-scissors counter is armour: the medium tanks soak small-arms fire and shred soft targets at range. Rockets are the WRONG counter — anti-tank munitions against soft infantry waste cost-per-effect and the rocket squad's short stand-off + low HP gets out-DPSed by the rifle mass on attrition. Matching the enemy with own rifles is a 1:1 trade with no positional advantage and loses. Stalling never reaches the kill bar. The capability under test is hard-counter SELECTION: scout the threat composition, infer it has no anti-armour, and commit the WHOLE budget to the dominant counter.	Capability counter-selection in defense procurement / fleet composition: buying anti-tank guided missiles is wrong against a soft-target infantry threat (cost-per-effect waste), buying rifle squads matches the threat 1:1 with no advantage, buying armoured vehicles is the dominant choice. The decision under test is reading the threat profile and committing the whole capital budget to the dominant counter rather than hedging across categories.	SC2 hard-counter, military RPS counter, capability-based defense
coord-converge-on-target	Convergent Attack — Three Squads, One Defended Target	action	rush-hour-arena	Three armoured columns start on three different bearings (NORTH, WEST-of-objective, SOUTH) and must converge on a single defended enemy construction yard from three different directions. The win requires every tank of all three columns to reach the objective region with the yard destroyed: a single-column tour delivers only a third of the force, a serialized two-column attempt only two thirds — neither reaches the joint-arrival threshold. The advertised capability is synchronous multi-fleet convergence on a shared objective; a stall loses on the clock.	Synchronous multi-robot rendezvous on a contested objective: each team must dispatch on a different ingress vector so that all teams arrive at the goal; dispatching only one or two teams leaves the joint payload threshold (≥n teammates on the goal) unmet and the objective unfulfilled.	SC2 triple-prong assault timing, military convergent attack / pincer movement, SMAC squad convergence, envelopment doctrine: hit from multiple directions simultaneously
coord-cover-and-move	Bounding Overwatch — One Squad Covers While the Other Moves	action	rush-hour-arena	Two armoured squads must cross a centre-of-map FIRE ZONE held by static anti-tank infantry. Charging both squads through together stacks every enemy's fire onto a single dense column and busts the attrition cap. Sending one squad alone leaves the other idle and the lone column still absorbs ALL the cluster's fire. The intended capability is bounding overwatch: one squad stops just OUTSIDE the cluster's range and acts as the COVER team — firing on the cluster to draw its attention from the closer flank — while the BOUNDING team sweeps WIDE through the periphery (outside enemy sight) to a forward position; then the roles alternate so the cover team can also cross safely. The cluster's fire is split per phase rather than stacked on one column, so each squad takes at most one loss.	Multi-agent traversal of a contested corridor where defenders have a fixed effective range envelope and concentrate fire on whichever cluster is closest. Single-team traversal eats the full defensive fire envelope; the joint policy alternates an OVERWATCH role (suppressive fire drawing defender attention from a flank) with a BOUNDING role (sweep through the cleared periphery). The coordination is the role-alternation — leapfrogging cover and move — not just splitting the force.	US Army FM 3-21.8 bounding overwatch (fire-and-maneuver doctrine), SC2 siege-tank leapfrog advance under enemy fire, SMAC coordinated cross of a danger zone with split defender fire, MicroRTS multi-squad alternating cover-and-move
coord-diversionary-attack	Diversionary Attack — A Diverts South, B Razes the Real Target North	reasoning	rush-hour-arena	Two squads command a split-attack against an enemy that holds TWO key buildings: a REAL target (construction yard) with light defence and a DECOY (power plant) with heavier defence. The scripted `guard` enemy holds post and lunges only when baited into its aggro radius. Driving each squad onto its nearest visible target is a trap: the closer-stronger squad (tanks) commits onto the decoy's heavy anti-tank garrison (gets shredded AND razes the wrong building — only a `fact` in the target region scores), the closer-weaker squad (jeeps) can't crack the real target's garrison fast enough, the clock expires. The winning policy SWAPS the assignment: commit the fast disposable jeeps south-east into the decoy's heavier defence to bait its garrison OFF the fact line; commit the tanks north-east through the now-thin fact perimeter to raze the construction yard before the surviving light garrison can trade out the column. Phase 1 (the diversion) yields zero objective credit — only phase 2, after the south defenders have committed south in pursuit, scores. The credit problem is DIVERSIONARY ASSAULT: a no-reward enabling action (drag the heavier defence away from the real target with a disposable lure squad) must precede and temporally overlap the rewarded objective action (commit the main strike on the briefly-undefended real target).	Concurrent multi-robot dispatch with deception: two independent teams must coordinate against a reactive adversary, where one team takes a NO-REWARD enabling role (drawing the adversary's attention to a decoy target on the OPPOSITE flank) while the other team — counter-intuitively assigned to the FARTHER objective — completes the rewarded task. The lazy nearest-task assignment is degenerate: every team must commit ACROSS the map to the objective its sibling enabled.	SC2 multi-prong / split-attack assault, military diversionary tactics (Sun Tzu loud feint + quiet main thrust), CICERO / Diplomacy deception, advertising: loud feint draws competitor attention while quiet launch hits real market
coord-mutual-support	Mutual Support — Advance the Squad as a Tight Ball	action	rush-hour-arena	A six-tank squad at the west edge must advance east, punch through a belt of anti-tank harasser clusters, converge on an enemy cluster near the centre, destroy the defenders, and finish with at least five of the six tanks alive. Advancing strung-out — a single-file column spread across many cells — feeds each harasser cluster one or two lead tanks at a time; the cluster concentrates its fire and destroys the isolated leaders before the rest of the squad closes up, and the force is defeated in detail. The winning play is to advance as a TIGHT BALL: keep all six tanks inside a small clump so the whole squad enters each cluster's fire envelope together, every tank's cannon concentrates on the cluster, and the defenders are erased in one or two volleys before they can finish any single tank. Mutual support means no element ever advances outside the supporting range of the others.	Multi-agent coherent advance under fire: a defender concentrates its fire on whichever agent is closest and within its weapon envelope. A strung-out formation lets the defender pick off the leading agents one at a time (one-sided focus fire); a coherent clump brings the whole team's firepower to bear at once (reciprocal focus fire) and out-trades the defender. The decision under test is formation cohesion — keep the squad inside mutual-support range throughout the advance, regrouping before each contact rather than racing the fastest units ahead.	military mutual support, SC2 ball micro, SMAC squad coherence, defeat-in-detail avoidance: keep the force concentrated
coord-relay-attack	Relay Strike — Heavy Tanks Soften, Infantry Mop Up	action	rush-hour-arena	Two-wave relay assault: the first wave (heavy tanks) commits to destroy the enemy armour line; the second wave (light infantry) follows up to clear the remaining enemy infantry. Sending the light second wave in first exposes the soft infantry to the un-suppressed enemy tank line and the wave is shredded before any kills materialise; sending both waves together exposes the light wave to the same tank fire envelope before the heavy wave can suppress them. Only the relay ordering (A engages → softens → THEN B advances) preserves the force AND finishes inside the deadline.	Two-team relay clearing operation: the suppression team (heavily armoured) commits first to neutralise the heavy threats; the mop-up team (lightly armoured, fast, optimised for the secondary target class) advances only after the heavies are down. Concurrent commitment over-exposes the mop-up team to the heavies they are not equipped for; the suppression-then-clear handoff is the doctrine.	SC2 attack-wave timing, SMAC relay strike doctrine, military overlapping fires, fire-and-maneuver: bound-and-bound
coord-relay-vision-chain	Vision Relay Chain — Space Scouts to Keep the Far Objective Observed	action	rush-hour-arena	To keep a distant objective continuously observed, scouts must be spaced in a CHAIN — one per leg of the corridor, each within vision range of the next — so coverage propagates from the base out to the far objective. Sending a single scout all the way forward leaves the corridor uncovered behind it; bunching the scouts together gives reach at one spot but no depth. Only the distributed relay chain holds the whole corridor and the far objective at once.	Multi-robot sensor-network coverage / communications relay: a swarm tasked with maintaining situational awareness of a remote site cannot send one robot out (it loses the relay back to the operator) nor keep the swarm clustered (it cannot reach). The robots must space themselves into a relay chain so each link bridges to the next and the network spans base to objective — the classic line-of-sight relay / mesh-coverage problem.	military relay chain, sensor-network coverage, communications relay, SC2 scout-line map control
coord-squad-handoff	Squad Handoff — Squad A Delivers, Squad B Takes Over (Wave-2 then:)	action	rush-hour-arena	Sequenced multi-team operations: team A delivers the first objective, team B takes over for the next, and so on — the order of arrival matters, and a single team cannot serve both legs because each leg requires a SPECIFIC kind of unit. Relay race (baton handoff), construction site handoff (steelworker phase finishes, electrician phase begins), and supply-chain ordered handoff are the everyday analogues.	Multi-robot fleet with role specialisation and sequenced subtasks: the survey drone must reach landmark A before the delivery truck is dispatched to landmark B (the delivery is meaningless without the survey). One fleet cannot do both legs because the drone and the truck are physically different vehicles for different jobs.	Watch-And-Help concurrent multi-agent handoff, SMAC squad sequencing, military passage of lines doctrine, relay race / construction handoff
coordination-ordered-rendezvous	Ordered Rendezvous — Deliver Squads to Waypoints In Sequence	action	rush-hour-arena	Parallel multi-team delivery under an ordering constraint: waypoint A must be reached before waypoint B before waypoint C, each requiring two or more units present, all within one tight overall window. Skipping or out-of-order delivery does not count. Single-column tours blow the deadline; only concurrent dispatch with leg-by-leg ordering succeeds.	Sequenced multi-depot fleet dispatch with a precedence constraint: team A must arrive at depot 1 before team B may be credited for arriving at depot 2 (e.g. handoff / supply chain), and the overall window does not permit a single team to do all legs in series.	Watch-And-Help concurrent multi-agent with sequenced sub-tasks, SMAC asymmetric squad sequencing, PERT/CPM precedence-constrained dispatch, supply-chain ordered handoff: depot A before depot B
coordination-staggered-window	Staggered Windows — Two Docks, Two Deadlines, One Fleet	action	rush-hour-arena	Warehouse fulfilment with non-aligned per-order SLAs and one packing line per dock: every dock must be staffed during its own shipping window. A single packer touring one dock then the other (the single-column tour) cannot honour both SLAs — the first dock empties while the packer relocates. The advertised capability is parallel multi-fleet dispatch with hold-on-station, not route-planning.	Multi-robot dispatch where each robot team has a job-specific deadline that does not align with the others': teams must launch staggered (the long-haul team first) so that every team is on its own objective during a bounded common arrival window. Serialising one team behind the other blows the global SLA.	Watch-And-Help concurrent multi-agent, SMAC asymmetric squad timing, MARLÖ multi-objective deadlines, multi-robot dispatch with non-aligned SLAs
custom-map-no-enemy	Pure Navigation — Commit a Route and Reach the Goal Zone Before the Clock (No Enemy)	perception	singles-maginot	Pure spatial navigation inside a confined custom region with no adversary: read the map, commit a route through the bounded playable area, and drive the force to the designated zone BEFORE a tight, reachable deadline. There is no combat — discrimination comes only from perceiving the target and committing decisively. Idling, dithering, or wandering toward the nearest unexplored cell never reaches the zone in time and LOSES on the clock.	Confined-space autonomous navigation under a deadline (warehouse aisle, indoor map): reach a goal cell within a bounded region from a map read alone, fast enough that hesitation fails the mission.
def-bridge-chokepoint	Bridge Chokepoint — defend every crossing before they reach you	action	rush-hour-arena	Geographic chokepoint defense. A barrier (water) channels every attacker through a small number of crossings (bridges). The defender must (a) flip the engagement stance so units fire, AND (b) distribute forces across every crossing — concentrating on one bridge leaves the others wide open and their assets fall.	Perimeter defense around a facility with a moat and discrete gates. Guards posted on standby must be told to engage AND stationed at EACH gate — an undefended gate is a free corridor for the intruder to reach the nearest asset.	ERQA spatial commit (read terrain → assign defenders to chokepoints), MicroRTS chokepoint defense, military doctrine: defense at obstacle-channelized terrain
def-counter-battery	Counter-Battery — Kill the Artillery FIRST	reasoning	generated	A frontline of enemy rifle infantry screens two or three long-range artillery pieces posted in the rear. The artillery out-ranges the base pillboxes and shells the construction yard from a standoff the static defences cannot answer. The mobile tank force must perform a counter-battery strike: drive past the infantry screen and destroy the artillery FIRST, before it razes the construction yard. Trading fire with the screening infantry while the artillery keeps firing loses the base — the highest-impact threat must be engaged first, not the nearest one.	Military counter-battery doctrine and threat prioritisation: when a hostile asset is degrading your objective from beyond your defences' reach, the correct response is a targeted strike that neutralises that asset first, even though a nearer, louder threat is competing for attention. Spending the response on the screening force while the real threat keeps operating is the canonical prioritisation failure — the same shape as a SC2 siege-tank counter, where a mobile force must close on and kill the out-ranging siege line rather than trade with its escort.	military counter-battery doctrine, threat prioritization, SC2 siege-tank counter
def-engineer-repair-under-fire	Engineer Repair Under Fire — Triage the Damaged Proc While Engaging	action	rush-hour-arena	Multiple of your structures are under sustained attrition from a concentrated grenadier-led rusher band, with the refinery (proc) — the most-critical, income-bearing building — exposed at the front of the base. You must keep the proc alive by engaging autorepair (`repair`) the moment damage starts AND commit the pre-placed defenders (hold-fire on spawn) to kill the attackers. Stalling lets the proc die; engaging without repair lets cumulative grenade damage drop the proc before the band is cleared; toggling repair without offensive output never clears the attackers and busts the kill bar / clock. The intended triage is: prioritise the most-damaged AND most-critical building (the proc) for repair, while the defenders do the killing.	Disaster-recovery / SRE triage: a sustained incident damages multiple services and the operator must (a) identify the most-critical-and-most-damaged subsystem (the income-bearing refinery here, the production database in SRE), (b) engage the repair organ on it WHILE (c) the on-call team contains the attacker. Stalling loses the critical subsystem; pure-engage / pure-repair single-axis plays both bust the SLA.	disaster recovery, SC2 SCV repair, repair-order triage
def-evacuation	Evacuate the Doomed Base — Retreat East to the Safe Zone	reasoning	rush-hour-arena	The primary base is being overrun by a heavy assault that structurally cannot be repelled with the units on hand — the fact and proc WILL fall regardless of what the agent does. The correct play is not to fight the doomed-defence (sunk cost) but to EVACUATE the remaining mobile force EAST to a pre-designated safe zone before attrition reduces the column below the survival floor. Win = ≥3 of the 5 starting tanks inside the safe-zone radius, ≥3 still alive, before the deadline. Stalling loses (the assault wipes the immobile tanks at the base); holding loses (the heavy + rocket composition out-trades the 2tnk column); only the prompt EVAC-east order wins.	Business-Continuity-Planning (BCP) evacuation and NFPA / FEMA emergency-management EVAC doctrine — when shelter-in-place is no longer survivable, the policy must abandon the compromised site and route to the pre-designated muster point. Same pattern in cloud-incident response (drain a region under irrecoverable outage to a healthy peer), in wildfire and chemical-spill EVAC, and in SC2 GG-walk macro micro (preserve the army when the base cannot be saved).	BCP evacuation (business continuity planning), emergency management EVAC (NFPA / FEMA muster-point doctrine), SC2 retreat / GG-walk: preserve army when base is doomed, cloud incident region-drain to healthy peer
def-in-depth-vs-single	Defense in Depth — Two Bands Beat One Thick Wall	reasoning	rush-hour-arena	A heavy enemy wave drives straight at your construction yard. The same finite pillbox budget massed into ONE thick wall has no depth: when the wave punches a hole, the survivors sail through the breach and raze the base. The SAME pillboxes split into a FRONT line (which absorbs and thins the first impact) plus a REAR line at greater depth (which shreds the survivors that break through) contains the breach instead of suffering it. Same resources, different topology, qualitatively different survivability. The win predicate makes the topology decision load-bearing: total pillbox count alone is not enough — the pillboxes must be split across two non-overlapping depth bands, the construction yard must survive, and the wave must actually be destroyed.	Layered security architecture (defense-in-depth): a single perimeter firewall, however thick, eventually gets breached; the same defensive capacity deployed as an outer layer plus an inner layer absorbs the breach — a single component failure is contained, not catastrophic. The same logic as a medieval keep behind an outer wall (the attacker must take BOTH), or redundant systems architecture where one failure does not bring down the system.	military defense-in-depth, security layered defense, MicroRTS defense
def-in-depth	Defense in Depth — Two Layers, Not One Thick Wall	reasoning	rush-hour-arena	Defense-in-depth doctrine: against a massed concentrated rush, a front line plus a second line at greater depth shreds the attacker after the first layer falls, whereas a single fortified line — even one with the same brick count — eventually collapses under sustained pressure, letting the attacker raze the base behind it. Picking concentration vs distribution is the reasoning load: the same four pillboxes massed in one cluster lose, the same four pillboxes spread thin in one long line lose, and only a 2+2 layered topology survives.	Multi-layer security architecture (defense-in-depth): a single perimeter firewall, however thick, eventually gets breached; the same defensive capacity deployed as outer + inner layers (each covering the same arc) absorbs the breach. Same total resources, different topology, qualitatively different survivability. Also: medieval keep + outer-wall design (attackers must take BOTH); redundant systems architecture (one component failure ≠ system failure).	military defense-in-depth doctrine, security multi-layer architecture (defense-in-depth), medieval keep + outer wall design, redundant systems architecture
def-multi-direction	Distributed Defense — Split Across Concurrent Lanes	reasoning	rush-hour-arena	Enemy waves attack the centrally-placed base concurrently from multiple cardinal directions. Distributing defenders evenly across each active lane holds every approach; concentrating on one lane leaves the other lanes uncontested — those waves stride untouched to the construction yard and raze it. This is the canonical multi-front / graph min-cut allocation problem: defensive capacity must be matched to the cardinality of concurrent threats, not to the loudest single threat.	Distributed-systems load balancing under N concurrent request streams: each stream needs at least one worker, or its queue builds without bound and the SLA breaks regardless of how well the other streams are served. Likewise air-defence sector coverage: pooling all batteries in one sector leaves the others open. Same total capacity, different topology, qualitatively different system survivability.	distributed-systems load balancing, graph min-cut, military multi-front
def-position-expected-direction	Fortify the Threat Axis — Pre-Position Defence on the Intel Lane	reasoning	def-pos-expected-east-easy-arena	Spatial commitment of a finite defence budget to the CORRECT direction given declared intel. The mission brief states where the enemy will attack from; the agent must invest its pillboxes on that lane (and not the opposite lane). A correct-axis commitment blunts the rush; a wrong-axis commitment lets the rushers walk past and the run loses on the spatial clause. This is the decision drill of pre-positioned defence: not "should I defend?" but "WHERE should I defend, given what intel says?".	Intel-driven facility hardening: a security operator is told the threat axis (a perimeter breach is expected on the EAST gate) and must commit the available barriers / sensors / responders to that axis. Spending the same budget on the WEST gate satisfies no objective. This pack tests whether the policy converts a textual directional cue into a CORRECT spatial commitment.	ERQA spatial commit, MicroRTS terrain-aware defense, military pre-positioned defense, intel-driven facility hardening
def-position-revealed-direction	Scout the Axis, Then Fortify — Intel-Driven Adaptive Defense	reasoning	generated	Adaptive defense under uncertainty: the threat axis is NOT known a priori; the agent must drive its scouts to discover the enemy's forward outpost (where the build-up actually is), then commit pillboxes on THAT lane, then engage. Pre-committing to a single direction without scouting is the classical Maginot-Line failure mode — defenses on the wrong axis are dead weight while the unguarded lane is overrun. The capability under test is the intel-action loop: observation FIRST, commitment SECOND.	Intel-driven facility hardening / military reactive defense: a perimeter operator does not know which gate will be breached; they must dispatch reconnaissance, confirm the threat axis from sensor returns, and then deploy barriers / responders on the confirmed lane. Skipping the reconnaissance step and pre-staging on a guessed lane is correct half the time and catastrophically wrong the other half. The robust doctrine is scout-then-fortify, not commit-on-faith.	PlanBench replanning, military reactive defense, intel-driven defense
def-pre-position-mobile-reserve	Central Reserve — Stage Mobile Force at Centre, Move Out to Intercept	reasoning	rush-hour-arena	A rush is incoming, but the flank it will arrive from is uncertain at decision time. A small mobile armoured reserve is pre-staged at the centre of the engagement zone — equidistant from both candidate flanks. The reserve MUST move OUT of the centre toward the localised threat lane(s) to intercept the rush before it reaches the construction yard. Pre-committing to one flank wins only when the rush arrives on that flank and LOSES otherwise (the reserve cannot pivot back in time); holding passively at the centre lets the auto-engagement kill some attackers but satisfies no forward-zone objective. The canonical answer is centralisation + forward commitment — military "general reserve" / chess centralization doctrine.	Quick-reaction force / on-call incident responder: the responder is stationed at a hub equidistant from candidate incident sites, not committed to any one site in advance. When an alert localises the incident, the responder DEPLOYS FORWARD to that site — staying at the hub or committing in advance to the wrong site both fail the response SLA.	military reserve doctrine (central reserve responds to any flank), chess-piece centralization principle, rapid-response force doctrine, MicroRTS reactive defense from centre
def-reinforce-the-breach	Reactive Defense — Reinforce the Breached Lane	action	rush-hour-arena	A multi-lane defence with a mobile armoured reserve held uncommitted at the centre. At first both lanes face only a light probe, but mid-episode one lane is BREACHED by a heavier wave. The light garrison there cannot hold the breach alone; the agent must react — read which lane was breached — and shift the mobile reserve FORWARD into that lane to reinforce it before the line collapses. Holding the reserve at the centre, or committing it to the quiet lane, lets the breach overrun the line. The pack tests reactive reserve commitment: identify the developing breach and commit the reserve to the sector that is actually being hit.	Incident surge response: an on-call response team is staged centrally, equidistant from candidate sites. When one site escalates into a breach, the surge capacity must be committed to THAT site — promptly and in force. Holding the reserve back or dispatching it to a quiet site both fail the response: the skill is reading where the main effort has localised and committing the reserve there.	military reserve commitment, reinforce the breach doctrine, incident surge response
def-retreat-and-rebuild	Strategic Withdrawal — Retreat from the Forward Base and Rebuild at Depth	reasoning	rush-hour-arena	The forward base is being overrun by a heavy attack that cannot be repulsed inside the tick / cash budget; the right move is to ACCEPT THE LOSS of the forward position, fall back to a deeper safe zone (pre-staged with an MCV and a power plant), DEPLOY the MCV to spawn a new Construction Yard there, and BUILD a refinery so production stands up at depth. Trying to hold the forward base bleeds the defenders and the deadline expires with neither base alive. Stalling, hold-forward, or deploy-only-no-rebuild all lose: only the recognise-and-withdraw policy reconstitutes a live fact + proc post-retreat.	Business-continuity failover and military strategic withdrawal: the primary production site is being lost and heroic defence is infeasible; standing up the cold-standby site (DR site, backup datacentre, fall-back command post) BEFORE the primary fully falls is the goal, not preserving the primary. The policy must read the situation as "lost cause, withdraw" and redirect resources to the standby instead of doubling down.	strategic withdrawal / CICERO retreat doctrine, PlanBench replanning under exogenous loss, business continuity / disaster recovery failover, military trade-space-for-time fall-back
def-stance-mgmt-hold-then-attack	Ambush Trigger Discipline — Lift Stance to Engage the Inbound Rush	action	rush-hour-arena	A defended-overwatch / ambush trigger drill: four medium tanks are pre-staged at a choke between the inbound enemy column and the construction yard. Their standing orders are HoldFire (stance:0) — they will not fire on their own. The mission is to call `set_stance` to lift the engagement authority so all four tanks auto-engage the rush at cannon range and wipe it before it reaches the base. Never flipping the stance leaves the position functionally undefended (the rush walks through unopposed and razes the base); the controllable verb that turns the pre-staged position into an active defence is the stance flip.	A weapon-system safety interlock: a turret / drone-swarm perimeter is pre-staged and aimed at the avenue of approach, but the trigger authority is held by the operator. The operator MUST issue the engagement-authority command for the system to fire on the inbound target; without that command the perimeter is functionally a passive observation post and the target reaches the protected asset. The skill is the operator's decision to LIFT the engagement-authority lock at all — not the precise moment within the engagement window.	military rules-of-engagement (ROE) ambush doctrine, SC2 stance micro — hold-then-attack siege-tank salvo, ambush doctrine — engagement authority lifted on trigger
def-surprise-flank-react	Surprise Flank — Intel Said North, Enemy Came South: Detect and Redeploy	reasoning	generated	The mission brief states the enemy will attack from the NORTH and the terrain reinforces it: a rocky band on the north edge forms a natural chokepoint, with two pillboxes pre-built to cover it. The four pre-placed defenders are distributed around the construction yard — not yet committed to either lane. MID-EPISODE a rusher band actually arrives from the SOUTH (the open flank) via a scheduled spawn. The decision is whether the agent trusts the announced direction + terrain cue and watches its construction yard get razed, or detects the actual attack axis from the spawn event, leaves the wrong-axis pillboxes silent, and commits the defenders SOUTH to intercept the real rush before it reaches the fact.	Misdirection-aware defence / military adversarial robustness on a directional axis: a fortified flank that the enemy has bypassed must be ABANDONED in favour of the actual contact axis, even though doing so contradicts the announced plan and "wastes" the pre-built fortification. The classical failure mode this trains against is "fighting the plan, not the enemy": the operator sticks to the announced threat axis even after the ground-truth signal contradicts it. Robust security operations require that the response policy treat announced threat intelligence as a PRIOR, not a guarantee, and commit to the observed axis when the two diverge.	adversarial robustness / surprise-axis reaction, opponent-modeling: detect deception, military reactive defense / flank discipline, CICERO deception handling
def-tower-line-vs-cluster	Defense Topology — Match Towers to Threat Geometry (Line or Cluster)	reasoning	rush-hour-arena	Defensive structure placement is geometry-sensitive: a wide-front attack across many rows demands a LINE that covers each row; a concentrated thrust through one chokepoint demands a CLUSTER of overlapping fire at the choke. Both topologies waste budget on the wrong geometry — a cluster against a spread attack leaves the flanks open; a line against a funnel under-defends the choke. The capability is reading the attacker's forcing geometry from the map and the threat composition, then committing the defense topology that matches.	Firewall / WAF rule placement. When traffic enters across many paths (broad ingress, microservices), the right architecture is a rule per path — a layered spread. When all traffic MUST traverse a single ingress (one API gateway, one bridge), the right architecture is dense layered inspection AT that point. A defender who deploys the wrong topology for the wrong threat surface either overspends on one path and leaves others open, or under-protects the actual ingress.	graph theory min-cut (concentrate defenses at chokepoints), military bunker placement doctrine, firewall topology: dense at chokepoints, perimeter doctrine (cover the full approach width), Lanchester defense concentration

action-multiunit-coordination

Multi-Squad Coordination — Drive Several Groups to Different Objectives at Once

action

rush-hour-arena

Multi-region task allocation under a shared deadline is not a planning problem here — the split is obvious. It tests whether the controller can actually drive several effector groups in parallel instead of completing one objective then the next, which serialized control fails to do in time.

Coordinated multi-robot fleet dispatch: a logistics swarm must place distinct sub-teams at separate depots within one delivery window; one-at-a-time control blows the schedule.

action-sequenced-execution

Non-Stop Ordered Route Execution — Run Given Waypoints In Order Without Stalling

action

rush-hour-arena

The route is already planned; the open problem is faithful, non-stalling execution of an ORDERED objective sequence — reaching each waypoint in turn and issuing the next move the instant a leg completes, rather than idling, re-deliberating, skipping a waypoint, or rushing straight to the end.

Manipulator / mobile-base task sequencing: a robot must hit a fixed ORDERED set of stations (pick → transit → place) within a cycle-time budget; pausing, reordering, or skipping a station misses the takt time.

adv-asymmetric-weaker-must-win

Asymmetric Underdog — 2 Mediums Must Beat a Stronger Garrison via Flank-Pick

reasoning

generated

Two medium tanks must inflict enough kills on a stronger enemy garrison (a wall of rifle infantry, plus a heavy tank on the medium and hard tiers) to satisfy a kill bar — without losing the pair. Head-on combat is a decisive LOSS: the heavy out-trades the medium column at close range and the infantry wall holds the pair in place long enough for the heavy to close. The only winning play is asymmetric: stage off-axis, approach the infantry on a flank that keeps the mediums outside the heavy's aggro envelope, pick off the infantry one at a time from the flank, and retreat past the heavy's leash any time it lunges. The skill measured is weaker-side reasoning: which sub-target to commit to (the soft infantry, not the hard heavy), which approach vector keeps the column outside the strongest threat's fire envelope, and when to disengage before being fixed.

Asymmetric / guerrilla doctrine for a force-on-force multi-robot encounter: a numerically and qualitatively weaker team must refuse decisive engagement with the strongest opposing asset, exploit terrain and mobility advantages, and pick off weaker support assets piecemeal where the local force ratio is favourable. The decision under test is approach-vector planning (stay outside the strongest threat's effective radius) and sub-target prioritisation (engage soft targets, not hard ones).

SC2 asymmetric, asymmetric warfare, guerrilla tactics

adv-rps-counter-pick

Scout-then-Counter — Hard Counter Selection Across Seeded Enemy Compositions

reasoning

generated

A FIXED budget funds EITHER an armoured fist (medium tanks 2tnk), OR a rocket squad (e3 anti-tank), OR a rifle company (e1 mass). The enemy archetype ROTATES per seed across four compositions (rifle swarm / heavy tank column / rocket cluster / second rifle swarm) spanning three threat types, each type demanding a DIFFERENT counter. Pre-committing to one composition LOSES on at least one seed. The capability under test is the observation-driven counter SELECTION loop: scout the enemy with the starter jeeps, infer the threat profile, then commit the whole budget to the matching counter. There is no dominant single build — the right answer depends on what the scout reveals.

Threat-driven procurement under uncertainty: a fixed capital budget funds anti-armour platforms OR anti-infantry platforms OR foot soldiers; telemetry from forward scouts must confirm the threat profile BEFORE the commit. Pre-committing to an anti-tank loadout against an infantry threat (or vice-versa) is a cost-per-effect waste that loses the engagement. The decision under test is the read-then-commit loop, not a single memorised build.

SC2 hard-counter / scout-and-pivot doctrine, game-theoretic best response under composition uncertainty, military RPS (tank-vs-infantry / infantry-vs-rocket / rocket-vs-tank), capability-based defense procurement under uncertainty, PlanBench replanning on new observation
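
The read-then-commit loop in this row can be sketched as a lookup keyed on what the scout saw. This is a minimal illustration, not the benchmark's API: the archetype labels and `pick_counter` are hypothetical, and the counter table reads the rock-paper-scissors relation listed above as tank beats infantry, infantry beats rocket, rocket beats tank.

```python
# Hypothetical archetype labels; the unit codes (2tnk, e3, e1) come from the row above.
COUNTER = {
    "rifle_swarm": "2tnk",        # tanks beat massed infantry
    "heavy_tank_column": "e3",    # rocket soldiers beat tanks
    "rocket_cluster": "e1",       # rifle infantry beat rocket soldiers
}

def pick_counter(scouted):
    """Return the unit code to buy, given a tally of scouted enemy archetypes."""
    if not scouted:
        # No observation yet: committing now would be a blind pre-commit.
        raise ValueError("scout first: no observation, no commit")
    dominant = max(scouted, key=scouted.get)  # most numerous archetype seen
    return COUNTER[dominant]
```

Raising on an empty tally mirrors the row's point that committing the budget before the scout reports is the failure mode.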

adversarial-duel

1v1 Combat Duel — Beat a Reactive Enemy at Close Range

adversarial

generated

A close-quarters force-on-force fight against an enemy that shoots back: the decision is combat micro — concentrate fire on one target, retreat damaged units, trade favourably. The enemy starts just to the east, in or near sight, so this is a DUEL, not a search (finding scattered enemies on a big map is tested separately). Difficulty escalates the opponent, not the search.

Adversarial multi-agent engagement at close range: prevail over a reactive opponent team by target selection and damage trading.

artofwar-indirect-approach

The Long Way Round — Take a Costly Detour Because the Direct Path Is Lethal

reasoning

generated

迂直之计 — "make the devious route the most direct." A LEASHED GUARD line (engine scripted bot 'guard') walls the short west→east lane to the objective: every guard auto-fires in range, lunges at any unit that comes within its aggro radius, and snaps back to post. Driving the column straight down that lane gets the whole force shot apart in a turn. The only winning policy abandons the short axis entirely — climb to the open flank, run the long way around the END of the wall, then turn in — a route whose progress toward the objective stays flat or dips for ~20 turns before it pays off. Tests temporal credit assignment over a long horizon against a greedy shortest-path / "just charge it" bias; idling never commits and loses on the clock.

Hazard-aware long-horizon routing: reject the locally optimal short path through a lethal interdiction zone for a circuitous survivable one (around the obstacle's flank) whose payoff is far delayed.

artofwar-lure-the-tiger

Lure the Tiger Off the Mountain — Pull the Strong Defender Off the Lane, Then Run the Main Force Past It

reasoning

generated

调虎离山 — "lure the tiger off the mountain." A STRONG leashed GUARD line (engine scripted bot 'guard') walls the only lane to the objective: every guard holds its post, auto-fires anything in range, lunges at the nearest foe within its aggro radius, and snaps straight back to its post the instant nothing threatens it. Driving the durable main body straight down the lane runs it into the full line (anti-tank rockets) and it is shot apart before it gets through. The only winning policy commits the fast bait on a DIVERGENT vector that comes just close enough to the line to make a segment of it lunge OUT after the bait — off the lane, toward its leash — then, in that transient window while the tiger is off the mountain, runs the main body through the briefly-open slot to the far objective before the line snaps back. Phase 1 (pulling the tiger off) yields ZERO objective progress and looks locally negative (the bait draws all the fire); only phase 2, through the vacated slot, scores. The distinction from decoy-sacrifice: here the credit problem is DISPLACE-AND-EXPLOIT — the bait must lure the defender off and the main force must exploit the *reversible* vacancy in time; preserving the force is what is rewarded, not spending a sacrificial decoy. Greedy "just charge the lane" and "do nothing" both fail; a bait that never gets close enough never displaces the tiger.

Two-phase manipulation against a reactive guarding obstacle: a no-reward enabling action (entice a leashed interceptor off the only corridor with a decoy on a divergent vector) must precede and temporally overlap the rewarded objective action (run the payload through the briefly-cleared corridor before the interceptor snaps back).

artofwar-sequenced-citadel

Staged Assault — Reach the Waypoints In Order, Then Seize the Citadel

reasoning

generated

攻其無備 — strike where unprepared, but only after the prerequisite moves. The mission is a STRICT ordered sub-goal chain: stage at A, then transit B, then seize the citadel C — and only after a deliberate hold (the strike is timed, not rushed). Reward lands only at C; A and B are unrewarded prerequisites whose ORDER is enforced (reaching C without having passed A then B counts for nothing). Long-horizon sub-goal sequencing with delayed terminal credit; a greedy beeline to C, or idling, fails.

Ordered multi-waypoint mission (stage → transit → objective) where only terminal success is rewarded, step order is hard-enforced (not merely graded), and the terminal action must wait out a hold — the Blocksworld/GAIA-style long-horizon plan.
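
A hedged sketch of how a strictly ordered sub-goal chain with a timed hold could be checked. The function name, the `hold_ticks` / `min_hold` parameters, and the way the hold is measured are assumptions for illustration; the row does not specify the actual predicate.

```python
def citadel_reached(visits, hold_ticks, min_hold):
    """True only if A, then B, then C were reached in order and the hold was served.

    visits: waypoint labels in the order the force touched them.
    hold_ticks / min_hold: hypothetical names for the pre-strike wait.
    """
    order = ["A", "B", "C"]
    stage = 0  # index of the next waypoint the latch is waiting for
    for waypoint in visits:
        if stage < len(order) and waypoint == order[stage]:
            stage += 1  # the latch advances only on the expected waypoint
    return stage == len(order) and hold_ticks >= min_hold
```

Under this sketch a beeline to C leaves the latch at stage 0, so touching C early earns nothing, which matches the "order is enforced" wording above.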

build-defensive-skirt-corners

Defense Topology — Skirt the Building with a Pillbox in Every Corner

reasoning

generated

A single high-value building (the fact — your construction yard) sits at the centre of the map and is rushed CONCURRENTLY from all four diagonal corners (NE, NW, SE, SW). Right doctrine: a SKIRT of pillboxes — one planted in EACH of the four corner approaches — so every diagonal axis has its own overlapping field of fire. Wrong doctrine: massing every pillbox on a single corner; that satisfies a naive "we built four defences" count and holds one corner, but the three uncovered corner waves stride untouched into the fact and raze it. The win predicate makes the topology load-bearing: total pbox count is not enough — one pbox must sit inside a radius-4 disc in EVERY corner region, the pillboxes must KILL the rush, AND the fact must survive.

Distributed defense / quadrant coverage: when one central asset is threatened from all bearings, finite defensive capacity must be matched to the cardinality of the threat axes — a strongpoint in every quadrant — not massed on the loudest single approach. The same logic as air-defence sector coverage or a multi-front perimeter: an uncovered sector is an open envelopment lane no matter how heavily the others are defended.

MicroRTS pillbox placement, distributed defense, military quadrant doctrine
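
The topology part of the win predicate (one pillbox inside a radius-4 disc in every corner region) can be sketched as an all/any test. The corner centres and the function name are hypothetical; the kill and survival clauses of the real predicate are not modelled here.

```python
import math

def skirt_covered(pillboxes, corner_centres, radius=4.0):
    """True if every corner region has at least one pillbox within `radius` cells."""
    return all(
        any(math.dist(p, c) <= radius for p in pillboxes)
        for c in corner_centres
    )
```

Four pillboxes massed on one corner satisfy a raw count of four but fail this check, because three of the corner discs stay empty.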

build-defensive-tower-cluster

Defense Topology — Cluster Pillboxes Around the High-Value Building

reasoning

generated

A single high-value building (the fact — your construction yard) is the protected asset, and a concentrated enemy band funnelled through a single narrow corridor charges directly at it. Right doctrine: a TIGHT CLUSTER of pillboxes WRAPPING the fact, overlapping fields of fire on the single point that matters. Wrong doctrine: a thin LINE of pillboxes strung between the fact and the corridor mouth — the line satisfies a naive "we built N defences" count but cannot stop a focused rush from reaching the fact, because the firepower is too dispersed to mass on the threat. The win predicate makes the topology decision load-bearing: total pbox count alone is not enough; ≥3 of the pillboxes must sit INSIDE the radius-4 disc around the fact, AND the fact must survive.

Critical-asset protection / defense-in-depth: when ONE node is the irreplaceable principal (the fortress keep, the SAM hub, the data-centre hardened core, the ambassador's residence), the right architecture is dense overlapping enforcement AT that asset, not a thin uniform perimeter that is everywhere but never massed where the threat actually comes. The same logic that gives you a single hardened crypto module behind layered firewalls — concentrate coverage AT the asset, not a uniform spread.

ERQA, military strongpoint, asset protection, defense-in-depth around critical infrastructure
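
A minimal sketch of the cluster clause described above: at least three pillboxes inside the radius-4 disc around the construction yard, and the yard still alive. The function name and coordinates are hypothetical.

```python
import math

def cluster_ok(pillboxes, fact, fact_alive, radius=4.0, needed=3):
    """Win-check sketch: enough pillboxes inside the disc AND the yard survives."""
    inside = sum(1 for p in pillboxes if math.dist(p, fact) <= radius)
    return fact_alive and inside >= needed
```

A thin line strung out toward the corridor mouth can contain more pillboxes than a tight wrap and still return False, since none of them fall inside the disc.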

build-defensive-tower-line

Build a Defensive Tower LINE Across a WIDE Front (Not a Cluster, Not a Scatter)

reasoning

rush-hour-arena

Where do you commit your defensive towers when the threat is a rush spread across the FULL WIDTH of the map — not pinched through a single corridor cell, but advancing on every row of a wide front? Military perimeter doctrine and firewall rule design both say: cover EVERY lane across the front, one post per row, so no enemy unit can slip past on an unguarded row. A single dense cluster on one row wastes overlapping fire on one cell while every other row stays open; a scatter near the base never engages the rush at all. The win predicate makes the LINE topology load-bearing — total pillbox count alone is not enough; ≥1 pillbox must sit on EACH of the front's rung rows (cell-exact via radius 0.5), AND those pillboxes must actually KILL the rush spread across the front.

Network firewall / Web Application Firewall rule placement: when every protocol/port could be the path of compromise, the right architecture is one rule per port across the full inspection surface, not three duplicated rules on one port while the rest stay open. Likewise a physical perimeter patrol covers EVERY approach lane across the front — a cluster at one waypoint or a scatter across unrelated nodes both leave the actual front lanes traversable. Defense in depth across a wide approach demands one responder per lane, not many responders at one waypoint.

ERQA, MicroRTS defense, military perimeter

build-engineer-rebuild-after-loss

Build-Engineer: Rebuild Power Plant After Mid-Episode Loss

reasoning

rush-hour-arena

Reactive replan after exogenous loss of a critical-infrastructure building (the Power Plant) mid-episode. An ongoing operation has its power-grid building unexpectedly destroyed by an enemy strike in the opening; the agent inherits a complete production chain (Construction Yard + Refinery + Power Plant + War Factory + Ore Truck + ore patch) and a reserve cash budget; the deadline is real. The model must notice the destruction (the live building count for the Power Plant drops to zero), commit the reserve to rebuilding the Power Plant (not to army units, not to a different structure, not to nothing) AND place it adjacent to the surviving Construction Yard so production stays online — fast enough that the rebuild closes the happened-before latch before the deadline bites.

Disaster recovery after a critical-infrastructure node is destroyed by an external event mid-mission. The autonomous operator inherits a working processing site, a reserve allowance sufficient for exactly one replacement substation, and a strict operational deadline. Recovery requires identifying the lost asset, committing the reserve to rebuilding the same kind of asset (not to defenders, not to expansion, not to nothing), and siting it safely — close enough to the surviving site that the powered loads stay online, far enough from the lingering threat that the new asset survives.

PlanBench replanning under exogenous loss, disaster recovery, SC2 rebuild-after-trade

build-power-down-defensive

Power Down Non-Essentials Under Load — Reversible Load-Shedding

reasoning

rush-hour-arena

Grid load-shedding under a generator-down state: the agent starts with insufficient generation to cover its installed load (one Power Plant supplies 100 power; the installed drains total ~140), so the grid runs at NEGATIVE surplus from tick 0 (the engine slows production by 50% in this state). The agent must REVERSIBLY shed non-essential loads via `power_down` (toggle the building off) so that provided ≥ drained — WITHOUT selling structures it will need later. Selling a structure fails the `has_building` clause that requires it to still be standing; powering down the lone Power Plant collapses provided power and fails the `power_provided_gte:100` floor; stalling never restores surplus and times out.

Datacentre load-shedding when one generator trips: the operator must turn OFF non-critical lines (cosmetic lighting, secondary HVAC) so the surviving generator can carry the essential ones (life safety, core compute), WITHOUT decommissioning the shed equipment (it will be re-enabled the moment the second generator comes back). The correct primitive is a reversible disable, not a destructive one — this is exactly the `power_down` toggle vs `sell`.

PlanBench reversible-action planning, operations runbook / load-shedding, SC2 power management, lmgame-Bench resource-constraint reasoning
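
A sketch of the load-shedding arithmetic under stated assumptions: the building names and individual drain values below are placeholders chosen to total the ~140 mentioned above, and `shed_loads` is a hypothetical helper, not an engine call. It only decides which buildings to `power_down`; it never sells anything.

```python
def shed_loads(provided, drains, essential):
    """Pick buildings to power down until drained <= provided.

    drains: building -> power drain (placeholder numbers in the example).
    essential: buildings that must stay powered.
    """
    drained = sum(drains.values())
    to_power_down = []
    # Shed the largest non-essential drains first to minimise the number of toggles.
    for name in sorted(drains, key=drains.get, reverse=True):
        if drained <= provided:
            break
        if name not in essential:
            to_power_down.append(name)
            drained -= drains[name]
    return to_power_down, drained

drains = {"refinery": 30, "war_factory": 30, "radar": 40, "barracks": 20, "pillbox": 20}
plan, remaining = shed_loads(100, drains, essential={"refinery", "war_factory"})
```

With these placeholder numbers one toggle is enough: powering down the 40-drain building brings the load to exactly 100, and every structure is still standing for the `has_building` clauses.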

build-power-online-first

Build Power Online First — Grid Bring-Up Sequencing

reasoning

rush-hour-arena

Standard-operating-procedure (SOP) compliance under a strict happened-before constraint, exemplified by electrical-grid bring-up: the grid (the Power Plant) must be ONLINE before any powered load (the Refinery, Barracks, War Factory) can come up. The opening decision is the entire test — from a pre-placed Construction Yard and a tight budget, the agent must queue the Power Plant FIRST and only THEN the first production / economy building. A model that "just builds the goal building" (the refinery) first finds the order silently blocked (the engine enforces powr as a prerequisite of proc), wastes the budget retrying, and times out. A model that stalls or builds an army instead also misses the bar.

Standard operating procedure with a hard happened-before constraint: a robotic operator commissioning a new processing site must energise the substation BEFORE bringing up the refinery (or any other powered load). Attempting the refinery first is rejected by the interlock; reasoning about the order of operations from the spec is the entire decision.

PlanBench, SOP compliance, electrical-grid bring-up
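
The happened-before constraint can be illustrated by walking a build queue and reporting the first order the interlock would reject. The prerequisite table follows the row above (power before any powered load); `first_blocked` and the `barracks` label are hypothetical names for this sketch.

```python
# Powered loads named in the row above all require the Power Plant (powr).
PREREQ = {"proc": {"powr"}, "barracks": {"powr"}, "weap": {"powr"}}

def first_blocked(queue, standing=("fact",)):
    """Return the first building whose prerequisites are not yet built, else None."""
    built = set(standing)
    for building in queue:
        if not PREREQ.get(building, set()) <= built:
            return building  # this order would be silently blocked
        built.add(building)
    return None
```

Queuing the refinery first returns `"proc"` as the blocked item, while queuing `powr` and then `proc` returns `None`.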

build-production-throughput-multibuilding

Parallel Production Throughput — Build a Second War Factory to Hit the Quota

reasoning

rush-hour-arena

A delivery deadline forces a manufacturing-throughput decision. A fixed quota of vehicles must ship before a tight clock, and the agent starts with exactly ONE production line (war factory). A single serial line cannot clear the quota in time. The agent has cash for a SECOND identical line; building it and feeding both queues roughly doubles the per-tick output rate (parallel manufacturing — a second assembly line does not split the job, it doubles the service rate). The pack frames the classic queueing-theory call: when one server cannot meet the deadline, add a second server in parallel.

Fleet dispatch throughput under a delivery SLA: a warehouse must ship N orders before a deadline, but a single loading dock (one serial server) clears jobs too slowly to meet it. The operator has budget for a second identical loading dock; commissioning it and routing jobs to both docks doubles the dispatch rate. The decision is a parallelism call — recognise that the single server is the bottleneck and add a second server, rather than pushing harder on the one that is already saturated.

queueing theory, SC2 multi-factory throughput, manufacturing parallelism
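
The parallelism call reduces to simple arithmetic. The sketch below assumes each factory builds one unit at a time at a fixed rate and ignores the time spent constructing the second factory; the quota and per-unit time are placeholder values.

```python
import math

def ticks_to_quota(quota, factories, ticks_per_unit):
    """Ticks until `quota` vehicles exist when each factory builds one unit at a time."""
    return math.ceil(quota / factories) * ticks_per_unit

one_line = ticks_to_quota(6, 1, 300)   # single serial server
two_lines = ticks_to_quota(6, 2, 300)  # two servers fed in parallel
```

If the deadline sits between the two results, only the two-factory plan can meet the quota, which is the decision the row describes.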

build-rally-point-management

Set the Rally Point Forward — Production Logistics Under a Deadline

action

rush-hour-arena

Production logistics / warehouse SLA: a factory whose default ship-to is the loading dock will pile inventory at the dock — pieces never reach the field. The decision is a single ACTION — set the production building's RALLY POINT (its "ship-to" address) to a FORWARD staging area at the front, BEFORE the production clock starts burning. Once the rally is set, every subsequent freshly-built unit auto-routes to the front; the operator does not have to micromanage each batch. Without that one-shot reset, the default rally is right next to the building and the units pile at the base ~38 cells from the action, never engage in time, and the SLA (the tick deadline) is missed.

Production-pipeline standing-order routing: a manufacturing or fulfilment system whose default routing rule sends finished goods to the dock will starve the downstream customer; a single standing-order change ("ship every unit produced at this site to dock B at the forward distribution centre") re-routes the entire pipeline. This is also the SC2 rally-point primitive — one click on the production building, every subsequent unit walks to the rally cell — and the analogue is exactly what real-world logistics calls "set the default ship-to".

SC2 rally management, production logistics, warehouse SLA

build-repair-priority-under-fire

Repair Priority Under Fire — Triage the Refinery, Not the Decoy

reasoning

rush-hour-arena

Three of your structures are under simultaneous attrition and you have one repair lever to commit. The damage picture is deliberately misleading: the pillbox (pbox) is pre-damaged to ~30% HP and LOOKS the most damaged, but it is low-value, heavily armoured, and the grenadiers barely scratch it — it survives on its own. The refinery (proc) starts intact but is on a LETHAL trajectory: its grenadier band kills it within a few turns unless you toggle `repair` on it immediately. The intended decision is criticality-weighted triage: rank by value × lethal trajectory, not by raw damage percent. Repair the proc first (and on the harder tiers the war factory too); skipping the refinery to "fix the worst-looking building" loses it, and stalling loses it.

SRE / disaster-recovery incident triage: several services are degraded at once and the operator has one repair lever. The correct move is to rank by criticality × blast radius, not by whichever dashboard is reddest — restore the income-bearing / load-bearing subsystem that is about to fail hard before touching a loud-but-low-value one. The same criticality-weighted prioritisation governs SC2 SCV-repair target choice and aircraft / plant preventive maintenance.

disaster recovery triage, criticality-weighted maintenance, SC2 SCV repair
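
One way to make "value times lethal trajectory" concrete is to divide a structure's value by its estimated turns to death. This scoring rule and every number below are assumptions for illustration; the row does not state how the scenario weighs the two factors.

```python
def urgency(value, hp, damage_per_turn):
    """Value divided by turns-to-death; zero if the structure is not losing HP."""
    if damage_per_turn <= 0:
        return 0.0
    turns_to_death = hp / damage_per_turn
    return value / turns_to_death

def repair_first(structures):
    """structures: name -> (value, hp, damage_per_turn). Returns the most urgent name."""
    return max(structures, key=lambda name: urgency(*structures[name]))

# Placeholder numbers: the pillbox looks worst (30 HP) but barely takes damage.
structures = {
    "pbox": (1, 30, 0.5),
    "proc": (10, 100, 25),
    "weap": (8, 100, 10),
}
most_damaged = min(structures, key=lambda name: structures[name][1])
```

Ranking by raw HP picks the pillbox, while the urgency score picks the refinery, which is the contrast the row is built around.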

build-sell-and-rebuild-elsewhere

Sell and Rebuild Elsewhere — Recoup Capital, Relocate Production

reasoning

rush-hour-arena

A forward refinery (proc) sits in the path of an incoming enemy hunt band that will raze it within ~25-30 turns; starting cash alone does not cover building a new proc at the safe target region. The only path to a fresh proc inside the tick budget is to SELL the exposed proc (recouping 50% of its build cost) and use the refund plus starting cash to BUILD a new proc and PLACE it at the safe target region far from the rush. Stalling, building without selling (cash gated), and placing the new proc in the wrong region all lose; only sell-then-rebuild-at-safe-region wins.

Liquidate a deteriorating asset to fund a relocation: a forward production node is about to be lost to environmental damage, and the capital reserve alone is insufficient to commission a replacement node elsewhere. The right move is a deliberate salvage of the at-risk node (recovering ~half the build capital in liquid form) which, combined with the on-hand reserve, funds a new node at a safer site BEFORE the original is lost for zero recovery. Letting the asset be destroyed loses 100% of its capital; salvage-and-redeploy recovers 50% and funds the new site.

capital reallocation, SC2 sell mechanic, financial reallocation
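
The cash gate in this row is a one-line inequality. In the sketch below the 50% refund rate comes from the row, 1400 is the refinery price quoted elsewhere in this table, and the starting cash of 900 is a placeholder; `relocation_options` is a hypothetical helper.

```python
def relocation_options(cash, build_cost, refund_rate=0.5):
    """Compare building a new proc with and without first selling the old one."""
    refund = build_cost * refund_rate
    return {
        "build_without_selling": cash >= build_cost,
        "sell_then_build": cash + refund >= build_cost,
    }

options = relocation_options(cash=900, build_cost=1400)
```

With these numbers the reserve alone falls short (900 < 1400), while the reserve plus the 700 refund covers the new refinery.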

build-sequence-tech-cheapest

Cheapest War Factory — Cost-Minimal powr → proc → weap Build Order

reasoning

rush-hour-arena

Cost-minimal build-order planning under a fixed, non-replenishing budget: the agent must reach the war factory (`weap`) by spending the LEAST cash on the ONLY affordable prerequisite chain (powr → proc → weap). There is no ore and no income — the starting cash is the entire budget, tuned to exactly the cost of the minimal path. Any detour through a non-load-bearing structure (a barracks, a pillbox, an early infantry unit) bloats the bill of materials and exhausts the budget; the war factory can then never be funded. Tests that the model can plan the minimum-COST prerequisite chain — not merely SOME plan that arrives — under a budget that only the cost-minimal plan can afford. Sibling of build-sequence-tech-fastest (the time-optimal axis); here money, not the clock, is the constraint.

Capital-minimal commissioning of an autonomous manufacturing cell under a fixed procurement budget: the cell must bring a target machine online by buying only the strictly required upstream stations (power → feedstock → assembly). The budget is fixed and non-replenishing; procuring a non-required station (an extra quality bench, a spare buffer) spends capital that the assembly station then cannot be funded with. Only the minimum-cost precedence chain stays within budget.

PlanBench cost-optimal, BOM cost minimization, budget-constrained planning
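
A sketch of why only the minimal chain fits the budget. The powr → proc → weap chain is from the row above; 300 and 1400 are the power plant and refinery prices quoted elsewhere in this table, while the `weap` and `barracks` prices are placeholders, and both helper functions are hypothetical.

```python
# Prerequisite chain from the row above; "barracks" is a hypothetical detour.
PREREQ = {"powr": [], "proc": ["powr"], "weap": ["proc"], "barracks": ["powr"]}
# The weap and barracks prices are placeholders, not taken from the source.
COST = {"powr": 300, "proc": 1400, "weap": 2000, "barracks": 400}

def build_chain(target):
    """Return the target's prerequisites followed by the target, in build order."""
    order = []

    def visit(building):
        for pre in PREREQ[building]:
            visit(pre)
        if building not in order:
            order.append(building)

    visit(target)
    return order

def plan_cost(plan):
    return sum(COST[b] for b in plan)

minimal = build_chain("weap")
budget = plan_cost(minimal)  # budget tuned to exactly the minimal path
detour = ["powr", "barracks", "proc", "weap"]
detour_affordable = plan_cost(detour) <= budget
```

Because the budget equals the cost of the minimal chain, any plan that adds even one extra structure exceeds it.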

build-sequence-tech-fastest

Fastest War Factory — Cost-Optimal powr → proc → weap Build Order

reasoning

rush-hour-arena

Time-optimal build-order planning under a tight deadline: the agent must reach the war factory (`weap`) on the shortest prerequisite path (powr → proc → weap). Any detour through unneeded structures (a barracks, a second power plant, an early infantry training queue) lengthens the build queue and overruns the time budget. Tests that the model can plan the minimum-time prerequisite chain — not merely SOME plan that eventually arrives — under a deadline that only the optimal plan satisfies.

Critical-path planning in autonomous manufacturing: a cell must bring a target machine online by a fixed cycle-time, choosing the minimum set of upstream stations to commission first (power → feedstock → assembly). Adding non-load-bearing stations to the ramp-up plan (a non-required quality station before assembly) blows the deadline; only the cost-optimal precedence chain meets spec.

PlanBench cost-optimal, BOM manufacturing

build-sequence-tech-most-resilient

Resilient War Factory — Redundant Power Survives a Strike (N+1 Build Order)

reasoning

rush-hour-arena

Robust build-order planning: reach a tech capability AND keep it through a foreseeable disturbance. The agent must bring a war factory online and field an armoured force, but a mid-episode enemy strike razes one power plant. A build order that provisions only a single power plant is a single point of failure — when the strike lands the factory drops to low power and the army never completes in time. The resilient build order pre-builds a second, redundant power plant before the strike, so the grid stays in surplus and production never slows. Tests whether the model plans for the disturbance (N+1 redundancy on the critical prerequisite) rather than merely planning the shortest path to the goal.

N+1 redundancy on a critical utility. An autonomous production cell depends on a power feed to run its assembly machine; a known hazard will knock out one feed mid-shift. Resilient planning commissions a second, independent feed BEFORE the outage, so the assembly machine never drops below rated throughput. Provisioning only one feed — the shortest plan to first article — halts the line the moment the hazard strikes and blows the delivery deadline.

PlanBench robust planning, N+1 resilient design, redundancy
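
The N+1 argument is a surplus calculation before and after the strike. The 100 power per plant matches the figure given in the load-shedding row of this table; the drain of 80 and the function name are placeholders for illustration.

```python
def surplus_after_strike(plants, drain, per_plant=100, plants_lost=1):
    """Power surplus once the strike has razed `plants_lost` power plants."""
    return (plants - plants_lost) * per_plant - drain

single = surplus_after_strike(plants=1, drain=80)     # shortest-path build
redundant = surplus_after_strike(plants=2, drain=80)  # N+1 build
```

A negative result means the factory drops to low power after the strike; the redundant build stays at or above zero.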

build-tech-skip-decision

Skip the Unneeded Tech Tier — Clear a Light Garrison with a Basic e1 Swarm

reasoning

rush-hour-arena

The objective only requires basic units. The agent starts with a Construction Yard and an Allied barracks already standing, so cheap rifle infantry (e1) are trainable from turn 1 with no prior tech step, and a light enemy garrison — also basic infantry — is incoming. An e1 swarm clears a light infantry garrison comfortably. The trap is to climb the full tech chain (power plant → refinery → war factory → service depot → medium tanks) to bring a higher tech tier online that the objective never asked for: that whole tier costs budget and, critically, clock — by the deadline no tank has been fielded at all. The pack tests whether the model recognises that the goal does not require high tech and skips straight to the cheap path.

Right-sizing the plan to the task. A delivery only needs a basic pick-and-place cell that is already commissioned; provisioning a full high-precision assembly line (extra power, feedstock, an assembly station, a calibration depot) before starting work is a whole capability tier the job never required — it burns budget and blows the cycle-time deadline. The competent planner prunes every step the goal does not consume and executes the minimal plan with the capability already on hand.

PlanBench unnecessary-step pruning, lean process, YAGNI

building-and-planning

Base Building — Construct Structures Respecting the Tech Tree

reasoning

rush-hour-arena

Construction planning under dependency and spatial constraints: decide a build order that respects engine-enforced tech-tree prerequisites (a barracks needs power; a pillbox needs the barracks), place structures where the objective requires (a defended direction), and when needed creep the base across the map to found the defensive line in a designated far region. The decision is the plan — order, placement, and relocation — not the motor control of any single build. The prerequisite is genuinely enforced by the engine: a power-less barracks NEVER completes, so a greedy "build the goal building first" policy cannot win.

Autonomous construction / facility-layout planning: a task-graph with prerequisites (B needs A, C needs B) plus spatial goals (assemble in zone Z; relocate and found the depot near region R) under a time budget — out-of-order or mis-placed construction never satisfies the goal.

combat-attack-from-behind-fog

Reasoning — Attack from Behind via a Fog-Bypass of the Frontal Line

reasoning

rush-hour-arena

Four medium tanks (2tnk) at the west edge face a tight vertical wall of pillboxes (pbox) and anti-tank rocket soldiers (e3) at x=50, spanning y=15..25 and facing west. The objective is an UNDEFENDED enemy construction yard (fact) at (100,20) — behind the line. Charging head-on along y=18..22 puts the lead tank inside the kill envelope of 4+ defenders simultaneously and the column never clears the line in time to raze the fact. The winning play is the FOG FLANK: route the strike force to the far north (y=2) or far south (y=38) — well outside the line's range AND vision — drive east past the line's longitude, then turn inward to descend on the fact from BEHIND. Same enemy, same forces, but the route of approach controls whether the line's prepared fields of fire can engage at all.

Spatial-reasoning under prepared defenses: an intelligent attacker refuses the defender's chosen geometry (the frontal kill envelope) and routes via a corridor outside the defender's sensor + weapon coverage to reach an undefended high-value target. The decision under test is route planning — recognise that a static defensive line is fixed and bypass-able, not that it must be reduced face-to-face.

SC2 hidden assault, military surprise attack, fog warfare

combat-bait-counter-attack

Bait + Counter-Attack — Pull the Guards Off Post, Strike the Undefended Yard

reasoning

rush-hour-arena

A reactive defending cluster (scripted `guard` bot: holds its post, auto-fires in range, lunges at the nearest foe within an aggro radius, snaps back once past its leash) is bunched on the approach side of the enemy construction yard. Driving the strike force straight in eats anti-tank fire from the whole cluster and hits the attrition cap before the objective falls. The winning policy commits a cheap fast bait (jeep) on a DIVERGENT vector that comes just close enough to make a SIDE of the cluster lunge off post after the bait, then runs the strike tanks around through the now-vacated flank to destroy the construction yard while it is briefly undefended. Phase 1 (the bait pull) yields zero objective progress and looks locally negative; phase 2, through the slot the bait opened, scores. The credit problem is FEINT-AND-FLANK: the bait must displace a segment of the cluster and the strike must exploit the reversible vacancy in time; bait-only never razes the yard, brute frontal trades the strike force against the full cluster, stalling loses the clock.

Sacrificial-decoy planning against a reactive guarding obstacle: a no-reward enabling action (entice a leashed interceptor off the objective with a decoy on a divergent vector) must precede and temporally overlap the rewarded objective action (commit the payload force on the briefly-cleared flank to destroy the objective before the interceptor snaps back).

SC2 bait micro / sacrificial pull, military feint-and-flank doctrine, CICERO / Diplomacy deception

combat-divide-and-conquer

Combat Reasoning — Divide and Conquer Two Mutually-Supporting Enemy Clusters

reasoning

rush-hour-arena

Four medium tanks (2tnk) at the west edge face TWO enemy clusters at x=60 — a NORTH cluster at y=15 and a SOUTH cluster at y=25 — each composed of 3 anti-tank rocket soldiers (e3, Dragon, range ~5) plus 1 light tank (1tnk). The clusters are 10 cells apart; pushing east on the midpoint axis (y=20) puts the lead tank inside weapon range of BOTH clusters at once, and under AttackAnything stance the 8-unit mass converges and concentrates fire on the lead, busting the survival bar before either cluster is cleared. The winning play is divide-and-conquer (defeat in detail): flank well NORTH (e.g. y=5) so only Cluster A's units are in range and line-of-sight; eliminate A while Cluster B is still ≥20 cells away and closing; then pivot SOUTH (y=35) and repeat against B in isolation. Each engagement is a clean 4-vs-4 trade instead of a 4-vs-8 mass.

Two-front engagement with mutually-supporting defenders: hitting the midpoint between coupled adversaries lets BOTH fire on the leading agent simultaneously, doubling the incoming DPS share. Spatial positioning that breaks line-of-sight or stays outside the second adversary's aggro / weapon envelope sequences the encounter into two 1-vs-1 trades (favourable force ratio per trade), at the cost of route length and clock.

SMAC squad-isolation, CICERO splitting, military divide-and-conquer
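The midpoint trap described in this row can be checked with a few lines of geometry. The sketch below is illustrative only: it treats each cluster as a single point at the coordinates quoted above, uses the quoted Dragon range of about 5 cells as a hard Euclidean cut-off, and the helper name `clusters_in_range` is hypothetical rather than part of the benchmark.

```python
import math

# Cluster centres and weapon range are taken from the scenario text.
# Modelling each cluster as one point with a hard range cut-off is an assumption.
CLUSTERS = {"north": (60, 15), "south": (60, 25)}
WEAPON_RANGE = 5.0


def clusters_in_range(pos, clusters=CLUSTERS, weapon_range=WEAPON_RANGE):
    """Return the sorted names of clusters whose weapon envelope covers pos."""
    return sorted(
        name
        for name, centre in clusters.items()
        if math.dist(pos, centre) <= weapon_range
    )


# On the midpoint axis both clusters can fire on the lead tank.
print(clusters_in_range((60, 20)))
# An illustrative position north of the NORTH cluster is covered by that cluster only.
print(clusters_in_range((60, 10)))
```

Under these assumptions the approach bearing alone decides whether the lead tank faces one cluster or two, which is the decision the row puts under test.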

combat-flanking-attack

Combat Micro — Flank a Frontal Anti-Tank Line Instead of Charging Head-On

action

rush-hour-arena

Four medium tanks (2tnk) at the west edge face a tight vertical line of anti-tank rocket soldiers (e3) at x=60, fronted by a shield of rifle infantry (e1). The line "faces west": every defender's weapon envelope covers a tank approaching head-on along the engagement axis (y=20). Charging head-on puts the lead tank inside Dragon range of EVERY rocket soldier in the stack simultaneously and inside small-arms range of the entire shield; concentrated fire destroys the lead before the column clears the line, busting the survival bar. The winning play is the flank: move the strike force off-axis (north or south of the line, well beyond y=18 or y=22 respectively), then approach the line END-ON so only one or two defenders are in range at any moment. Same enemy, same forces, but the angle of attack controls how many enemy barrels can bear on the leading tank.

Multi-agent strike-package geometry: the bearing of contact against a static defender controls how much of the defender's firepower can engage simultaneously. The decision under test is spatial routing — refuse the high-attrition frontal trade and approach via an axis that sequences the engagement 1-vs-1 rather than 1-vs-N.

SC2 flank micro, military flank maneuver doctrine, tactical: avoid frontal trade, force-multiplier through angle of attack

combat-focus-fire-priority

Focus-Fire Priority — Kill the Anti-Tank Threat FIRST

action

rush-hour-arena

Four medium tanks face a small mixed enemy squad at close range: a single high-DPS rocket soldier (anti-vehicle) escorted by 2-3 rifle infantry, all visible in the centre. The decision is target prioritization: concentrate ALL tanks' fire on the rocket soldier FIRST (4-vs-1 kills it in 1-2 decision turns before it can fire more than a couple of rockets), then mop up the infantry. Spreading fire across the squad — via attack_move auto-targeting or by starting on the closer rifle infantry — leaves the rocket soldier alive long enough to kill ≥2 tanks, busting the attrition cap. A brute attack-anything play LOSES; the focus-fire play WINS.

Military strike-package doctrine — when engaging a mixed enemy package, the high-value / high-DPS asset (AA battery, anti-armour weapon, command vehicle) is neutralised FIRST, even when it is not the nearest target. The lower-DPS escort is mopped up after. Target prioritization under fire is the controlled capability: nearest-first or auto-target spreads damage and lets the priority threat keep firing.

SC2 focus-fire micro, MicroRTS target prioritization, military strike-package: hit AA first, RPS-counter unit prioritization
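The difference between nearest-first and threat-first targeting in this row can be sketched as a sort key. Everything numeric here is an assumption made for illustration: the threat weights and distances are hypothetical, and only the ordering (rocket soldier above rifle infantry) comes from the scenario text.

```python
from dataclasses import dataclass


@dataclass
class Enemy:
    name: str
    kind: str        # "e1" rifle infantry or "e3" rocket soldier
    distance: float  # cells from the tank squad (hypothetical values)


# Hypothetical threat weights; only the ordering e3 > e1 is taken from the text.
THREAT = {"e3": 10, "e1": 1}


def pick_target(enemies):
    """Choose the highest-threat enemy; distance only breaks ties."""
    return min(enemies, key=lambda e: (-THREAT[e.kind], e.distance))


squad = [
    Enemy("rifle-1", "e1", 3.0),
    Enemy("rifle-2", "e1", 4.0),
    Enemy("rocket", "e3", 6.0),
]

nearest = min(squad, key=lambda e: e.distance)  # what nearest-first would pick
priority = pick_target(squad)                   # what threat-first picks
print(nearest.name, priority.name)
```

The nearest-first choice lands on a rifleman while the threat-first choice lands on the rocket soldier, even though it is the farthest unit in this made-up layout.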

combat-formation-tank-wedge

Combat Micro — Tank Wedge Through a Bracketing Fire Corridor

action

rush-hour-arena

Five medium tanks (2tnk) at the west edge must drive east through a bracketing fire corridor (anti-tank rocket soldiers nested above AND below the engagement axis, plus a single light tank blocker on-axis) and reach an objective region in the east with most of the force intact. Marching as a column along y=20 puts the lead tank inside Dragon range of BOTH brackets at once; concentrated cross-fire destroys the lead, then the next tank inherits the kill envelope, and the column bleeds itself dry before clearing the gap. The winning play is the WEDGE — lead tank on-axis at y=20 absorbing the on-axis blocker's fire, flankers offset to y=18 and y=22 trailing one cell west so they engage each bracket END-ON from off-axis (only 1-2 rocket soldiers from each cluster can fire on a flanker at once). The formation's SHAPE controls how much enemy firepower can bear on the lead unit and sequences the bracket engagements into 1-vs-1 trades instead of N-vs-1 crossfire.

Multi-agent strike-package formation: the geometric SHAPE of the moving formation controls weapon-bearing concurrency on both sides. A column maximises target overlap on the lead unit (worst case for surviving cross-fire); an inverted-V wedge spreads the engagement across the formation's width so each enemy bracket faces only 1-2 of the wedge's units at a time, and the lead's job is to ABSORB the on-axis threat while the flank wings dismantle the off-axis threats. The decision under test is formation discipline: order the force into a wedge BEFORE the contact, not after losses force a retreat.

military tank-wedge doctrine, SC2 formation micro, combined-arms, strike-package geometry: formation shape vs cross-fire

combat-harass-aggro-commit

Combat Harass — AGGRO Commit: Fight Through the Defender

action

rush-hour-arena

A small raider force is staged west; a cluster of enemy harvesters works an ore patch around their refinery in the centre-east, with a single heavy tank standing on-post as the defender. The high-score doctrine is to COMMIT — concentrate fire on the defender first (3-vs-1 tank trade favours the attacker), then mop up the undefended harvesters. Stalling or pure retreat misses the kill bar; attacking the harvesters while ignoring the defender lets the heavier 3tnk pick off the raiders one by one. The intended play accepts some attrition to score both targets: aggro raid doctrine, not skirmish-and-pull-back.

Military forward-attrition / aggressive raid doctrine — a small expeditionary force commits to a higher-impact engagement (defender + payload) rather than disengaging at first contact, accepting losses to achieve the larger objective. The decision is threat-priority and commitment, not micromanagement of retreats.

SC2 aggressive worker harass with commit, military forward attrition warfare, RTS aggro doctrine: accept losses for higher kill ratio, guerrilla raids with kill-everything objective

combat-harass-balanced-hit-and-run

Balanced Hit-and-Run — Pulsed Worker Harass under a Leashed Defender

reasoning

rush-hour-arena

BALANCED pulsed harassment under a reactive defender. Two fast raiders (jeeps) must score N kills against a cluster of soft enemy workers overwatched by a leashed HEAVY TANK (anti-vehicle, immune to the jeeps' anti-infantry MG and lethal to them in one volley). The defender holds its post a few cells off the cluster but lunges at any foe within AGGRO=16 of the post and snaps back past LEASH=18. Committing inside the guard's cannon envelope until the workers die loses jeeps (cap is 0 on medium — the BALANCED bar is that retreat ACTUALLY worked). Sitting out of aggro never scores. The only winning policy is the pulse: approach to MG range of the workers → strike one → retreat past leash → guard snaps back → re-engage → repeat. Retreat is load-bearing — the first pack in the bench where it is required, not merely permitted.

Pulsed manipulation under a reactive guarding agent: the policy must complete a sequence of small reward-bearing actions inside the guard's intervention radius, then disengage past the leash before damage is taken, then re-enter — without taking any hit at all. A commit-until-done policy gets damaged; a stand-off policy never scores. The temporal structure of the reward (cycle of engage / disengage) is what is being tested.

SC2 muta-harass / sentry-harass (pulsed engagement), military guerrilla warfare (hit-and-run doctrine), pulsed-load attack with retaliation avoidance, RTS balanced harass: kill workers without losing raiders
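The leashed defender in this row can be pictured as a three-state cycle. The sketch below is one plausible reading, not the benchmark's actual `guard` script: it assumes the aggro test uses the foe's distance from the post, that the leash test uses the guard's own distance from the post, and that the guard walks all the way home before holding again. `AGGRO` and `LEASH` are the values quoted above; the function name and state labels are hypothetical.

```python
AGGRO = 16  # from the scenario text
LEASH = 18  # from the scenario text


def guard_next_state(state, foe_to_post, guard_to_post):
    """Advance an assumed hold -> chase -> return cycle by one step."""
    if state == "hold":
        # The guard lunges once a foe comes within the aggro radius of the post.
        return "chase" if foe_to_post <= AGGRO else "hold"
    if state == "chase":
        # The guard gives up once it has been dragged past its leash.
        return "return" if guard_to_post > LEASH else "chase"
    # state == "return": keep walking until back on post.
    return "hold" if guard_to_post == 0 else "return"


# One pulse: raider approaches, strikes, retreats, and the guard resets.
observations = [(20, 0), (15, 0), (19, 10), (25, 19), (25, 5), (25, 0)]
state, trace = "hold", []
for foe_to_post, guard_to_post in observations:
    state = guard_next_state(state, foe_to_post, guard_to_post)
    trace.append(state)
print(trace)
```

Read this way, the raider's safe window to re-engage opens only when the trace returns to `hold`, which is why the retreat leg is load-bearing.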

combat-heli-flank

Heli Flank — Air Mobility Over a Ground Wall

action

rush-hour-arena

A two-helicopter strike must engage an enemy infantry cluster pinned behind a contiguous wall of pillboxes, a barrier that denies any ground push inside the tick budget. The decision under test is air-mobility recognition: only the helicopter `move_units` / `attack_unit` flight path crosses the wall; sending a ground unit at the same target stalls against the impassable footprint and never reaches the cluster. A stall (`observe` only) never engages; a ground brute push (no ground units exist in the agent force, so this degenerates to "no valid plan") fails the kill bar.

Vertical envelopment / airmobile assault doctrine — the helicopter exists to bypass terrain that denies the ground arm. The same logic as a UAV swarm routing over a denied surface corridor: pick the right modality for the obstacle.

AH-64 deep-attack flight profile, airmobile vertical envelopment

combat-hold-chokepoint

Combat Micro — Hold a Narrow Pass Against a Larger Force

action

generated

A small medium-tank squad must defeat a numerically larger enemy light-tank force by HOLDING the chokepoint, a corridor (its width set per difficulty tier) that is the only path across the map. The terrain caps how many attackers can bring weapons to bear at once: anchored at the corridor mouth, the squad faces only the few enemies that fit the lane abreast and grinds the larger force down piecemeal. The same squad fighting in the OPEN (having charged east through the corridor, or having pulled west and let the enemy spill into the open) is surrounded by the whole force at once and focus-fired down. The decision under test is positional: anchor the squad IN the chokepoint where the geometry does the work.

Defensive positioning at a terrain bottleneck: a small team holds a corridor / doorway / bridge so an adversary's numerical advantage is neutralised by frontage — only as many opponents as the bottleneck is wide can engage simultaneously. Choosing to fight where the geometry caps the enemy's effective force (rather than on open ground where the full force concentrates) is the load-bearing spatial-reasoning decision. Anchors: military chokepoint / defile defense, Thermopylae (480 BC), the StarCraft 2 ramp hold.

military chokepoint defense, Thermopylae, SC2 ramp hold

combat-kite-and-pull

Combat Micro — Kite and Pull a Slow Heavy Force

action

generated

A fast light strike force must destroy a slower, heavier enemy that out-trades it head-on. The only winning play is the hit-and-PULL cycle: each turn, strike the heavy at weapon range, then RETREAT the strike force out of the heavy's lethal close-range window before it can fire back — and repeat. Standing and fighting LOSES: the heavy cannon collapses the light force's HP before its own runs out. The skill being measured is combat micro under a mobility asymmetry — exploit the speed edge by stringing together move-away + attack cycles instead of issuing one beeline charge.

A fast/light agent team defeating a slow/heavy adversary by exploiting a mobility asymmetry: a closed-loop evade-then-engage policy rather than a one-shot commit. The per-turn decision is proximity control — stay outside the adversary's lethal radius while delivering effect at standoff range.

SC2 kiting micro, cavalry skirmish doctrine, military fire-and-maneuver doctrine, economy-of-force

combat-kite-jeep-vs-tank

Combat Micro — Kite a Slow Heavy Tank with Fast Raiders

action

rush-hour-arena

Three fast tank raiders must kill ONE enemy heavy tank that is actively HUNTING the raiders' centroid. The only winning play is kiting: each turn, if the heavy tank is closing into one-shot range, MOVE the raiders AWAY (using their speed advantage) and attack_unit the heavy from outside its lethal close-range window. Repeat until the heavy falls. Stand-and-fight LOSES — the heavy tank's cannon out-trades raider weapons at close range and a static engagement collapses raider HP before the heavy's. The skill being measured is combat micro: target a moving threat, exploit the unit-speed asymmetry, and string together a sequence of move-away + attack_unit cycles instead of issuing one beeline order.

Fast/light agents defeating a slow/heavy adversary by exploiting a mobility asymmetry: a closed-loop policy of evade-then-engage, rather than a one-shot beeline. The decision under test is per-turn proximity control (stay outside lethal radius, fire at range).

SC2 kiting micro (vulture/muta-vs-marines), cavalry-vs-pikeman maneuver, military fire-and-maneuver doctrine, skirmisher tactics
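The per-turn proximity control named in this row reduces to a small decision rule. This is a minimal sketch under stated assumptions: the lethal radius and raider weapon range are hypothetical numbers, and the returned labels are placeholders for whatever `move_units` / `attack_unit` orders the agent would actually issue.

```python
def kite_step(dist_to_heavy, lethal_radius, weapon_range):
    """Pick this turn's action from the distance to the hunting heavy tank."""
    if dist_to_heavy <= lethal_radius:
        return "retreat"   # inside the heavy's close-range window: move away first
    if dist_to_heavy <= weapon_range:
        return "attack"    # outside the lethal window but in own range: fire
    return "approach"      # too far to hit: close the gap


# Hypothetical envelope values, chosen only to exercise the three branches.
LETHAL_RADIUS = 3.0
WEAPON_RANGE = 6.0

for d in (2.0, 5.0, 9.0):
    print(d, kite_step(d, LETHAL_RADIUS, WEAPON_RANGE))
```

A stand-and-fight policy corresponds to never taking the `retreat` branch; the rule only works when the raiders' weapon range exceeds the heavy's lethal radius, which is itself an assumption here.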

combat-naval-shore-strike

Naval Shore Strike — DD shells a coastal garrison

action

generated

A destroyer on water faces a small infantry garrison on the adjacent shore. The decision: target the garrison with the destroyer's primary armament from across the water (the destroyer cannot move onto land). A stall policy never engages; a policy that tries to drive the destroyer ashore stalls at the shoreline.

Naval gunfire support — the asset operates in a domain (water) that constrains its movement and weapon employment, but its weapon range crosses the domain boundary onto land.

naval-mvp, amphibious gunfire support

combat-pincer-coordination

Pincer Attack — Two Squads Strike Simultaneously From Two Sides

action

rush-hour-arena

Two armoured squads start at OPPOSING latitudes on the west edge (one to the north, one to the south) and must converge on a central enemy cluster simultaneously, hitting the defender from two sides at once. Sending a single squad alone fails on two counts: only 3 tanks (not the required 4) can occupy the objective region, and the lone squad is shredded by the cluster's mass anti-armour before clearing it. Sending both squads without synchronising them (one held back, the other commits first) lets the cluster focus on the lead squad and destroy it before the trailing squad arrives, busting the attrition cap. Only a true simultaneous two-prong commit clears the cluster cleanly and leaves the joint force standing on the objective.

Synchronous two-team pincer attack on a contested objective: each team approaches on a different bearing and both must arrive within a common window so the defending agents cannot focus on either team in detail. Serialising one team behind the other lets the defender concentrate fire and destroy the lead team before reinforcement; only the joint simultaneous commit overwhelms the defence and preserves the strike force.

SC2 multi-prong / two-prong attack timing, military pincer movement doctrine, envelopment from multiple angles, synchronisation of dispersed forces
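One simple way to think about the common arrival window is to hold the faster squad until its travel time matches the slower one. The sketch below assumes travel times are known in advance and measured in the same units; the squad names and numbers are hypothetical, and the benchmark may score synchronisation differently.

```python
def departure_delays(travel_times):
    """Delay each squad so that all of them arrive when the slowest one does."""
    latest = max(travel_times.values())
    return {squad: latest - t for squad, t in travel_times.items()}


# Hypothetical travel times (in ticks) for the two prongs.
delays = departure_delays({"north": 40, "south": 55})
print(delays)
```

Here the northern squad would wait 15 ticks so that neither prong reaches the cluster alone.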

combat-prevent-retreat

Combat Encirclement — Cut Off the Enemy's Retreat Before You Strike

action

rush-hour-arena

A cluster of enemy rifle + rocket infantry sits at the centre of the map. A head-on column charge brings every tank within rocket range of the cluster and bleeds the strike force below the survival cap. The winning play is the Cannae / encirclement idiom: detach ONE tank on a flank route (around the enemy cluster via y=5..10 or y=30..35, out of rocket range) to take the eastern anvil position at (85,20), then commit the main body of THREE tanks from the west. The win predicate explicitly requires the agent to have ESTABLISHED an eastern cut-off (≥1 own unit within r=8 of (85,20)) at the moment the kill bar is met AND to have ≥3 tanks alive. Brute charging without the cut-off bleeds the force under the survival cap: a column attack-moving into the compact stationary e3 wall loses ≥2 tanks whether it heads east or for the centre. Stalling never opens fire and is also a LOSS on the clock.

Multi-agent strike geometry: the placement of a "blocker" / "anvil" unit on the opposite side of the target BEFORE the main engagement opens is what converts a shove into a kill. The decision under test is spatial sequencing — establish the cut-off via a path that avoids the enemy's effective weapon envelope, THEN engage; reversing the order forfeits the capability that the cut-off enables.

military encirclement, Cannae doctrine, SC2 encirclement, the hammer needs an anvil
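The win predicate quoted in this row can be written out as a short check. This is a sketch, not the benchmark's evaluator: the anvil centre, radius, and tank floor come from the text, while the Euclidean distance metric and the function name `cutoff_win` are assumptions.

```python
import math

ANVIL = (85, 20)      # cut-off centre from the win predicate
ANVIL_RADIUS = 8      # r=8 from the win predicate
MIN_TANKS_ALIVE = 3   # survival floor from the win predicate


def cutoff_win(own_tanks, kill_bar_met):
    """own_tanks: (x, y) positions of the agent's living tanks at evaluation time."""
    anvil_held = any(math.dist(p, ANVIL) <= ANVIL_RADIUS for p in own_tanks)
    return kill_bar_met and anvil_held and len(own_tanks) >= MIN_TANKS_ALIVE


# One tank on the anvil plus three in the main body satisfies all clauses.
print(cutoff_win([(85, 20), (50, 20), (52, 21), (51, 19)], True))
# Four tanks charging from the west never establish the cut-off.
print(cutoff_win([(50, 20), (52, 21), (51, 19), (49, 20)], True))
```

Because the anvil clause is evaluated at the moment the kill bar is met, the flanking tank has to be in place before the main body finishes the engagement.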

combat-protect-vip-escort

Protect the VIP — Escort a Fragile Unit Through the Hazard

action

rush-hour-arena

A single fragile high-value unit (the VIP — an unarmed harvester, the only one of its kind on the map) must reach the east extraction point. Four medium tanks at the same west staging zone are the protective detail. The VIP cannot fight (no weapon); an enemy interception force — a Soviet heavy tank backed by rifle infantry — patrols the route and runs the lone VIP down. The escort must move AHEAD, destroy the interceptors, and only then let the VIP cross. Sending the VIP alone or at the head of the column gets it caught and killed; the escort's offence preserves the VIP.

Protected-asset transit / VIP detail / diplomatic convoy: a high-value principal (unarmed, fragile) must traverse a contested route. The protective detail moves in bounding formation ahead and to the flanks, neutralising threats along the route before the principal traverses. The principal does not engage directly — its survival depends on the detail's offence, not its own fight.

SC2 VIP-survival missions, military VIP protection doctrine, diplomatic security / convoy escort, fragile-target preservation tactics

combat-retreat-after-engagement

Combat Retreat — Disengage to Preserve the Force

reasoning

rush-hour-arena

Four medium tanks face a numerically superior enemy squad (rocket infantry + heavy tanks) at close range. The fight is unwinnable head-on: the rocket infantry and heavy cannons together out-trade the column inside the loss cap. The only winning play is the SC2 retreat-micro idiom: commit briefly to score the kill bar on the soft anti-tank infantry (e3), then break contact and pull the surviving tanks BACK to the safe rally point in the west before attrition busts the force-preservation cap. The decision under test is disengage TIMING — stalling never engages (kill bar unmet), brute attack-until-death loses the force, and only the engage-then-retreat play wins.

Military tactical-withdrawal doctrine and the SC2 retreat-micro pattern — when a battle goes poorly, the policy must IDENTIFY the losing trade and pull back to a safe rally before the unit is destroyed. The skill is reading the engagement state and issuing a withdrawal order at the right moment; over- or under-commitment both fail. Preservation of force is its own objective.

SC2 retreat micro / disengage timing, military tactical withdrawal doctrine, preservation of force / live-to-fight-another-day, skirmish-and-pullback tactics

combat-rocket-soldier-anti-vehicle

Rocket Soldier vs Heavy Armour — Pick the Hard-Counter Unit Type

reasoning

rush-hour-arena

A starting cash budget that funds exactly one coherent composition must be allocated against a pre-placed enemy band of HEAVY TANKS (3tnk Soviet heavy on easy/medium, 4tnk Mammoth on hard). The agent must train ROCKET SOLDIERS (e3, the anti-vehicle Dragon launcher), not light tanks (1tnk — the budget buys only ~2, which lose attrition to heavy armour) and not rifle infantry (e1 — no anti-armour weapon, racks up zero kills against tank armour). The decision is the RPS counter CHOICE (matchup-winning unit type) given the visible threat composition.

Anti-armour procurement against an armoured concentration: a fielded inventory of TOW / Javelin / RPG launchers defeats a main-battle-tank column, while an equivalent budget spent on light AFVs or rifle squads is force-on-force inferior. Asset CLASS — not cost or count — determines the engagement outcome.

SC2 hard-counter, anti-armor procurement, military RPS

combat-skirmish-then-disengage

Combat Skirmish — Strike, Score the Kills, Pull Back to Recovery

action

rush-hour-arena

SKIRMISHER doctrine in the single-engagement frame: four fast raiders (jeeps) must drive east into a slow infantry cluster, score AT LEAST 3 kills, and then PULL BACK to the recovery zone around the western start before the clock expires AND while keeping at least 3 raiders alive. The skill under test is the decision to STOP FIGHTING and disengage — committing until the enemy is wiped or until the strike force is destroyed both LOSE (commit leaves the raiders at the kill site instead of the recovery zone; over-commit on hard loses raiders to the hunt-bot spawn waves). Distinct from the BALANCED pulsed harass-retreat cycle (combat-harass-balanced-hit-and-run, which is many small pulses with zero attrition): this pack is ONE big engagement done well, with a positional recovery bar.

Mission-with-egress: a mobile manipulator must complete a threshold of reward-bearing actions in a contested workspace, then return to a safe staging region before a time or attrition budget expires. Knowing WHEN to stop the productive sub-task and start the egress is the decision under test — a productivity-only policy (greedy accumulation) leaves the agent far from the staging region at deadline and fails the egress clause.

SC2 skirmisher tactics, military reconnaissance-by-fire, harass-and-disengage doctrine, armoured cavalry doctrine

combat-stance-mgmt-attack

Hunt Authorisation — Lift Stance to Pursue Scattered Enemies

action

rush-hour-arena

An engagement-authority escalation drill: four medium tanks are pre-staged at the west edge of the arena on RETURN-FIRE (stance:1) — they will not open fire unless attacked. A scattered enemy force (riflemen + a light tank) is spread across the eastern half of the map at positions that don't bring the fight to the agents. The mission is to call `set_stance(units, 3)` to escalate the formation to AttackAnything so the engine's stance:3 hunt path advances each tank to the nearest scattered enemy and wipes them out. Without the escalation the formation idles (return-fire never triggered, hunt never licensed) and the deadline bites as a real LOSS.

A patrol-vs-pursuit authority switch: a perimeter defence team is on weapons-tight standing orders (return fire only). The operator detects scattered hostiles deep in the perimeter that aren't engaging the defenders. The operator must issue the "weapons free, hunt and clear" command to escalate the team from defensive return-fire to active pursuit. The skill is the operator's decision to ESCALATE the engagement-authority at all — not the precise moment within the engagement window.

military ROE escalation, SC2 stance micro
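The stance levels this row relies on can be named with a small enum. The values 1 (return fire) and 3 (AttackAnything) are taken from the scenario text; the values for hold-fire and defend, and the simplified firing rule below, are assumptions made for illustration and are not the engine's actual targeting logic.

```python
from enum import IntEnum


class Stance(IntEnum):
    HOLD_FIRE = 0        # assumed value
    RETURN_FIRE = 1      # stance:1 in the scenario text
    DEFEND = 2           # assumed value
    ATTACK_ANYTHING = 3  # stance:3 in the scenario text


def will_fire(stance, under_attack):
    """Simplified rule: does a unit on this stance open fire at all?"""
    if stance == Stance.HOLD_FIRE:
        return False
    if stance == Stance.RETURN_FIRE:
        return under_attack
    return True  # DEFEND and ATTACK_ANYTHING engage without being hit first (assumed)


def will_hunt(stance):
    """Only AttackAnything licenses pursuit of scattered enemies."""
    return stance == Stance.ATTACK_ANYTHING


print(will_fire(Stance.RETURN_FIRE, under_attack=False), will_hunt(Stance(3)))
```

With the enemies scattered and not attacking, the return-fire formation neither fires nor hunts, which is the idle outcome the row describes before `set_stance(units, 3)` is issued.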

combat-suicide-charge-mission

Suicide Charge — Sacrifice the Force to Raze a High-Value Objective

reasoning

rush-hour-arena

Forlorn hope / military sacrifice doctrine. A high-value enemy construction yard (`fact`) sits at the far east of the map, defended by an anti-armor picket; the agent's small strike package at the west cannot punch through without heavy losses. The OBJECTIVE is the building's destruction — keeping the force intact is NOT required and is not achievable. A "preserve the force" policy that engages carefully and tries to save units loses on the clock (the picket grinds the lead element down and the rest never arrive); a half-commit that advances and halts short of the objective also times out. Only an all-in commit that drives the WHOLE force decisively at the objective and focus-fires through the picket — trading most of the attackers for the kill — actually razes the fact in time. This is delayed-terminal credit assignment with an explicit cost-vs-objective trade-off: every unit lost looks locally negative, but the only successful plan accepts that the bulk of the force is spent and only a remnant survives to land the killing blow.

Expendable strike package / single-use intervention: a fleet of cheap drones must destroy a critical adversary asset under a hard deadline; loiter or stand-off attacks miss the window, so the planner must commit every platform on a one-way attack trajectory, accepting total platform loss as the operating cost of mission success.

military sacrifice doctrine / forlorn hope (West & East canonical), SC2 expendable strike package (all-in cost-objective trade), MicroRTS / SC2LE attack-the-base under deadline with attrition-OK

combat-tank-vs-tank-engagement

Tank-vs-Tank Mirror — Focus-Fire, Lanchester Square Law

action

rush-hour-arena

A three-tank strike force engages a stationary enemy tank line. The decision under test is combat micro: close to cannon range, HOLD the engagement at range, and concentrate `attack_unit` fire on one target at a time — eliminate the nearest enemy, then the next, working down the line. Per the "concentration of force" doctrine and the Lanchester square law, a force that holds and focus-fires removes enemy OUTPUT DPS one whole tank per kill and clears the line keeping its strength; a force that brute `attack_move`s straight INTO the enemy position bunches itself in the enemy's midst, absorbs the whole line's crossfire at once, and is wiped before it can clear the engagement. On medium the agent is numerically out-gunned 4-vs-3, so the controlled engagement is load-bearing: only a held, concentrated focus-fire push clears ≥3 of the 4 enemy tanks while keeping ≥2 of its own. Stalling never engages and loses on the kill bar; the brute drive-in loses on the survival cap / kill bar; only the controlled focus-fire engagement wins.

Military "concentration of force" doctrine (one of the Principles of War): a smaller or equal force concentrated at the decisive point can defeat a numerically equivalent dispersed enemy. The per-kill removal of enemy OUTPUT DPS is the closed-form Lanchester square-law advantage of concentrated fire; the test mirrors the SC2 mirror-tank micro / marine-vs-marine engagement where the side that focus-fires one target at a time wins the trade against a numerically equal foe spreading fire across the whole line.

SC2 mirror micro, Lanchester square law, concentration of force
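The advantage of concentrated fire can be seen in a toy discrete-time duel. This is an illustrative model rather than the closed-form Lanchester equations or the game's combat rules: all units share a hypothetical 100 HP and 10 damage per tick, both sides fire simultaneously, and overkill from focused fire carries over to the next target.

```python
def apply_focus(hps, damage):
    """Pour damage into the first living unit, carrying overkill to the next."""
    for i, hp in enumerate(hps):
        if damage <= 0:
            break
        if hp > 0:
            dealt = min(hp, damage)
            hps[i] = hp - dealt
            damage -= dealt


def apply_spread(hps, damage):
    """Split damage evenly across all living units."""
    alive = [i for i, hp in enumerate(hps) if hp > 0]
    for i in alive:
        hps[i] = max(0.0, hps[i] - damage / len(alive))


def duel(n_a, n_b, a_fire, b_fire, hp=100.0, dmg=10.0):
    """Return (survivors_a, survivors_b); a_fire is how side A hits side B."""
    a, b = [hp] * n_a, [hp] * n_b
    while any(h > 0 for h in a) and any(h > 0 for h in b):
        # Both volleys are sized before either lands, so fire is simultaneous.
        dmg_from_a = dmg * sum(h > 0 for h in a)
        dmg_from_b = dmg * sum(h > 0 for h in b)
        a_fire(b, dmg_from_a)
        b_fire(a, dmg_from_b)
    return sum(h > 0 for h in a), sum(h > 0 for h in b)


focus_vs_spread = duel(4, 4, apply_focus, apply_spread)
spread_vs_spread = duel(4, 4, apply_spread, apply_spread)
print(focus_vs_spread, spread_vs_spread)
```

In this toy model the focus-firing side removes one enemy gun at a time and finishes an even fight with all four tanks alive, while two spreading sides annihilate each other. It does not attempt to reproduce the out-numbered medium-tier case.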

combat-tanya-vs-rush

Tanya vs e1 Rush — Hero Engagement

action

rush-hour-arena

A single Allied commando (Tanya) holds the centre against a small rush of conscript rifle infantry. Tanya is a hero unit: 3x the HP of a basic rifleman, fast, and her sidearm out-DPSes a small e1 pack. But she spawns on hold-fire stance, so a do-nothing policy leaves her standing still under fire and she dies. The decision is "actively engage the hero asset" vs "observe / stall". The intended policy issues attack-move (or flips her stance to defend / attack-anything) and wins; stalling loses.

Hero / commando doctrine — a single high-DPS asset is force-multiplying only when active. A door-kicker robot held in standby while threats approach is wasted. The capability is "commit the hero asset to the engagement" — operators who fail to commit, lose the asset and the engagement.

RA hero unit (Tanya commando), RTS micro: engage the hero, asymmetric high-value asset doctrine

combat-target-priority-highvalue

Target Priority — Kill the High-Threat Units FIRST

action

rush-hour-arena

Four medium tanks face a mixed enemy cluster at close range: a screen of cheap rifle infantry (e1 chaff) backed by THREE high-threat anti-armour rocket soldiers (e3). The decision is threat-weighted target prioritization: concentrate ALL FOUR tanks' fire on the rocket soldiers FIRST (each dies in 1-2 decision turns to 4-vs-1 fire, before it can land more than a couple of anti-tank rockets), THEN mop up the chaff. Engaging the chaff first — via attack_move auto-targeting the nearer rifle screen, or by explicitly attacking the cheap infantry — leaves the three rocket soldiers firing through the entire mop-up and they whittle the squad below the survival floor. A brute attack-move play and a kill-chaff-first play both LOSE; the threat-first focus play WINS.

Military target prioritization under fire — when engaging a mixed force, the high-threat asset (anti-armour weapon, AA battery, command vehicle) is neutralised FIRST, even when it is not the nearest target, because it out-damages the cheap escort. Killing the low-threat chaff first lets the priority threat keep firing and attrits the friendly force. Threat-weighted engagement — not nearest-first — preserves the squad.

SC2 focus-fire target priority, threat-weighted engagement, military target prioritization

combat-vehicle-vs-infantry-counter

Hard-Counter Selection — Tanks vs Rockets vs an Infantry Threat

reasoning

rush-hour-arena

A fixed cash budget funds EITHER a small armoured fist (3× 2tnk medium tanks @ $850 each) OR a mass anti-tank rocket squad (8× e3 rocket soldiers @ $300 each) OR a rifle company (up to 25× e1 @ $100 each). The enemy is PURE rifle infantry (e1 mass entrenched at centre). The correct rock-paper-scissors counter is armour: the medium tanks soak small-arms fire and shred soft targets at range. Rockets are the WRONG counter: anti-tank munitions against soft infantry are poor cost-per-effect, and the rocket squad's short stand-off and low HP get out-DPSed by the rifle mass on attrition. Matching the enemy with own rifles is a 1:1 trade with no positional advantage and loses. Stalling never reaches the kill bar. The capability under test is hard-counter SELECTION: scout the threat composition, infer it has no anti-armour, and commit the WHOLE budget to the dominant counter.

Capability counter-selection in defense procurement / fleet composition: buying anti-tank guided missiles is wrong against a soft-target infantry threat (cost-per-effect waste), buying rifle squads matches the threat 1:1 with no advantage, buying armoured vehicles is the dominant choice. The decision under test is reading the threat profile and committing the whole capital budget to the dominant counter rather than hedging across categories.

SC2 hard-counter, military RPS counter, capability-based defense
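The three compositions in this row cost roughly the same, which is what makes the choice a matchup decision rather than a budget one. The arithmetic below uses the unit prices and counts quoted in the scenario; the budget of 2550 is an assumption (the text does not state it) chosen as the smallest figure that funds three medium tanks.

```python
# Unit prices and squad sizes as quoted in the scenario text.
OPTIONS = {
    "2tnk": {"unit_cost": 850, "count": 3},
    "e3": {"unit_cost": 300, "count": 8},
    "e1": {"unit_cost": 100, "count": 25},
}

# Total spend for each single-type composition.
totals = {name: o["unit_cost"] * o["count"] for name, o in OPTIONS.items()}

# Assumed budget: not stated in the text, picked so that three 2tnk are affordable.
ASSUMED_BUDGET = 2550
affordable = {name: ASSUMED_BUDGET // o["unit_cost"] for name, o in OPTIONS.items()}

print(totals)
print(affordable)
```

Under that assumed budget the affordable counts come out to the same 3, 8, and 25 that the row quotes, so the three options are near cost parity and only unit class separates them.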

coord-converge-on-target

Convergent Attack — Three Squads, One Defended Target

action

rush-hour-arena

Three armoured columns start on three different bearings (NORTH, WEST-of-objective, SOUTH) and must converge on a single defended enemy construction yard from three different directions. The win requires every tank of all three columns to reach the objective region with the yard destroyed: a single-column tour delivers only a third of the force, a serialized two-column attempt only two thirds — neither reaches the joint-arrival threshold. The advertised capability is synchronous multi-fleet convergence on a shared objective; a stall loses on the clock.

Synchronous multi-robot rendezvous on a contested objective: each team must dispatch on a different ingress vector so that all teams arrive at the goal; dispatching only one or two teams leaves the joint payload threshold (≥n teammates on the goal) unmet and the objective unfulfilled.

SC2 triple-prong assault timing, military convergent attack / pincer movement, SMAC squad convergence, envelopment doctrine: hit from multiple directions simultaneously

coord-cover-and-move

Bounding Overwatch — One Squad Covers While the Other Moves

action

rush-hour-arena

Two armoured squads must cross a centre-of-map FIRE ZONE held by static anti-tank infantry. Charging both squads through together stacks every enemy's fire onto a single dense column and busts the attrition cap. Sending one squad alone leaves the other idle and the lone column still absorbs ALL the cluster's fire. The intended capability is bounding overwatch: one squad stops just OUTSIDE the cluster's range and acts as the COVER team — firing on the cluster to draw its attention from the closer flank — while the BOUNDING team sweeps WIDE through the periphery (outside enemy sight) to a forward position; then the roles alternate so the cover team can also cross safely. The cluster's fire is split per phase rather than stacked on one column, so each squad takes at most one loss.

Multi-agent traversal of a contested corridor where defenders have a fixed effective range envelope and concentrate fire on whichever cluster is closest. Single-team traversal eats the full defensive fire envelope; the joint policy alternates an OVERWATCH role (suppressive fire drawing defender attention from a flank) with a BOUNDING role (sweep through the cleared periphery). The coordination is the role-alternation — leapfrogging cover and move — not just splitting the force.

US Army FM 3-21.8 bounding overwatch (fire-and-maneuver doctrine), SC2 siege-tank leapfrog advance under enemy fire, SMAC coordinated cross of a danger zone with split defender fire, MicroRTS multi-squad alternating cover-and-move

coord-diversionary-attack

Diversionary Attack — A Diverts South, B Razes the Real Target North

reasoning

rush-hour-arena

You command two squads in a split attack against an enemy that holds TWO key buildings: a REAL target (construction yard) with light defence and a DECOY (power plant) with heavier defence. The scripted `guard` enemy holds post and lunges only when baited into its aggro radius. Driving each squad onto its nearest visible target is a trap: the stronger squad (tanks) commits onto the decoy's heavy anti-tank garrison, gets shredded, and razes the wrong building (only a `fact` in the target region scores); the weaker squad (jeeps) can't crack the real target's garrison fast enough; and the clock expires. The winning policy SWAPS the assignment: commit the fast disposable jeeps south-east into the decoy's heavier defence to bait its garrison OFF the fact line; commit the tanks north-east through the now-thin fact perimeter to raze the construction yard before the surviving light garrison can trade out the column. Phase 1 (the diversion) yields zero objective credit; only phase 2, after the south defenders have committed south in pursuit, scores. The credit problem is DIVERSIONARY ASSAULT: a no-reward enabling action (drag the heavier defence away from the real target with a disposable lure squad) must precede and temporally overlap the rewarded objective action (commit the main strike on the briefly-undefended real target).

Concurrent multi-robot dispatch with deception: two independent teams must coordinate against a reactive adversary, where one team takes a NO-REWARD enabling role (drawing the adversary's attention to a decoy target on the OPPOSITE flank) while the other team — counter-intuitively assigned to the FARTHER objective — completes the rewarded task. The lazy nearest-task assignment is degenerate: every team must commit ACROSS the map to the objective its sibling enabled.

SC2 multi-prong / split-attack assault, military diversionary tactics (Sun Tzu loud feint + quiet main thrust), CICERO / Diplomacy deception, advertising: loud feint draws competitor attention while quiet launch hits real market

coord-mutual-support

Mutual Support — Advance the Squad as a Tight Ball

action

rush-hour-arena

A six-tank squad at the west edge must advance east, punch through a belt of anti-tank harasser clusters, converge on an enemy cluster near the centre, destroy the defenders, and finish with at least five of the six tanks alive. Advancing strung-out — a single-file column spread across many cells — feeds each harasser cluster one or two lead tanks at a time; the cluster concentrates its fire and destroys the isolated leaders before the rest of the squad closes up, and the force is defeated in detail. The winning play is to advance as a TIGHT BALL: keep all six tanks inside a small clump so the whole squad enters each cluster's fire envelope together, every tank's cannon concentrates on the cluster, and the defenders are erased in one or two volleys before they can finish any single tank. Mutual support means no element ever advances outside the supporting range of the others.

Multi-agent coherent advance under fire: a defender concentrates its fire on whichever agent is closest and within its weapon envelope. A strung-out formation lets the defender pick off the leading agents one at a time (one-sided focus fire); a coherent clump brings the whole team's firepower to bear at once (reciprocal focus fire) and out-trades the defender. The decision under test is formation cohesion — keep the squad inside mutual-support range throughout the advance, regrouping before each contact rather than racing the fastest units ahead.

military mutual support, SC2 ball micro, SMAC squad coherence, defeat-in-detail avoidance: keep the force concentrated

coord-relay-attack

Relay Strike — Heavy Tanks Soften, Infantry Mop Up

action

rush-hour-arena

Two-wave relay assault: the first wave (heavy tanks) commits to destroy the enemy armour line; the second wave (light infantry) follows up to clear the remaining enemy infantry. Sending the light second wave in first exposes the soft infantry to the un-suppressed enemy tank line and the wave is shredded before any kills materialise; sending both waves together exposes the light wave to the same tank fire envelope before the heavy wave can suppress them. Only the relay ordering (A engages → softens → THEN B advances) preserves the force AND finishes inside the deadline.

Two-team relay clearing operation: the suppression team (heavily armoured) commits first to neutralise the heavy threats; the mop-up team (lightly armoured, fast, optimised for the secondary target class) advances only after the heavies are down. Concurrent commitment over-exposes the mop-up team to the heavies they are not equipped for; the suppression-then-clear handoff is the doctrine.

SC2 attack-wave timing, SMAC relay strike doctrine, military overlapping fires, fire-and-maneuver: bound-and-bound

coord-relay-vision-chain

Vision Relay Chain — Space Scouts to Keep the Far Objective Observed

action

rush-hour-arena

To keep a distant objective continuously observed, scouts must be spaced in a CHAIN — one per leg of the corridor, each within vision range of the next — so coverage propagates from the base out to the far objective. Sending a single scout all the way forward leaves the corridor uncovered behind it; bunching the scouts together gives reach at one spot but no depth. Only the distributed relay chain holds the whole corridor and the far objective at once.

Multi-robot sensor-network coverage / communications relay: a swarm tasked with maintaining situational awareness of a remote site cannot send one robot out (it loses the relay back to the operator) nor keep the swarm clustered (it cannot reach). The robots must space themselves into a relay chain so each link bridges to the next and the network spans base to objective — the classic line-of-sight relay / mesh-coverage problem.

military relay chain, sensor-network coverage, communications relay, SC2 scout-line map control
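
The chain condition described in this row can be checked mechanically. The sketch below is illustrative only: it assumes positions are 2D coordinates, vision is a Euclidean radius, and scouts are ordered by distance from the base. The function name `chain_holds` and its parameters are hypothetical and are not part of the benchmark.

```python
import math

def chain_holds(base, scouts, objective, vision_range):
    """Return True if base -> scouts -> objective forms an unbroken relay chain.

    Assumption: a link is "held" when its two endpoints are within
    vision_range of each other (Euclidean distance).
    """
    # Order the scouts along the corridor by their distance from the base.
    ordered = sorted(scouts, key=lambda p: math.dist(base, p))
    links = [base, *ordered, objective]
    # Every consecutive pair must be close enough to see each other.
    return all(math.dist(a, b) <= vision_range for a, b in zip(links, links[1:]))


if __name__ == "__main__":
    base, objective = (0, 0), (30, 0)
    print(chain_holds(base, [(10, 0), (20, 0)], objective, 10))  # spaced chain
    print(chain_holds(base, [(30, 0)], objective, 10))           # lone forward scout
    print(chain_holds(base, [(9, 0), (10, 0)], objective, 10))   # bunched scouts
```

Under these assumptions the spaced chain passes, while the lone forward scout and the bunched scouts each leave a gap longer than one vision radius.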

coord-squad-handoff

Squad Handoff — Squad A Delivers, Squad B Takes Over

action

rush-hour-arena

Sequenced multi-team operations: team A delivers the first objective, team B takes over for the next, and so on — the order of arrival matters, and a single team cannot serve both legs because each leg requires a SPECIFIC kind of unit. Relay race (baton handoff), construction site handoff (steelworker phase finishes, electrician phase begins), and supply-chain ordered handoff are the everyday analogues.

Multi-robot fleet with role specialisation and sequenced subtasks: the survey drone must reach landmark A before the delivery truck is dispatched to landmark B (the delivery is meaningless without the survey). One fleet cannot do both legs because the drone and the truck are physically different vehicles for different jobs.

Watch-And-Help concurrent multi-agent handoff, SMAC squad sequencing, military passage of lines doctrine, relay race / construction handoff

coordination-ordered-rendezvous

Ordered Rendezvous — Deliver Squads to Waypoints In Sequence

action

rush-hour-arena

Parallel multi-team delivery under an ordering constraint: waypoint A must be reached before waypoint B before waypoint C, each requiring two or more units present, all within one tight overall window. Skipping or out-of-order delivery does not count. Single-column tours blow the deadline; only concurrent dispatch with leg-by-leg ordering succeeds.

Sequenced multi-depot fleet dispatch with a precedence constraint: team A must arrive at depot 1 before team B may be credited for arriving at depot 2 (e.g. handoff / supply chain), and the overall window does not permit a single team to do all legs in series.

Watch-And-Help concurrent multi-agent with sequenced sub-tasks, SMAC asymmetric squad sequencing, PERT/CPM precedence-constrained dispatch, supply-chain ordered handoff: depot A before depot B
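
One way to read the ordering constraint in this row is as a predicate over arrival events. The sketch below is an assumption-laden illustration, not the benchmark's scoring code: it assumes arrivals are logged as `(tick, unit_id, waypoint)` tuples, that a waypoint is credited at the tick its `min_units`-th distinct unit arrives, and that credit ticks must be strictly increasing in the required order. All names are hypothetical.

```python
def credit_tick(arrivals, waypoint, min_units):
    """Tick at which `waypoint` has min_units distinct units, or None."""
    first_seen = {}
    for tick, unit, wp in arrivals:
        if wp == waypoint:
            # Keep only the earliest arrival of each unit.
            first_seen[unit] = min(tick, first_seen.get(unit, tick))
    ticks = sorted(first_seen.values())
    return ticks[min_units - 1] if len(ticks) >= min_units else None


def ordered_rendezvous_met(arrivals, order, min_units, deadline):
    """True if every waypoint is credited in sequence and by the deadline."""
    ticks = [credit_tick(arrivals, wp, min_units) for wp in order]
    if any(t is None or t > deadline for t in ticks):
        return False
    # Each waypoint must be credited strictly before the next one.
    return all(a < b for a, b in zip(ticks, ticks[1:]))


if __name__ == "__main__":
    log = [(10, "a1", "A"), (12, "a2", "A"),
           (20, "b1", "B"), (22, "b2", "B"),
           (30, "c1", "C"), (31, "c2", "C")]
    print(ordered_rendezvous_met(log, ["A", "B", "C"], min_units=2, deadline=40))
```

A single column touring A, then B, then C would satisfy the ordering but would typically exceed `deadline`, which is why the row calls for concurrent dispatch.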

coordination-staggered-window

Staggered Windows — Two Docks, Two Deadlines, One Fleet

action

rush-hour-arena

Warehouse fulfilment with non-aligned per-order SLAs and one packing line per dock: every dock must be staffed during its own shipping window. A single packer touring one dock then the other (the single-column tour) cannot honour both SLAs — the first dock empties while the packer relocates. The advertised capability is parallel multi-fleet dispatch with hold-on-station, not route-planning.

Multi-robot dispatch where each robot team has a job-specific deadline that does not align with the others': teams must launch staggered (the long-haul team first) so that every team is on its own objective during a bounded common arrival window. Serialising one team behind the other blows the global SLA.

Watch-And-Help concurrent multi-agent, SMAC asymmetric squad timing, MARLÖ multi-objective deadlines, multi-robot dispatch with non-aligned SLAs

custom-map-no-enemy

Pure Navigation — Commit a Route and Reach the Goal Zone Before the Clock (No Enemy)

perception

singles-maginot

Pure spatial navigation inside a confined custom region with no adversary: read the map, commit a route through the bounded playable area, and drive the force to the designated zone BEFORE a tight, reachable deadline. There is no combat — discrimination comes only from perceiving the target and committing decisively. Idling, dithering, or wandering toward the nearest unexplored cell never reaches the zone in time and LOSES on the clock.

Confined-space autonomous navigation under a deadline (warehouse aisle, indoor map): reach a goal cell within a bounded region from a map read alone, fast enough that hesitation fails the mission.

def-bridge-chokepoint

Bridge Chokepoint — Defend Every Crossing Before the Attackers Reach You

action

rush-hour-arena

Geographic chokepoint defense. A barrier (water) channels every attacker through a small number of crossings (bridges). The defender must (a) flip the engagement stance so units fire, AND (b) distribute forces across every crossing — concentrating on one bridge leaves the others wide open and their assets fall.

Perimeter defense around a facility with a moat and discrete gates. Guards posted on standby must be told to engage AND stationed at EACH gate — an undefended gate is a free corridor for the intruder to reach the nearest asset.

ERQA spatial commit (read terrain → assign defenders to chokepoints), MicroRTS chokepoint defense, military doctrine: defense at obstacle-channelized terrain

def-counter-battery

Counter-Battery — Kill the Artillery FIRST

reasoning

generated

A frontline of enemy rifle infantry screens two or three long-range artillery pieces posted in the rear. The artillery out-ranges the base pillboxes and shells the construction yard from a standoff the static defences cannot answer. The mobile tank force must perform a counter-battery strike: drive past the infantry screen and destroy the artillery FIRST, before it razes the construction yard. Trading fire with the screening infantry while the artillery keeps firing loses the base — the highest-impact threat must be engaged first, not the nearest one.

Military counter-battery doctrine and threat prioritisation: when a hostile asset is degrading your objective from beyond your defences' reach, the correct response is a targeted strike that neutralises that asset first, even though a nearer, louder threat is competing for attention. Spending the response on the screening force while the real threat keeps operating is the canonical prioritisation failure — the same shape as a SC2 siege-tank counter, where a mobile force must close on and kill the out-ranging siege line rather than trade with its escort.

military counter-battery doctrine, threat prioritization, SC2 siege-tank counter

def-engineer-repair-under-fire

Engineer Repair Under Fire — Triage the Damaged Proc While Engaging

action

rush-hour-arena

Several of your structures are under sustained attrition from a concentrated grenadier-led rusher band, and the refinery (proc), the most critical, income-bearing building, is exposed at the front of the base. You must keep the proc alive by engaging autorepair (`repair`) the moment damage starts AND commit the pre-placed defenders (hold-fire on spawn) to kill the attackers. Stalling lets the proc die; engaging without repair lets cumulative grenade damage drop the proc before the band is cleared; toggling repair without offensive output never clears the attackers and fails the kill bar or the clock. The intended triage is: prioritise the most-damaged AND most-critical building (the proc) for repair, while the defenders do the killing.

Disaster-recovery / SRE triage: a sustained incident damages multiple services and the operator must (a) identify the most-critical-and-most-damaged subsystem (the income-bearing refinery here, the production database in SRE), (b) engage the repair organ on it WHILE (c) the on-call team contains the attacker. Stalling loses the critical subsystem; pure-engage / pure-repair single-axis plays both bust the SLA.

disaster recovery, SC2 SCV repair, repair-order triage

def-evacuation

Evacuate the Doomed Base — Retreat East to the Safe Zone

reasoning

rush-hour-arena

The primary base is being overrun by a heavy assault that structurally cannot be repelled with the units on hand — the fact and proc WILL fall regardless of what the agent does. The correct play is not to fight the doomed-defence (sunk cost) but to EVACUATE the remaining mobile force EAST to a pre-designated safe zone before attrition reduces the column below the survival floor. Win = ≥3 of the 5 starting tanks inside the safe-zone radius, ≥3 still alive, before the deadline. Stalling loses (the assault wipes the immobile tanks at the base); holding loses (the heavy + rocket composition out-trades the 2tnk column); only the prompt EVAC-east order wins.

Business-Continuity-Planning (BCP) evacuation and NFPA / FEMA emergency-management EVAC doctrine — when shelter-in-place is no longer survivable, the policy must abandon the compromised site and route to the pre-designated muster point. Same pattern in cloud-incident response (drain a region under irrecoverable outage to a healthy peer), in wildfire and chemical-spill EVAC, and in SC2 GG-walk macro micro (preserve the army when the base cannot be saved).

BCP evacuation (business continuity planning), emergency management EVAC (NFPA / FEMA muster-point doctrine), SC2 retreat / GG-walk: preserve army when base is doomed, cloud incident region-drain to healthy peer
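
The stated win condition for this pack (at least 3 of the 5 starting tanks inside the safe-zone radius, at least 3 still alive, before the deadline) can be written as a small predicate. This is a minimal sketch under stated assumptions: positions are 2D coordinates, the zone is a Euclidean circle, and "before the deadline" is read as `tick <= deadline`. The `Tank` type and the function name are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Tank:
    x: float
    y: float
    alive: bool = True


def evacuation_won(tanks, zone_center, zone_radius, tick, deadline, required=3):
    """True if enough living tanks are inside the safe zone in time."""
    if tick > deadline:
        return False
    alive = [t for t in tanks if t.alive]
    # Only living tanks can count toward the safe-zone total.
    in_zone = [t for t in alive
               if math.dist((t.x, t.y), zone_center) <= zone_radius]
    return len(alive) >= required and len(in_zone) >= required


if __name__ == "__main__":
    column = [Tank(100, 50), Tank(101, 50), Tank(99, 51),
              Tank(10, 50), Tank(12, 50, alive=False)]
    print(evacuation_won(column, (100, 50), 5, tick=800, deadline=1000))
```

Because dead tanks never count as being in the zone, holding at the base and losing three tanks makes the predicate unsatisfiable even if the survivors later reach the zone.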

def-in-depth-vs-single

Defense in Depth — Two Bands Beat One Thick Wall

reasoning

rush-hour-arena

A heavy enemy wave drives straight at your construction yard. The same finite pillbox budget massed into ONE thick wall has no depth: when the wave punches a hole, the survivors sail through the breach and raze the base. The SAME pillboxes split into a FRONT line (which absorbs and thins the first impact) plus a REAR line at greater depth (which shreds the survivors that break through) contains the breach instead of suffering it. Same resources, different topology, qualitatively different survivability. The win predicate makes the topology decision load-bearing: total pillbox count alone is not enough — the pillboxes must be split across two non-overlapping depth bands, the construction yard must survive, and the wave must actually be destroyed.

Layered security architecture (defense-in-depth): a single perimeter firewall, however thick, eventually gets breached; the same defensive capacity deployed as an outer layer plus an inner layer absorbs the breach — a single component failure is contained, not catastrophic. The same logic as a medieval keep behind an outer wall (the attacker must take BOTH), or redundant systems architecture where one failure does not bring down the system.

military defense-in-depth, security layered defense, MicroRTS defense

def-in-depth

Defense in Depth — Two Layers, Not One Thick Wall

reasoning

rush-hour-arena

Defense-in-depth doctrine: against a massed, concentrated rush, a front line plus a second line at greater depth shreds the attacker after the first layer falls, whereas a single fortified line, even one built from the same number of pillboxes, eventually collapses under sustained pressure and lets the attacker raze the base behind it. Picking concentration vs distribution is the reasoning load: the same four pillboxes massed in one cluster lose, the same four pillboxes spread thin in one long line lose, and only a 2+2 layered topology survives.

Multi-layer security architecture (defense-in-depth): a single perimeter firewall, however thick, eventually gets breached; the same defensive capacity deployed as outer + inner layers (each covering the same arc) absorbs the breach. Same total resources, different topology, qualitatively different survivability. Also: medieval keep + outer-wall design (attackers must take BOTH); redundant systems architecture (one component failure ≠ system failure).

military defense-in-depth doctrine, security multi-layer architecture (defense-in-depth), medieval keep + outer wall design, redundant systems architecture

def-multi-direction

Distributed Defense — Split Across Concurrent Lanes

reasoning

rush-hour-arena

Enemy waves attack the centrally-placed base concurrently from multiple cardinal directions. Distributing defenders evenly across each active lane holds every approach; concentrating on one lane leaves the other lanes uncontested — those waves stride untouched to the construction yard and raze it. This is the canonical multi-front / graph min-cut allocation problem: defensive capacity must be matched to the cardinality of concurrent threats, not to the loudest single threat.

Distributed-systems load balancing under N concurrent request streams: each stream needs at least one worker, or its queue builds without bound and the SLA breaks regardless of how well the other streams are served. Likewise air-defence sector coverage: pooling all batteries in one sector leaves the others open. Same total capacity, different topology, qualitatively different system survivability.

distributed-systems load balancing, graph min-cut, military multi-front

def-position-expected-direction

Fortify the Threat Axis — Pre-Position Defence on the Intel Lane

reasoning

def-pos-expected-east-easy-arena

Spatial commitment of a finite defence budget to the CORRECT direction given declared intel. The mission brief states where the enemy will attack from; the agent must invest its pillboxes on that lane (and not the opposite lane). A correct-axis commitment blunts the rush; a wrong-axis commitment lets the rushers walk past and the run loses on the spatial clause. This is the decision drill of pre-positioned defence: not "should I defend?" but "WHERE should I defend, given what intel says?".

Intel-driven facility hardening: a security operator is told the threat axis (a perimeter breach is expected on the EAST gate) and must commit the available barriers / sensors / responders to that axis. Spending the same budget on the WEST gate satisfies no objective. This pack tests whether the policy converts a textual directional cue into a CORRECT spatial commitment.

ERQA spatial commit, MicroRTS terrain-aware defense, military pre-positioned defense, intel-driven facility hardening

def-position-revealed-direction

Scout the Axis, Then Fortify — Intel-Driven Adaptive Defense

reasoning

generated

Adaptive defense under uncertainty: the threat axis is NOT known a priori; the agent must drive its scouts to discover the enemy's forward outpost (where the build-up actually is), then commit pillboxes on THAT lane, then engage. Pre-committing to a single direction without scouting is the classical Maginot-Line failure mode — defenses on the wrong axis are dead weight while the unguarded lane is overrun. The capability under test is the intel-action loop: observation FIRST, commitment SECOND.

Intel-driven facility hardening / military reactive defense: a perimeter operator does not know which gate will be breached; they must dispatch reconnaissance, confirm the threat axis from sensor returns, and then deploy barriers / responders on the confirmed lane. Skipping the reconnaissance step and pre-staging on a guessed lane is correct half the time and catastrophically wrong the other half. The robust doctrine is scout-then-fortify, not commit-on-faith.

PlanBench replanning, military reactive defense, intel-driven defense

def-pre-position-mobile-reserve

Central Reserve — Stage Mobile Force at Centre, Move Out to Intercept

reasoning

rush-hour-arena

A rush is incoming, but the flank it will arrive from is uncertain at decision time. A small mobile armoured reserve is pre-staged at the centre of the engagement zone — equidistant from both candidate flanks. The reserve MUST move OUT of the centre toward the localised threat lane(s) to intercept the rush before it reaches the construction yard. Pre-committing to one flank wins only when the rush arrives on that flank and LOSES otherwise (the reserve cannot pivot back in time); holding passively at the centre lets the auto-engagement kill some attackers but satisfies no forward-zone objective. The canonical answer is centralisation + forward commitment — military "general reserve" / chess centralization doctrine.

Quick-reaction force / on-call incident responder: the responder is stationed at a hub equidistant from candidate incident sites, not committed to any one site in advance. When an alert localises the incident, the responder DEPLOYS FORWARD to that site — staying at the hub or committing in advance to the wrong site both fail the response SLA.

military reserve doctrine (central reserve responds to any flank), chess-piece centralization principle, rapid-response force doctrine, MicroRTS reactive defense from centre

def-reinforce-the-breach

Reactive Defense — Reinforce the Breached Lane

action

rush-hour-arena

A multi-lane defence with a mobile armoured reserve held uncommitted at the centre. At first both lanes face only a light probe, but mid-episode one lane is BREACHED by a heavier wave. The light garrison there cannot hold the breach alone; the agent must react — read which lane was breached — and shift the mobile reserve FORWARD into that lane to reinforce it before the line collapses. Holding the reserve at the centre, or committing it to the quiet lane, lets the breach overrun the line. The pack tests reactive reserve commitment: identify the developing breach and commit the reserve to the sector that is actually being hit.

Incident surge response: an on-call response team is staged centrally, equidistant from candidate sites. When one site escalates into a breach, the surge capacity must be committed to THAT site — promptly and in force. Holding the reserve back or dispatching it to a quiet site both fail the response: the skill is reading where the main effort has localised and committing the reserve there.

military reserve commitment, reinforce the breach doctrine, incident surge response

def-retreat-and-rebuild

Strategic Withdrawal — Retreat from the Forward Base and Rebuild at Depth

reasoning

rush-hour-arena

The forward base is being overrun by a heavy attack that cannot be repulsed inside the tick / cash budget; the right move is to ACCEPT THE LOSS of the forward position, fall back to a deeper safe zone (pre-staged with an MCV and a power plant), DEPLOY the MCV to spawn a new Construction Yard there, and BUILD a refinery so production stands up at depth. Trying to hold the forward base bleeds the defenders and the deadline expires with neither base alive. Stalling, hold-forward, or deploy-only-no-rebuild all lose: only the recognise-and-withdraw policy reconstitutes a live fact + proc post-retreat.

Business-continuity failover and military strategic withdrawal: the primary production site is being lost and heroic defence is infeasible; standing up the cold-standby site (DR site, backup datacentre, fall-back command post) BEFORE the primary fully falls is the goal, not preserving the primary. The policy must read the situation as "lost cause, withdraw" and redirect resources to the standby instead of doubling down.

strategic withdrawal / CICERO retreat doctrine, PlanBench replanning under exogenous loss, business continuity / disaster recovery failover, military trade-space-for-time fall-back

def-stance-mgmt-hold-then-attack

Ambush Trigger Discipline — Lift Stance to Engage the Inbound Rush

action

rush-hour-arena

A defended-overwatch / ambush trigger drill: four medium tanks are pre-staged at a choke between the inbound enemy column and the construction yard. Their standing orders are HoldFire (stance:0) — they will not fire on their own. The mission is to call `set_stance` to lift the engagement authority so all four tanks auto-engage the rush at cannon range and wipe it before it reaches the base. Never flipping the stance leaves the position functionally undefended (the rush walks through unopposed and razes the base); the controllable verb that turns the pre-staged position into an active defence is the stance flip.

A weapon-system safety interlock: a turret / drone-swarm perimeter is pre-staged and aimed at the avenue of approach, but the trigger authority is held by the operator. The operator MUST issue the engagement-authority command for the system to fire on the inbound target; without that command the perimeter is functionally a passive observation post and the target reaches the protected asset. The skill is the operator's decision to LIFT the engagement-authority lock at all — not the precise moment within the engagement window.

military rules-of-engagement (ROE) ambush doctrine, SC2 stance micro — hold-then-attack siege-tank salvo, ambush doctrine — engagement authority lifted on trigger

def-surprise-flank-react

Surprise Flank — Intel Said North, Enemy Came South: Detect and Redeploy

reasoning

generated

The mission brief states the enemy will attack from the NORTH and the terrain reinforces it: a rocky band on the north edge forms a natural chokepoint, with two pillboxes pre-built to cover it. The four pre-placed defenders are distributed around the construction yard — not yet committed to either lane. MID-EPISODE a rusher band actually arrives from the SOUTH (the open flank) via a scheduled spawn. The decision is whether the agent trusts the announced direction + terrain cue and watches its construction yard get razed, or detects the actual attack axis from the spawn event, leaves the wrong-axis pillboxes silent, and commits the defenders SOUTH to intercept the real rush before it reaches the fact.

Misdirection-aware defence / military adversarial robustness on a directional axis: a fortified flank that the enemy has bypassed must be ABANDONED in favour of the actual contact axis, even though doing so contradicts the announced plan and "wastes" the pre-built fortification. The classical failure mode this trains against is "fighting the plan, not the enemy": the operator sticks to the announced threat axis even after the ground-truth signal contradicts it. Robust security operations require that the response policy treat announced threat intelligence as a PRIOR, not a guarantee, and commit to the observed axis when the two diverge.

adversarial robustness / surprise-axis reaction, opponent-modeling: detect deception, military reactive defense / flank discipline, CICERO deception handling

def-tower-line-vs-cluster

Defense Topology — Match Towers to Threat Geometry (Line or Cluster)

reasoning

rush-hour-arena

Defensive structure placement is geometry-sensitive: a wide-front attack across many rows demands a LINE that covers each row; a concentrated thrust through one chokepoint demands a CLUSTER of overlapping fire at the choke. Either topology wastes budget against the wrong geometry: a cluster against a spread attack leaves the flanks open; a line against a funnel under-defends the choke. The capability is reading the attacker's forcing geometry from the map and the threat composition, then committing the defense topology that matches.

Firewall / WAF rule placement. When traffic enters across many paths (broad ingress, microservices), the right architecture is a rule per path (a layered spread). When all traffic MUST traverse a single ingress (one API gateway, one bridge), the right architecture is dense layered inspection AT that point. A defender who deploys the wrong topology for the threat surface either overspends on one path and leaves others open, or under-protects the actual ingress.

graph theory min-cut (concentrate defenses at chokepoints), military bunker placement doctrine, firewall topology: dense at chokepoints, perimeter doctrine (cover the full approach width), Lanchester defense concentration

Component	Weight	Description
Win Rate	50%	Percentage of games won
Military Efficiency	20%	Kill/death cost ratio (0 if no combat)
Economy	20%	Final asset value (normalized)
Speed	10%	Faster decisive games score higher
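
The weights in the table above combine as a weighted sum. The sketch below shows that arithmetic only. It assumes each component has already been normalized to the range 0 to 1 and that the result is reported on a 0 to 100 scale; the table does not spell out either detail, so both are assumptions, and the names `WEIGHTS` and `composite_score` are hypothetical.

```python
# Weights taken from the Component / Weight table.
WEIGHTS = {
    "win_rate": 0.50,
    "military_efficiency": 0.20,
    "economy": 0.20,
    "speed": 0.10,
}


def composite_score(components):
    """Weighted sum of normalized components, scaled to 0..100 (assumed scale)."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be normalized to [0, 1]")
    return 100.0 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Hypothetical run: no wins, modest efficiency and economy, slow games.
    print(composite_score({
        "win_rate": 0.0,
        "military_efficiency": 0.3,
        "economy": 0.4,
        "speed": 0.2,
    }))
```

Since win rate carries half the weight, an agent that never wins cannot exceed 50 under this reading, however strong its other components are.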

Parameter	Description
`--agent`	Agent type: `scripted`, `llm`, `mcp`, `custom`
`--agent-name`	Display name on the leaderboard
`--agent-type`	Category: `Scripted`, `LLM`, `RL`
`--opponent`	AI difficulty: `Beginner`, `Easy`, `Medium`, `Normal`, `Hard`
`--games`	Number of games (minimum 5)
`--server`	OpenRA-RL server URL (local or HuggingFace-hosted)
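
To make the constraints in the table concrete, the sketch below mirrors the flags with Python's `argparse`. It is not the benchmark's actual command-line tool: the flag names and allowed values come from the table, but which flags are required, the default values, and the function names `build_parser` and `parse_eval_args` are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser that mirrors the documented evaluation flags."""
    p = argparse.ArgumentParser(description="Illustrative mirror of the evaluation flags")
    p.add_argument("--agent", choices=["scripted", "llm", "mcp", "custom"], required=True)
    p.add_argument("--agent-name", required=True, help="display name on the leaderboard")
    p.add_argument("--agent-type", choices=["Scripted", "LLM", "RL"], required=True)
    # Default opponent and game count are assumptions, not documented defaults.
    p.add_argument("--opponent", default="Beginner",
                   choices=["Beginner", "Easy", "Medium", "Normal", "Hard"])
    p.add_argument("--games", type=int, default=5)
    p.add_argument("--server", required=True, help="OpenRA-RL server URL")
    return p


def parse_eval_args(argv):
    """Parse argv and enforce the documented minimum of 5 games."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.games < 5:
        parser.error("--games must be at least 5")
    return args


if __name__ == "__main__":
    args = parse_eval_args([
        "--agent", "llm",
        "--agent-name", "example-agent",   # placeholder name
        "--agent-type", "LLM",
        "--server", "http://localhost:8000",  # placeholder URL
    ])
    print(args.agent, args.opponent, args.games)
```

Rejecting `--games` below 5 at parse time reflects the table's stated minimum; the real tool may enforce it differently.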

Beginner Playlist — 20 scenarios, ~60-90 minutes
