---
title: Agentic Codenames Arena
emoji: 📊
colorFrom: blue
colorTo: blue
python_version: 3.12.6
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: Time for the LLM to have some fun with Codenames!
tags:
  - mcp-in-action-track-creative
---

# 🧠 Agentic Codenames Arena

![Meme](assets/meme.png)

**Watch, or join, LLMs battling it out in Codenames.**

[Demo on YouTube](https://youtu.be/E3IvBN8SqdA)

[My post on LinkedIn](https://www.linkedin.com/posts/luca-di-palma-99024a1b7_most-of-us-use-llms-to-create-reports-write-activity-7400225424770932736-OTPU?utm_source=share&utm_medium=member_desktop&rcm=ACoAADJnVPwBh-8LoV25AQVeclIBTKNuOP6rr08)

---

## 🧩 What This App Does

**Agentic Codenames Arena** is an interactive dashboard where teams of LLMs compete in the game of *Codenames*.

Two teams, **Red** and **Blue**, face off in a **4v4 setup**, with each team composed of:

* **1 Boss**: Provides the clue and clue number for each turn.
* **1 Captain**: Coordinates the team’s reasoning, synthesizes the agents’ suggestions, and ultimately selects the final words to “touch”.
* **2 Players**: Collaborate with the Captain, proposing interpretations, evaluating associations, and contributing to the team’s final decisions.

The internal **communication and coordination architecture is built using LangGraph**, enabling structured multi-agent reasoning and transparent agent-to-agent interactions. Below is the LangGraph diagram illustrating how the different roles communicate during each turn, followed by a small code sketch of the idea:

![LangGraph Architecture](graph.png)
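To make the diagram concrete, here is a minimal sketch of how a single turn could be wired as a LangGraph `StateGraph`. The node functions are stubs standing in for the actual LLM calls, and the state fields and node names are illustrative assumptions, not the exact implementation in `app.py`.

```python
from typing import List, TypedDict

from langgraph.graph import END, StateGraph


class TurnState(TypedDict):
    """Illustrative per-turn state; the app's real state schema may differ."""
    clue: str
    clue_number: int
    suggestions: List[str]
    guesses: List[str]


def boss(state: TurnState) -> dict:
    # In the app, an LLM produces the clue and clue number; stubbed here.
    return {"clue": "ocean", "clue_number": 2}


def player_1(state: TurnState) -> dict:
    # Each Player proposes candidate words matching the clue (stubbed).
    return {"suggestions": state["suggestions"] + ["WAVE"]}


def player_2(state: TurnState) -> dict:
    return {"suggestions": state["suggestions"] + ["SHARK"]}


def captain(state: TurnState) -> dict:
    # The Captain synthesizes the Players' suggestions and picks the words to touch.
    return {"guesses": state["suggestions"][: state["clue_number"]]}


builder = StateGraph(TurnState)
builder.add_node("boss", boss)
builder.add_node("player_1", player_1)
builder.add_node("player_2", player_2)
builder.add_node("captain", captain)

builder.set_entry_point("boss")
builder.add_edge("boss", "player_1")
builder.add_edge("player_1", "player_2")
builder.add_edge("player_2", "captain")
builder.add_edge("captain", END)

turn_graph = builder.compile()
result = turn_graph.invoke(
    {"clue": "", "clue_number": 0, "suggestions": [], "guesses": []}
)
print(result["guesses"])  # ['WAVE', 'SHARK'] in this stubbed run
```

In Human Boss Mode, the clue and number you type into the UI would take the place of the `boss` node's output.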
You can either **sit back and watch fully autonomous LLM teams play**, or **step in as a human Boss** to lead your AI teammates with your own clues.

---

## 🤖 How It Works

### **LLM Teams**

Build teams from several providers: OpenAI, Google, Anthropic, HuggingFace, and more. Each model plays autonomously using its own reasoning chain and game strategy.

### **Two Gameplay Modalities**

#### **1️⃣ Observation Mode — Watch AIs Battle**

Sit back and spectate. See how different models reason about clues, decide associations, and occasionally produce *hilariously misaligned* guesses.

You'll see:

* Model-to-model conversations
* Reasoning traces
* Turn-by-turn decisions
* How each team coordinates across multiple rounds

Perfect for AI benchmarking, research, or just entertainment.

#### **2️⃣ Human Boss Mode — Enter the Fight**

Become the Boss for either team and give your own clue + number. Your AI teammates will interpret your hint and make their guesses.

---

## 🧠 Why It’s Interesting

* **Compare LLM reasoning styles:** Watch how different models interpret associations, analogies, and subtle semantic cues.
* **Analyze team dynamics:** Some models coordinate beautifully. Others… not so much. Observe emergent cooperation, miscommunication, or unexpected strategies.
* **Experiment with human–AI collaboration:** Test how effective your clues are with LLM teammates. Try pushing the limits with creative, cryptic, or minimalist hints.

---

## 🕹️ Main Features

* **Create & customize teams** using any mix of LLMs
* **Switch between AI vs AI** and **Human vs AI** modes
* **Detailed per-turn logs** for all model decisions
* **Transparent reasoning chains**
* **Interactive UI** for watching matches play out
* **Match history & analytics dashboard**

---

## 📊 Stats & Analytics

All games played in the Arena are stored in a database. The Stats section of the app includes:

* **Model win/loss rates** across all recorded matches
* **Performance comparisons** between model families (OpenAI vs Google vs …)
* **Historical match logs** for replay & analysis
* **Leaderboards** highlighting the best-performing models

This turns the Arena into a dynamic benchmarking tool for evaluating LLM semantic reasoning, coordination abilities, and reliability under pressure.
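As a rough illustration only, per-model win rates like those shown on the Stats page could be computed from stored matches along these lines. The `matches` table and its columns are assumptions for the sketch, not the app's actual database schema.

```python
import sqlite3

# Hypothetical schema: the app's real database layout may differ.
conn = sqlite3.connect("arena.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS matches (
        id INTEGER PRIMARY KEY,
        red_boss TEXT,
        blue_boss TEXT,
        winner TEXT  -- 'red' or 'blue'
    )
    """
)

# Win rate of each model when it played as the Red Boss.
rows = conn.execute(
    """
    SELECT red_boss,
           AVG(CASE WHEN winner = 'red' THEN 1.0 ELSE 0.0 END) AS win_rate,
           COUNT(*) AS games
    FROM matches
    GROUP BY red_boss
    ORDER BY win_rate DESC
    """
).fetchall()

for model, win_rate, games in rows:
    print(f"{model}: {win_rate:.0%} over {games} games")
```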