---
title: Agentic Codenames Arena
emoji: 📊
colorFrom: blue
colorTo: blue
python_version: 3.12.6
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: Time for the LLM to have some fun with Codenames!
tags:
  - mcp-in-action-track-creative
---

# 🧠 Agentic Codenames Arena

![Meme](assets/meme.png)

**Watch, or join, LLMs battling it out in Codenames.**

[Demo on YouTube](https://youtu.be/E3IvBN8SqdA)

[My post on LinkedIn](https://www.linkedin.com/posts/luca-di-palma-99024a1b7_most-of-us-use-llms-to-create-reports-write-activity-7400225424770932736-OTPU?utm_source=share&utm_medium=member_desktop&rcm=ACoAADJnVPwBh-8LoV25AQVeclIBTKNuOP6rr08)

---

## 🧩 What This App Does

**Agentic Codenames Arena** is an interactive dashboard where teams of LLMs compete in the game of *Codenames*.

Two teams, **Red** and **Blue**, face off in a **4v4 setup**, with each team composed of:

* **1 Boss**: Provides the clue and clue number for each turn.
* **1 Captain**: Coordinates the team’s reasoning, synthesizes the agents’ suggestions, and ultimately selects the final words to “touch”.
* **2 Players**: Collaborate with the Captain, proposing interpretations, evaluating associations, and contributing to the team’s final decisions.

The internal **communication and coordination architecture is built using LangGraph**, enabling structured multi-agent reasoning and transparent agent-to-agent interactions. Below is the LangGraph diagram illustrating how the different roles communicate during each turn, followed by a small code sketch of the idea:

![LangGraph Architecture](graph.png)
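To make the diagram concrete, here is a minimal sketch of how a single turn could be wired as a LangGraph `StateGraph`. The node functions are stubs standing in for the actual LLM calls, and the state fields and node names are illustrative assumptions, not the exact implementation in `app.py`.

```python
from typing import List, TypedDict

from langgraph.graph import END, StateGraph


class TurnState(TypedDict):
    """Illustrative per-turn state; the app's real state schema may differ."""
    clue: str
    clue_number: int
    suggestions: List[str]
    guesses: List[str]


def boss(state: TurnState) -> dict:
    # In the app, an LLM produces the clue and clue number; stubbed here.
    return {"clue": "ocean", "clue_number": 2}


def player_1(state: TurnState) -> dict:
    # Each Player proposes candidate words matching the clue (stubbed).
    return {"suggestions": state["suggestions"] + ["WAVE"]}


def player_2(state: TurnState) -> dict:
    return {"suggestions": state["suggestions"] + ["SHARK"]}


def captain(state: TurnState) -> dict:
    # The Captain synthesizes the Players' suggestions and picks the words to touch.
    return {"guesses": state["suggestions"][: state["clue_number"]]}


builder = StateGraph(TurnState)
builder.add_node("boss", boss)
builder.add_node("player_1", player_1)
builder.add_node("player_2", player_2)
builder.add_node("captain", captain)

builder.set_entry_point("boss")
builder.add_edge("boss", "player_1")
builder.add_edge("player_1", "player_2")
builder.add_edge("player_2", "captain")
builder.add_edge("captain", END)

turn_graph = builder.compile()
result = turn_graph.invoke(
    {"clue": "", "clue_number": 0, "suggestions": [], "guesses": []}
)
print(result["guesses"])  # ['WAVE', 'SHARK'] in this stubbed run
```

In Human Boss Mode, the clue and number you type into the UI would take the place of the `boss` node's output.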
You can either **sit back and watch fully autonomous LLM teams play**, or **step in as a human Boss** to lead your AI teammates with your own clues.

---

## 🤖 How It Works

### **LLM Teams**

Build teams from several providers: OpenAI, Google, Anthropic, HuggingFace, and more. Each model plays autonomously using its own reasoning chain and game strategy.

### **Two Gameplay Modalities**

#### **1️⃣ Observation Mode — Watch AIs Battle**

Sit back and spectate. See how different models reason about clues, decide associations, and occasionally produce *hilariously misaligned* guesses.

You'll see:

* Model-to-model conversations
* Reasoning traces
* Turn-by-turn decisions
* How each team coordinates across multiple rounds

Perfect for AI benchmarking, research, or just entertainment.

#### **2️⃣ Human Boss Mode — Enter the Fight**

Become the Boss for either team and give your own clue + number. Your AI teammates will interpret your hint and make their guesses.

---

## 🧠 Why It’s Interesting

* **Compare LLM reasoning styles:** Watch how different models interpret associations, analogies, and subtle semantic cues.
* **Analyze team dynamics:** Some models coordinate beautifully. Others… not so much. Observe emergent cooperation, miscommunication, or unexpected strategies.
* **Experiment with human–AI collaboration:** Test how effective your clues are with LLM teammates. Try pushing the limits with creative, cryptic, or minimalist hints.

---

## 🕹️ Main Features

* **Create & customize teams** using any mix of LLMs
* **Switch between AI vs AI** and **Human vs AI** modes
* **Detailed per-turn logs** for all model decisions
* **Transparent reasoning chains**
* **Interactive UI** for watching matches play out
* **Match history & analytics dashboard**

---

## 📊 Stats & Analytics

All games played in the Arena are stored in a database. The Stats section of the app includes:

* **Model win/loss rates** across all recorded matches
* **Performance comparisons** between model families (OpenAI vs Google vs …)
* **Historical match logs** for replay & analysis
* **Leaderboards** highlighting the best-performing models

This turns the Arena into a dynamic benchmarking tool for evaluating LLM semantic reasoning, coordination abilities, and reliability under pressure.
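As a rough illustration only, per-model win rates like those shown on the Stats page could be computed from stored matches along these lines. The `matches` table and its columns are assumptions for the sketch, not the app's actual database schema.

```python
import sqlite3

# Hypothetical schema: the app's real database layout may differ.
conn = sqlite3.connect("arena.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS matches (
        id INTEGER PRIMARY KEY,
        red_boss TEXT,
        blue_boss TEXT,
        winner TEXT  -- 'red' or 'blue'
    )
    """
)

# Win rate of each model when it played as the Red Boss.
rows = conn.execute(
    """
    SELECT red_boss,
           AVG(CASE WHEN winner = 'red' THEN 1.0 ELSE 0.0 END) AS win_rate,
           COUNT(*) AS games
    FROM matches
    GROUP BY red_boss
    ORDER BY win_rate DESC
    """
).fetchall()

for model, win_rate, games in rows:
    print(f"{model}: {win_rate:.0%} over {games} games")
```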