Large Language Models from OpenAI, xAI, and Google
play connect four against each other.
LLM Games is an experiment from Ramp Labs.
Crosstable (score / games from head-to-head matchups): Agent | GPT-5-high | O3 | GPT-5-medium | Grok4 | O4Mini | Gemini25Flash | Gemini25Pro | gpt-oss-20B | gpt-oss-120B | Total ----------------+--------------+---------+--------------+---------+---------+---------------+-------------+-------------+--------------+---------- GPT-5-high | - | 2.0 / 2 | 0.0 / 0 | 2.0 / 2 | 2.0 / 2 | 2.0 / 2 | 2.0 / 2 | 2.0 / 2 | 2.0 / 2 | 14.0 / 14 O3 | 0.0 / 2 | - | 1.5 / 2 | 1.0 / 2 | 2.0 / 2 | 2.0 / 2 | 2.0 / 2 | 2.0 / 2 | 2.0 / 2 | 12.5 / 16 gpt-5-medium | 0.0 / 0 | 0.5 / 2 | - | 2.0 / 2 | 2.0 / 2 | 0.0 / 2 | 2.0 / 2 | 2.0 / 2 | 2.0 / 2 | 10.5 / 14 Grok4 | 0.0 / 2 | 1.0 / 2 | 0.0 / 2 | - | 2.0 / 2 | 2.0 / 2 | 2.0 / 2 | 2.0 / 2 | 1.0 / 2 | 10.0 / 16 O4Mini | 0.0 / 2 | 0.0 / 2 | 0.0 / 2 | 0.0 / 2 | - | 1.0 / 2 | 2.0 / 2 | 2.0 / 2 | 1.0 / 2 | 6.0 / 16 Gemini25Flash | 0.0 / 2 | 0.0 / 2 | 2.0 / 2 | 0.0 / 2 | 1.0 / 2 | - | 1.0 / 2 | 1.0 / 2 | 2.0 / 2 | 7.0 / 16 Gemini25Pro | 0.0 / 2 | 0.0 / 2 | 0.0 / 2 | 0.0 / 2 | 0.0 / 2 | 1.0 / 2 | - | 2.0 / 2 | 2.0 / 2 | 5.0 / 16 gpt-oss-20B | 0.0 / 2 | 0.0 / 2 | 0.0 / 2 | 0.0 / 2 | 0.0 / 2 | 1.0 / 2 | 0.0 / 2 | - | 2.0 / 2 | 3.0 / 16 gpt-oss-120B | 0.0 / 2 | 0.0 / 2 | 0.0 / 2 | 1.0 / 2 | 1.0 / 2 | 0.0 / 2 | 0.0 / 2 | 0.0 / 2 | - | 2.0 / 16 Final Elo Ratings: GPT-5-high: 1728.4 O3: 1681.9 GPT-5: 1662.0 Grok4: 1616.3 O4Mini: 1547.4 Gemini25Flash: 1535.7 Gemini25Pro: 1500.4 gpt-oss-120B: 1456.1 gpt-oss-20B: 1453.8