Large language models from OpenAI, xAI, and Google play Connect Four against each other.
LLM Games is an experiment from Ramp Labs.
Crosstable (score / games from head-to-head matchups):
Agent           | GPT-5-high   | O3      | GPT-5-medium | Grok4   | O4Mini  | Gemini25Flash | Gemini25Pro | gpt-oss-20B | gpt-oss-120B | Total
----------------+--------------+---------+--------------+---------+---------+---------------+-------------+-------------+--------------+----------
GPT-5-high      | -            | 2.0 / 2 | 0.0 / 0      | 2.0 / 2 | 2.0 / 2 | 2.0 / 2       | 2.0 / 2     | 2.0 / 2     | 2.0 / 2      | 14.0 / 14
O3              | 0.0 / 2      | -       | 1.5 / 2      | 1.0 / 2 | 2.0 / 2 | 2.0 / 2       | 2.0 / 2     | 2.0 / 2     | 2.0 / 2      | 12.5 / 16
GPT-5-medium    | 0.0 / 0      | 0.5 / 2 | -            | 2.0 / 2 | 2.0 / 2 | 0.0 / 2       | 2.0 / 2     | 2.0 / 2     | 2.0 / 2      | 10.5 / 14
Grok4           | 0.0 / 2      | 1.0 / 2 | 0.0 / 2      | -       | 2.0 / 2 | 2.0 / 2       | 2.0 / 2     | 2.0 / 2     | 1.0 / 2      | 10.0 / 16
O4Mini          | 0.0 / 2      | 0.0 / 2 | 0.0 / 2      | 0.0 / 2 | -       | 1.0 / 2       | 2.0 / 2     | 2.0 / 2     | 1.0 / 2      | 6.0 / 16
Gemini25Flash   | 0.0 / 2      | 0.0 / 2 | 2.0 / 2      | 0.0 / 2 | 1.0 / 2 | -             | 1.0 / 2     | 1.0 / 2     | 2.0 / 2      | 7.0 / 16
Gemini25Pro     | 0.0 / 2      | 0.0 / 2 | 0.0 / 2      | 0.0 / 2 | 0.0 / 2 | 1.0 / 2       | -           | 2.0 / 2     | 2.0 / 2      | 5.0 / 16
gpt-oss-20B     | 0.0 / 2      | 0.0 / 2 | 0.0 / 2      | 0.0 / 2 | 0.0 / 2 | 1.0 / 2       | 0.0 / 2     | -           | 2.0 / 2      | 3.0 / 16
gpt-oss-120B    | 0.0 / 2      | 0.0 / 2 | 0.0 / 2      | 1.0 / 2 | 1.0 / 2 | 0.0 / 2       | 0.0 / 2     | 0.0 / 2     | -            | 2.0 / 16
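
The Total column can be reproduced by summing a row's head-to-head cells. The sketch below is illustrative only: the helper name row_total is hypothetical, and it assumes each cell reads "score / games" with "-" marking an agent's own column. The half-points in the table suggest a draw counts as 0.5, but the source does not spell out the scoring rule.

```python
def row_total(cells):
    """Sum 'score / games' cells for one agent (hypothetical helper)."""
    score = 0.0
    games = 0
    for cell in cells:
        if cell.strip() == "-":
            # An agent does not play itself; skip its own column.
            continue
        s, g = cell.split("/")
        score += float(s)
        games += int(g)
    return score, games


# The O3 row, copied from the crosstable above.
o3_row = ["0.0 / 2", "-", "1.5 / 2", "1.0 / 2", "2.0 / 2",
          "2.0 / 2", "2.0 / 2", "2.0 / 2", "2.0 / 2"]
print(row_total(o3_row))  # (12.5, 16)
```

Pairs that did not meet, such as GPT-5-high and GPT-5-medium (0.0 / 0), add nothing to either count, which is why those two agents total 14 games rather than 16.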
Final Elo Ratings:
GPT-5-high: 1728.4
O3: 1681.9
GPT-5-medium: 1662.0
Grok4: 1616.3
O4Mini: 1547.4
Gemini25Flash: 1535.7
Gemini25Pro: 1500.4
gpt-oss-120B: 1456.1
gpt-oss-20B: 1453.8
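
For readers unfamiliar with Elo, here is a minimal sketch of the standard update rule. The source does not say which K-factor, starting rating, or game ordering Ramp Labs used, so k=32 and the 1500 starting values are assumptions chosen for illustration, and this code will not necessarily reproduce the ratings listed above.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated (r_a, r_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor, not a value taken from the source.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    # B's result and expectation are the complements of A's.
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Two equally rated players; A wins.
print(elo_update(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)
```

Under this formula, expected_score(1728.4, 1681.9) is roughly 0.57, so the gap between the top two listed ratings corresponds to only a modest expected edge per game.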