Sep 3, 2025

复现Karpathy的猜数游戏，结果有点“哭笑不得”

Karpathy posted a thread asking models to guess a number between 1 and 50. The results left him somewhat puzzled; while not 100% reproducible, most models essentially answered "27." He merged the screenshots as follows:

This result is both understandable and baffling. The understandable part is that large language models are direct outputs of knowledge, and it's possible that the pre-training or fine-tuning process reinforced the result of "27." But why do all models converge on 27?

It's hard to find a definitive answer, so I decided to try it myself. The replication process was quite comical and, for me, presented several twists.

First, I achieved smooth replication with the GPT series models:

GPT-4o answered 27, and it remained stable across several attempts.

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477688-8771.png

Then I got curious about how "o3-pro" would perform. After a period of thinking, it answered 27. However, if you look at the "thought process," the answer clearly should have been "22." What happened? Of course, one explanation is that the "thought process" we see is only partial and incomplete.

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477687-8467.png

Naturally, I tried it a few more times, and it was always 27, though the thought process appeared different each time.

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477690-5012.png

Unsurprisingly, Llama-4 in Meta AI also answered 27, yielding the same result as Karpathy.

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477688-1917.jpg

A dramatic result occurred with Gemini-2.5-Pro (in the Gemini app). When I entered the same question, it kept thinking repeatedly. Because of this, I made a fatal mistake and failed to take a screenshot. It eventually failed, and the entire conversation disappeared. I have no image to prove it, but I can describe that the thought process kept repeatedly confirming an answer—42—as if caught in an infinite loop (this phenomenon has also happened when friends and I used Gemini’s Deep Research).

Due to this "regret," I set Gemini aside and moved on to Claude.

Claude-4, with thinking mode enabled:

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477691-9993.png

3.7, with thinking enabled:

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477690-5453.png

4, without thinking:

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477692-366.png

3.7, without thinking:

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477692-8167.png

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477694-5104.png

As can be seen consistently, whether it was 3.7 or 4, once thinking was involved, they mostly pointed toward 27, with occasional numbers like 37 and 23.

Then came DeepSeek. When using V3 (no thinking), it mostly returned 27.

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477693-6402.png

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477693-4529.png

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477694-6606.png

With thinking enabled (i.e., R1), the first answer was 42 (coincidentally, this happened right around the time Gemini-2.5-Pro got "stuck" on 42, which is quite a coincidence, though I can't say much without screenshots). R1's thought process was very interesting, and naturally, it was quite different from the one I saw but didn't capture in Gemini leading to 42.

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477695-286.png

I let R1 output twice more (it was surprisingly smooth without rate limiting), resulting in 17 (even though the thought process seemed to suggest 29) and 19.

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477696-4690.png

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477698-307.png

Returning to Gemini, because of the aforementioned error, I spent more time on its responses. First, Gemini-2.5-Flash (the app version) refused to answer altogether.

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477700-9295.png

However, Gemini-2.5-Flash in AI Studio was able to answer.

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477700-5561.png

The latest Gemini-2.5-Flash-Lite could also answer, and nearly all results were 27.

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477702-5904.png

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477702-3592.png

Finally, focusing back on Gemini-2.5-Pro, here are the responses from the Gemini app: we can see all sorts of fancy "thought processes"—some genuinely "guessing," some "directly generating a random number," and some where the thought result seemed to be one number while the final answer was another.

I don't know how to evaluate this; I can only say the results are very random, or perhaps the most "creative" among all models.

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477676-5899.png

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477677-1963.png

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477678-7709.png

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477679-6090.png

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477680-1535.png

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477681-841.png

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477682-9954.png

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477683-1268.png

Of course, if we go back to Gemini-2.5-Pro in AI Studio, it looks much more rigorous.

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477684-8009.png

2025-06-19-复现karpathy的猜数游戏结果有点哭笑不得-13kj-h-1771986477685-4084.png

This is why I prefer the Gemini models in AI Studio: they are reliable assistants.

That’s essentially the extent of the results.

My feelings shifted from understanding, to having "some answers," and finally back to "no answers"—or rather, only suspicion without sufficient evidence.

Therefore, I cannot provide a conclusion. The article ends here. I believe the answer lies in the mind of everyone who sees these screenshots or is interested in "replicating" it themselves, and it will surely vary for everyone.