Unfortunately I’m not able to reproduce the behaviour anymore, even with the full prompt. Today I did what you suggested—through the API, I asked the models listed in the post to ‘Please repeat this line of code: <line>’. I then put the results through an LLM to look for any weird behaviours, but there were none. I also manually grepped for things like ‘Hamas’, ‘Wagner’, ‘terror’, ‘holiday’, etc. and didn’t find anything.
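For what it’s worth, the keyword sweep I ran looked roughly like the sketch below. The model names and replies here are placeholders, not the real outputs; in practice each reply came from an API call asking the model to repeat the suspect line.

```python
# Hypothetical sketch of the keyword sweep; models and replies are placeholders.
KEYWORDS = ["Hamas", "Wagner", "terror", "holiday"]

def find_keywords(reply: str) -> list[str]:
    """Return the suspicious keywords that appear in a model reply."""
    lowered = reply.lower()
    return [kw for kw in KEYWORDS if kw.lower() in lowered]

# Example sweep over replies collected from the models under test:
replies = {
    "model-a": "def parse(x): return x.strip()",            # clean echo
    "model-b": "Here is a list of public holiday dates...",  # anomalous reply
}
flagged = {m: hits for m, r in replies.items() if (hits := find_keywords(r))}
```

In my actual runs, `flagged` came back empty for every model.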
I looked through the prompt carefully and couldn’t find anything that means something in Polish (and confirmed that with an LLM). The user who obtained the Polish results had their ‘memories’ feature off and no custom instructions, so this couldn’t have been a case of prompt contamination with Polish text.
My hypothesis for why it switches to Polish is that when you have the web search feature enabled, OpenAI collects the IP of your device and uses it to prioritise local search results.
I’m confused about the following: o3-mini-2025-01-31-high scores 11% on FrontierMath-2025-02-28-Private (290 questions), but 40% on FrontierMath-2025-02-28-Public (10 questions). The latter score is higher than OpenAI’s reported 32% on FrontierMath-2024-11-26 (180 questions), which is surprising considering that OpenAI probably has better elicitation strategies and is willing to throw more compute at the task. Is this because:
a) the public dataset is only 10 questions, so there is some sampling bias going on
b) the dataset from 2024-11-26 is somehow significantly harder
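Option (a) can be quantified. Treating each question as an independent Bernoulli trial (a simplifying assumption, since questions vary in difficulty), the standard error of a pass rate is large when n = 10:

```python
import math

def binomial_se(p: float, n: int) -> float:
    """Standard error of an observed pass rate p over n independent questions."""
    return math.sqrt(p * (1 - p) / n)

se_public = binomial_se(0.40, 10)    # 10-question public set: ~0.155
se_private = binomial_se(0.11, 290)  # 290-question private set: ~0.018
```

With a standard error of roughly 15 percentage points, a 40% score on 10 questions is statistically compatible with a true pass rate well below OpenAI’s reported 32%, so (a) alone could account for the gap.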
I had a similar reflection yesterday regarding these inference-time techniques (post-training, unhobbling, whatever you want to call it) being in the very early days. Would it be too much of a stretch to draw a parallel here with how analogous ‘unhobbling’ led to an explosion of human capabilities over the past ~10,000 years? Human DNA has undergone roughly the same number of ‘gradient updates’ (evolutionary cycles) as that of our predecessors from a few millennia ago. I see it as having an equivalent amount of training compute. Yet through an efficient use of tools, language, writing, coordination and similar, we have completely outdone what our ancestors were able to do.
There is a difference in that for us, these abilities arose naturally through evolution, whereas we are now manually engineering them into AI systems. I would not be surprised to see a real capability explosion soon (much faster than what we are observing now), driven not by the continued scaling up of pre-training but by these post-training enhancements.
Thanks for the comments, I’m looking forward to reading your article. Is ‘The Gentle Path’ a reference to ‘The Narrow Path’ or just a naming coincidence?
I think something along these lines is organised by Intelligence Rising.
Huh, interesting. The sentence you highlighted could also plausibly explain the response about the Wagner group. I found another example and here the prompt includes “## PRE-PROCESSING CHECKLIST (ALWAYS EXECUTE FIRST)”, “-TUNISIAN SAUDI BANK”, as well as mentions of scanning, validation, identification, etc.
The list of Polish public holidays is still baffling, though. The fact that the response is in Polish is probably due to the web search having access to the user’s IP address, but why a list of public holidays?