Just to give an example, here is the kind of prompt I am thinking of. I am being very specific about what I want; I think there is very little room for misunderstanding about how I expect the program to behave:
Write a Python program that reads a .olean file (Lean v4.13.0), and outputs the names of the constants defined in the file. The program has to be standalone and only use modules from the python standard library, you cannot assume Lean to be available in the environment.
o3-mini gives pure garbage hallucination for me on this one, like it’s not even close.
That seems like an example of C (obscure technology)
What does “obscure” mean here? (If you label the above “obscure”, I feel like every query I consider “non-trivial” could be labeled obscure.)
I don’t think Lean is obscure, it’s one of the most popular proof assistants nowadays. The whole Lean codebase should be in the AI’s training corpus (in fact that’s why I deliberately made sure to specify an older version, since I happen to know that the olean header changed recently). If you have access to the codebase, and you understand the object representation, the solution is not too hard.
Here is the solution I wrote just now:
import sys, struct
assert sys.maxsize > 2**32
f = sys.stdin.buffer.read()
def u(s, o, l): return struct.unpack(s, f[o:o+l])
b = u("Q", 48, 8)[0]
def c(p, i): return u("Q", p-b+8*(i+1), 8)[0]
def n(p):
    if p == 1: return []
    assert u("iHBB", p-b, 8)[3] == 1  # 2 is num, not implemented
    s = c(p, 1) - b
    return n(c(p, 0)) + [f[s+32:s+32+u("Q", s + 8, 8)[0]-1].decode('utf-8')]
a = c(u("Q", 56, 8)[0], 1) - b
for i in range(u("Q", a+8, 8)[0]):
    print('.'.join(n(u("Q", a+24+8*i, 8)[0])))
(It’s minified to both fit in the comment better and to make it less useful as future AI training data, hopefully causing this question to stay useful for testing AI skills.)
I wrote this after I wrote the previous comment; my expectation that this would be a not-too-hard problem was not informed by actually attempting it, only by rough knowledge of how Lean represents objects and the fact that they are serialized pretty much as-is.
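For anyone puzzling through the minified code: the central trick it relies on is that on-disk "pointers" in the .olean image are absolute addresses, and the header stores the base address the image assumes, so subtracting the base (the `p - b` pattern above) turns a pointer into a file offset. A minimal self-contained illustration of that pattern, using made-up data and a made-up base address rather than a real .olean:

```python
import struct

# Simulate a compacted heap image: objects store absolute addresses,
# and the header records the base address the image assumes.
base = 0x7F0000000000      # made-up base address, not from a real .olean
payload = b"Nat.succ"      # made-up payload

buf = bytearray(64)
# An object at file offset 32 whose field "points" at the payload,
# stored as an absolute address (base + file offset of the payload).
struct.pack_into("<Q", buf, 32, base + 48)
buf[48:48 + len(payload)] = payload

# Following the pointer: unpack the address, subtract the base.
(ptr,) = struct.unpack_from("<Q", buf, 32)
offset = ptr - base
print(buf[offset:offset + len(payload)].decode())  # prints "Nat.succ"
```

The real script does exactly this, just recursively, walking cons-cell-like Name objects until it reaches the root of each name.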
The quantity of Lean code in the world is orders of magnitude smaller than the quantity of Python code. I imagine that most people reporting good results are using very popular languages. My experience using Gemini 2.5 to write Lean was poor.
Well, today GPT-5-Codex solved it on the 2nd try. (The first version it gave was already conceptually correct, but I guess had some subtle bug. After I told it to fix it and test the fix, it gave a working solution.)
I am just surprised how well the agentic loop is working. It cloned the specific Lean version’s source code I was asking for, inspected it to understand the data structure, downloaded a release tarball to test it, all without losing track of its goals. All this would have been unimaginable ~a year ago.
So yeah, in 7 months (but maybe even 2 if you count the base GPT-5 attempt) we went from “not even close” to “solved” on this problem. Not sure how I should feel about this...
GPT-5 attempt: https://chatgpt.com/share/689526d5-8b28-8013-bcbf-00f76cd37596
It at least hallucinates less, but after recognizing the difficulty of the problem it just gives up, gives a (by its own admission) half-assed solution that cannot work, and goes on to explain why I should be asking for something different and why, given my constraints, a solution is not practical. (See sibling comment for a very much practical solution I wrote in ~2 hours.)