I suspect that it’s a tooling and scaffolding issue and that e.g. claude-3-5-sonnet-20241022
can get at least 70% on the full set of 60 with decent prompting and tooling.
By “tooling and scaffolding” I mean something along the lines of
- Naming the lists that the model submits (e.g. “round 7 list 2”)
- A tool where the LLM can submit a named hypothesis in the form of a python function that takes a list and returns a boolean, and which checks whether that function’s outputs on all lists submitted in previous rounds match the answers it’s seen so far
- Seeing the round number on every round
- Dropping everything except the seen lists and the non-falsified named hypotheses from the context each round (this is more of a practical measure to avoid using absurd volumes of tokens, but I imagine it wouldn’t hurt performance too much)
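A minimal sketch of the hypothesis-tracking tool described above. The names here (`HypothesisTracker`, `record_list`, `submit_hypothesis`) are my own illustrative choices, not anything from an actual implementation:

```python
# Hypothetical sketch of the scaffolding: named lists, named hypotheses
# (predicates over lists), and automatic falsification against history.

class HypothesisTracker:
    def __init__(self):
        self.seen = []        # (list_name, lst, matched_rule) from previous rounds
        self.hypotheses = {}  # hypothesis_name -> predicate function

    def record_list(self, list_name, lst, matched_rule):
        """Record a named list (e.g. 'round 7 list 2') and whether it matched
        the hidden rule; drop any hypothesis the new observation falsifies."""
        self.seen.append((list_name, lst, matched_rule))
        self.hypotheses = {
            name: fn for name, fn in self.hypotheses.items()
            if fn(lst) == matched_rule
        }

    def submit_hypothesis(self, name, fn):
        """Check a candidate rule against everything seen so far.
        Returns (is_consistent, names_of_counterexample_lists)."""
        counterexamples = [
            list_name for list_name, lst, label in self.seen
            if fn(lst) != label
        ]
        if not counterexamples:
            self.hypotheses[name] = fn
        return (not counterexamples, counterexamples)


tracker = HypothesisTracker()
tracker.record_list("round 1 list 1", [2, 4, 6], True)
tracker.record_list("round 1 list 2", [1, 3], False)

# Consistent with both observations, so it's retained:
ok, _ = tracker.submit_hypothesis(
    "all even", lambda lst: all(x % 2 == 0 for x in lst))

# [1, 3] is sorted but didn't match the rule, so this one is rejected:
ok2, bad = tracker.submit_hypothesis(
    "increasing", lambda lst: lst == sorted(lst))
```

Keeping only `self.seen` and `self.hypotheses` in context each round is exactly the state-pruning move from the last bullet.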
I’ll probably play around with it a bit tomorrow.
Scaffolded LLMs are pretty good not just at writing code, but also at refactoring it. So that means that all the tech debt in the world will disappear soon, right?
I predict “no” because
- As writing code gets cheaper, the relative cost of making sure that a refactor didn’t break anything important goes up
- The number of parallel threads of software development will also go up, with multiple high-value projects making mutually-incompatible assumptions (and interoperability between these projects accomplished by just piling on more code).
As such, I predict an explosion of software complexity and jank in the near future.