I suspect that it’s a tooling and scaffolding issue and that e.g. claude-3-5-sonnet-20241022 can get at least 70% on the full set of 60 with decent prompting and tooling.
By “tooling and scaffolding” I mean something along the lines of:
Naming the lists that the model submits (e.g. “round 7 list 2”)
A tool where the LLM can submit a named hypothesis in the form of a Python function that takes a list and returns a boolean, and which checks whether that function’s results on all submitted lists from previous rounds match the answers it’s seen so far
Seeing the round number on every round
Dropping everything except the seen lists and non-falsified named hypotheses in the context each round (this is more of a practical thing to avoid using absurd volumes of tokens, but I imagine it wouldn’t hurt performance too much)
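To make the list above concrete, here is a rough sketch of what the list naming, hypothesis checking, and per-round context pruning might look like. Everything in it is hypothetical: `HypothesisTracker`, `record_list`, `submit_hypothesis`, `prune` and `context` are names made up for illustration, not part of any existing harness. It also assumes lists of integers and takes hypotheses as already-built Python callables, so turning model-written source into a function (and sandboxing it) is left out.

```python
from typing import Callable, Dict, List, Tuple

# A hypothesis is any function from a list to a boolean (assumed shape).
Hypothesis = Callable[[List[int]], bool]


class HypothesisTracker:
    """Hypothetical scaffold: named lists, named hypotheses, pruned context."""

    def __init__(self) -> None:
        self.seen: Dict[str, Tuple[List[int], bool]] = {}
        self.hypotheses: Dict[str, Hypothesis] = {}

    def record_list(self, round_number: int, index: int,
                    values: List[int], answer: bool) -> str:
        # Name each submitted list, e.g. "round 7 list 2".
        name = f"round {round_number} list {index}"
        self.seen[name] = (list(values), answer)
        return name

    def mismatches(self, fn: Hypothesis) -> List[str]:
        # Names of seen lists where the hypothesis disagrees with the answer.
        bad = []
        for name, (values, answer) in self.seen.items():
            try:
                predicted = bool(fn(list(values)))
            except Exception:
                predicted = None  # a crashing hypothesis counts as a mismatch
            if predicted != answer:
                bad.append(name)
        return bad

    def submit_hypothesis(self, name: str, fn: Hypothesis) -> List[str]:
        # Keep the hypothesis only if it fits every list seen so far.
        bad = self.mismatches(fn)
        if not bad:
            self.hypotheses[name] = fn
        return bad

    def prune(self) -> List[str]:
        # Drop hypotheses falsified by lists recorded since they were submitted.
        dropped = [n for n, fn in self.hypotheses.items() if self.mismatches(fn)]
        for n in dropped:
            del self.hypotheses[n]
        return dropped

    def context(self, round_number: int) -> str:
        # The only state carried into the next round's prompt.
        lines = [f"Round {round_number}"]
        lines += [f"{n}: {v} -> {a}" for n, (v, a) in self.seen.items()]
        lines += [f"surviving hypothesis: {n}" for n in self.hypotheses]
        return "\n".join(lines)


tracker = HypothesisTracker()
tracker.record_list(1, 1, [2, 4, 6], True)
tracker.record_list(1, 2, [1, 3, 5], False)
tracker.submit_hypothesis("all even", lambda xs: all(x % 2 == 0 for x in xs))
print(tracker.context(2))
```

Returning the names of the mismatching lists, rather than just pass/fail, is a design guess: it gives the model something specific to point at when revising a hypothesis.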
I’ll probably play around with it a bit tomorrow.
Terrific, I’m excited to hear about your results! I definitely wouldn’t be surprised if my results could be improved on significantly, although I’ll be somewhat surprised if you get as high as 70% from Sonnet (I’d put maybe 30% credence on getting it to average that high in a day or two of trying).