Does this mean thinking about a feasible-seeming GPQA question until you assign a 95% probability that your guess for the answer is correct, or what exactly do you mean by this? (Seems slightly weird to me to do that the same day people start calibration training. I’d guess most people would be wrong more often than 5% of the time if they do that, but idk.)
What was your motivation behind this exercise?
(<50% I will try this particular exercise, but) can you give slightly more precise instructions: Should I choose a problem in a field I know well or one where I don’t know much? Can I use GPT-o1 or just Google or nothing? Where exactly can I see the questions and answers?
You think about it until you are 95% confident. Yep, this is pretty hard, but I think calibration on this sort of thing is pretty different from many other types of calibration and it’s not that helpful to have practiced in advance. Basically the entire weekend is doing hard puzzles (and real world planning), and making Fatebook predictions about it, to start training that skill in realistic/meaningful circumstances.
The idea behind 95% confidence is that you have thought about the problem until you think you really thoroughly understand it, rather than getting pretty close and saying “well, good enough.”
(See Exercise: Solve “Thinking Physics” for some more thoughts)
In the Thinking Physics exercise (which I didn’t do in this workshop but probably will in future ones), the rules are “no internet or getting help at all, except for when pairing with a specific partner who has also just started thinking about it.”
In GPQA, the rules are “you can use the internet, but no LLMs” (both because LLMs feel a bit more cheating-y and, more importantly, because using them violates the ‘don’t leak this to LLMs’ policy you are supposed to sign before downloading GPQA).
Also: when I normally do this exercise, I give people unlimited time (because it’s really not supposed to be about rushing).
In the workshop context, there is some practicality of ‘well, we do kinda need to keep moving to the next session’, so I typically give a 1.5 or 2 hour time window, which in practice is enough for some people but not others.