I work at ARC Evals. I like language models.
Am very happy for people to ask to chat—but I might be too busy to accept (message me).
Just want to point to a more recent (2021) paper implementing adaptive computation by some DeepMind researchers that I found interesting when I was looking into this:
Something I’m unsure about here is whether it is possible to separately condition on worlds where X is in fact the case, vs worlds where all the relevant humans (or other text-writing entities) just wrongly believe that X is the case.
Essentially, is the prompt (particularly the observation) describing the actual facts about the world, or just the beliefs of some in-world text-writing entity? Given that language is often (always?) written by fallible entities, it seems at least not unreasonable to me to adopt the second interpretation rather than the first.
This difference seems particularly relevant to prompts aimed at weeding out deceptive alignment. In the prompts-as-beliefs case, the same prompt could cause conditioning both on worlds where we have in fact solved problem X and on worlds where we are being actively misled into believing that we have solved problem X (when we actually haven't).
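To make the distinction concrete, here's a minimal Bayesian sketch of the two readings (the world states and prior numbers are purely illustrative assumptions of mine, not anything from the post I'm replying to): conditioning on the fact itself removes the "we are being misled" worlds, whereas conditioning only on the writers' belief leaves probability mass on them.

```python
# Toy Bayesian sketch of the two readings of a prompt asserting "X is the case".
# The world states, prior numbers, and variable names are my own illustration.

# Latent world state: (x_true, believed) = (does X actually hold, do the relevant
# text-writers believe it holds). The (False, True) cell is the "we are being
# misled" world.
prior = {
    (True, True): 0.30,   # X holds and the writers correctly believe it
    (True, False): 0.10,  # X holds but the writers don't realise it
    (False, True): 0.15,  # X doesn't hold but the writers wrongly believe it (e.g. deception)
    (False, False): 0.45, # X doesn't hold and the writers know it doesn't
}

def condition(dist, predicate):
    """Renormalise `dist` over the worlds satisfying `predicate`."""
    kept = {w: p for w, p in dist.items() if predicate(w)}
    z = sum(kept.values())
    return {w: p / z for w, p in kept.items()}

# Reading 1: the prompt states an actual fact about the world -> condition on x_true.
posterior_fact = condition(prior, lambda w: w[0])
# Reading 2: the prompt reports an in-world writer's belief -> condition on believed.
posterior_belief = condition(prior, lambda w: w[1])

print("P(X true | prompt-as-fact)   =", sum(p for w, p in posterior_fact.items() if w[0]))    # 1.0
print("P(X true | prompt-as-belief) =", sum(p for w, p in posterior_belief.items() if w[0]))  # ~0.67
print("P(misled | prompt-as-belief) =", posterior_belief.get((False, True), 0.0))             # ~0.33
```

Under the belief reading, a non-trivial chunk of the posterior sits exactly on the deception-style worlds the prompt was supposed to rule out.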