Something I’m unsure about here is whether it is possible to separately condition on worlds where X is in fact the case, vs worlds where all the relevant humans (or other text-writing entities) just wrongly believe that X is the case.
Essentially, is the prompt (particularly the observation) describing the actual facts about this world, or just the beliefs of some in-world text-writing entity? Given that language is often (always?) written by fallible entities, it seems at least not unreasonable to me to assume the second rather than the first interpretation.
This difference seems particularly relevant to prompts aimed at weeding out deceptive alignment: in the prompts-as-beliefs case, the same prompt could cause conditioning both on worlds where we have in fact solved problem X and on worlds where we are being actively misled into believing that we have solved it (when we actually haven’t).
I’m assuming we can input observations about the world for conditioning, and those don’t need to be text. I didn’t go into this in the post, but for example I think the following are fair game:
Physical newspapers exist which report that BigLab has solved the alignment problem.
A camera positioned 10km above NYC would take a picture consistent with humans walking on the street.
There is data on hard drives consistent with Reddit posts claiming BigCo has perfected interpretability tools.
Whereas the following are not allowed, because I don’t see how they could be operationalized (see the toy sketch after this list):
BigLab has solved the alignment problem.
Alice is not deceptive.
BigCo has perfected interpretability tools.
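To make concrete what I have in mind by ‘operationalized’ (purely a toy illustration with made-up names like ObservableData and newspapers_report_solution, not a claim about how conditioning would actually be implemented): the allowed observations are predicates you could in principle check against the recorded, observable contents of a world, whereas the disallowed statements refer to latent properties with no such check.

```python
from dataclasses import dataclass

@dataclass
class ObservableData:
    """The kind of thing we can actually hand over as an observation:
    recorded text, images, bytes on hard drives, etc."""
    newspaper_headlines: list[str]
    reddit_posts: list[str]
    overhead_photo_shows_pedestrians: bool

@dataclass
class World:
    observable: ObservableData
    # Latent facts that nothing in the recorded data directly pins down.
    alignment_actually_solved: bool
    alice_is_deceptive: bool

# 'Fair game' observations are predicates over ObservableData alone:
def newspapers_report_solution(obs: ObservableData) -> bool:
    return any("alignment" in h.lower() and "solve" in h.lower()
               for h in obs.newspaper_headlines)

# The disallowed statements ("BigLab has solved the alignment problem",
# "Alice is not deceptive") would need access to the latent fields of World,
# and I don't see what check over observable data would operationalize them.
```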
For the newspaper and Reddit post examples, I think false beliefs remain relevant, since these are really observations of what people claim or believe rather than of the underlying facts. For example, the observation of BigCo announcing they have solved alignment is compatible with worlds where they actually have solved alignment, but also with worlds where BigCo has made some mistake and alignment hasn’t actually been solved, even though people in-universe believe that it has. These kinds of ‘mistaken alignment’ worlds seem like they would probably contaminate the conditioning at least to some degree, especially if there are ways that early deceptive AIs might be able to manipulate BigCo and others into making these kinds of mistakes.
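To spell the contamination worry out slightly more formally (a rough sketch that treats conditioning as straightforward Bayesian updating on the observation, which is itself an assumption about how the conditioning works): write O for the observation that BigCo announces a solution and S for alignment actually being solved. Then

$$P(S \mid O) = \frac{P(O \mid S)\,P(S)}{P(O \mid S)\,P(S) + P(O \mid \neg S)\,P(\neg S)},$$

so the conditioned distribution puts weight proportional to $P(O \mid \neg S)\,P(\neg S)$ on the ‘mistaken alignment’ worlds. That second term is exactly what an early deceptive AI would be pushing up if it could manipulate BigCo into announcing a solution when there isn’t one.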
Fully agreed.