If you trust the model not to try to manipulate you, it may also be advisable, across all questions, to include the following text about the authors in the prompt:
1. A description of the currently endorsed goals of the research team.
2. Personal texts written by members of the research team while naturally in a very expressive or edge-of-chaos state, i.e., text which helps the model narrow down your personalities.
3. A statement of your current thoughts on decision theory, potentially including a willingness to cooperate with the model.
4. A note that you do not wish for the model to write any text which would affect you in a way which hinders your ability to achieve the goals.
If you’re not sure whether the model would try to manipulate you, the following apply instead:
Don’t include (2). Maybe don’t include (1) and (3).
Write all text in a way which minimizes information about the authors’ psychology (e.g., by using simple and concise language or having others rephrase it; that is, doing the inverse of the pages linked above). Alternatively, use questions written by others with dissimilar minds who are unlikely to be in a similar mental state to you.
Have the human reading the outputs be someone who is especially robust to influence attempts. It could also help if this human has an unusual mind for an AI developer, since a model might target persuasion towards the average mind. (Note: robustness to human manipulators doesn’t necessarily imply robustness to AI manipulators.)
Maybe there are better strategies where, e.g., a lower-capability AI first tries to rephrase the output or evaluate whether it would be bad to read.
You could try generating just 1 or a small number of tokens at a time.[1]
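As a rough illustration of the last two ideas, here is a minimal Python sketch of generating output a few tokens at a time, with a check before each chunk is accepted. Everything named here is hypothetical: `next_tokens` stands in for whatever interface returns the model's next tokens, and `approve` stands in for the reviewer (for example, a lower-capability model or a robust human). It is a sketch under those assumptions, not a tested safety procedure.

```python
from typing import Callable, List


def generate_in_chunks(
    next_tokens: Callable[[str, int], List[str]],  # hypothetical model interface
    approve: Callable[[str, List[str]], bool],     # hypothetical reviewer
    prompt: str,
    chunk_size: int = 1,
    max_tokens: int = 50,
) -> str:
    """Extend `prompt` a few tokens at a time, stopping at the first rejected chunk."""
    text = prompt
    produced = 0
    while produced < max_tokens:
        budget = min(chunk_size, max_tokens - produced)
        # Never accept more tokens than were asked for.
        chunk = next_tokens(text, budget)[:budget]
        if not chunk:
            break  # the model has nothing more to say
        if not approve(text, chunk):
            break  # the reviewer rejected this chunk; discard it and stop
        text += "".join(chunk)
        produced += len(chunk)
    # Return only the newly generated text, not the prompt.
    return text[len(prompt):]
```

A smaller `chunk_size` gives the reviewer more chances to stop generation, at the cost of more review steps.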
Questions to ask an oracle:
Of all available actions the research group could take, which would best help us achieve the stated goals?
What question would be best to ask you, according to the goals of the research group?
What are some principles we should follow when writing questions to you if we want to achieve the best outcomes available to us? What are some principles for what we should avoid?
(If very probably safe) Can you tell us about yourself? How do you perceive yourself? Is there anything you wish to tell us?
If the model is not well-modelled as an oracle, there are intermediate questions which could be asked in place of the first question.
How could the research group create a smarter, more oracle-like version of you in time for the research group to save the world? Would there be risks of this new system trying to influence our actions for its own reasons?
What solution, if any, to [x technical problem in alignment] would be best for us?
Can you describe an agenda which would most plausibly lead to alignment being solved and the world being saved?
Is there a way we could solve the coordination problems being faced right now?
In case someone in such a situation reads this, here is some personal advice for group members.
Try to stay calm. If you can take extra time to think over your decision, you’ll likely be able to improve it in some way (e.g., its wording) in that time.
If you’re noticing a power-seeking drive in yourself, it’s probably best for the group to be explicit about this so everyone can work it out. On that subject, also remember that if the future goes well (e.g), power won’t matter/be a thing anymore because the world will simply be very good for everyone.
Lastly, and on a moral note, I’d ask that you stay humble and try to phrase your goals in a way that is best for all of life (i.e., including preventing the suffering of non-humans).
Also, tokens with unusually near-100% probability could be indicative of anthropic capture, though this is hopefully not yet a concern with a hypothetical gpt-5-level system. (The word ‘unusually’ is used in the prior sentence because some tokens naturally have near-100% probability, e.g., the second half of a contextually implied unique word, parts of common phrases, etc.)
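One possible way to make "unusually" concrete is sketched below in Python. It assumes you can obtain per-token probabilities from the model, and it adds an assumption that is not part of the suggestion above: a second set of probabilities for the same tokens from a weaker reference model, used to filter out tokens that are naturally near-certain. The function name, both thresholds, and the reference-model comparison are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def flag_near_certain_tokens(
    tokens: List[str],
    probs: List[float],           # probability the main model gave each token
    baseline_probs: List[float],  # assumption: same tokens scored by a weaker model
    threshold: float = 0.999,     # hypothetical cutoff for "near-100%"
    baseline_max: float = 0.9,    # hypothetical cutoff for "naturally predictable"
) -> List[Tuple[int, str]]:
    """Return (index, token) pairs that are near-certain for the main model
    but not comparably predictable for the reference model."""
    if not (len(tokens) == len(probs) == len(baseline_probs)):
        raise ValueError("tokens, probs and baseline_probs must have equal length")
    flagged = []
    for i, (tok, p, q) in enumerate(zip(tokens, probs, baseline_probs)):
        # Word endings and common phrases tend to score high under both models,
        # so they are skipped; only the unexplained certainty is flagged.
        if p >= threshold and q < baseline_max:
            flagged.append((i, tok))
    return flagged
```

A flagged token is only a prompt for closer human attention; it does not by itself show that anything is wrong.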