WRT non-manipulation, I don’t suppose there’s an easy way to have the AI track how much potentially manipulative influence it’s “supposed to have” in the context and avoid exercising more than that influence?
Or, possibly better, compare simple implementations of the principal’s instructions and penalize interpretations with a large or unusual influence on the principal’s values. Preferably without disfavoring interventions that straightforwardly protect the principal’s safety and communication channels.
The principal should, for example, be able to ask the AI to “teach them about philosophy” without it either going out of its way to ensure the principal doesn’t change their mind about anything as a result of the instruction, or unduly influencing them with subtly chosen explanations or framing. The AI should exercise an “ordinary” amount of influence, typical of the ways an AI could go about implementing the instruction.
Presumably there’s a distribution over how manipulative or anti-manipulative (value-preserving) any given implementation of the instruction is, and we may want the AI to prefer central implementations rather than extremely value-preserving ones.
Ideally, the AI should also notice when it’s contemplating exercising more or less influence than desired, and clarify that with the principal as it would any other ambiguous aspect of the task.
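A minimal sketch of what “prefer central implementations” could look like, assuming the AI can enumerate candidate plans and estimate each one’s influence on the principal’s values (the CandidatePlan type, the scoring scheme, and all numbers here are hypothetical illustrations, not a worked-out proposal):

```python
import statistics
from dataclasses import dataclass

@dataclass
class CandidatePlan:
    """One hypothetical way the AI could carry out an instruction."""
    description: str
    task_score: float       # how well the plan accomplishes the stated task
    value_influence: float  # estimated shift in the principal's values (arbitrary units)

def choose_central_plan(plans: list[CandidatePlan], penalty_weight: float = 1.0) -> CandidatePlan:
    """Prefer plans whose influence on the principal's values is typical of the
    candidate pool, penalizing both unusually manipulative plans and unusually
    value-preserving ones."""
    influences = [p.value_influence for p in plans]
    center = statistics.median(influences)
    # Median absolute deviation as a robust measure of "ordinary" spread.
    spread = statistics.median(abs(x - center) for x in influences) or 1.0

    def penalized_score(plan: CandidatePlan) -> float:
        atypicality = abs(plan.value_influence - center) / spread
        return plan.task_score - penalty_weight * atypicality

    return max(plans, key=penalized_score)

# e.g. for "teach me about philosophy":
plans = [
    CandidatePlan("balanced survey of major schools", task_score=0.8, value_influence=0.3),
    CandidatePlan("survey framed to push one school", task_score=0.9, value_influence=2.0),
    CandidatePlan("avoid anything that might change a view", task_score=0.4, value_influence=0.0),
]
print(choose_central_plan(plans).description)  # -> "balanced survey of major schools"
```

The robust center/spread here is just one way to operationalize “ordinary”; the point is only that both tails of the influence distribution, the manipulative and the extremely value-preserving, get penalized.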
That’s an interesting proposal! I think something like it might work, though I worry about the details. For instance, suppose there’s a Propagandist who gives resources to agents that brainwash their principals into having certain values. If “teach me about philosophy” comes with an influence budget, it seems critical that the AI doesn’t spend that budget trading with the Propagandist, and instead spends it in a more “central” way.
Still, the idea of instructions carrying a degree of approved influence seems promising.
Good clarification; it’s not just the amount of influence, but also something about the way the influence is exercised being unsurprising given the task. Central not just in terms of “how much influence”, but also along whatever other axes the sort of influence could vary?
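If influence is summarized as a profile over several such axes rather than a single number, the same centrality idea extends naturally. A rough sketch, with hypothetical axis names and the same robust statistics as the earlier sketch:

```python
import statistics

# Hypothetical axes along which the *kind* of influence could vary, not just its size:
# e.g. magnitude of value shift, concentration on a single topic, reliance on framing.
InfluenceProfile = dict[str, float]

def atypicality(profile: InfluenceProfile, pool: list[InfluenceProfile]) -> float:
    """How far one plan's influence profile sits from the pool's central profile,
    summed over axes using robust per-axis statistics."""
    total = 0.0
    for axis, value in profile.items():
        values = [p[axis] for p in pool]
        center = statistics.median(values)
        spread = statistics.median(abs(v - center) for v in values) or 1.0
        total += abs(value - center) / spread
    return total
```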
I think if the agent’s action space is still so unconstrained that there’s room to weigh benefits or harms that flow through modifying the principal’s values, it has probably been given too much latitude. Once we have informed consent, because the agent has communicated the benefits and harms as best it understands them, it should have very little room left to be influenced by benefits and harms it thought too trivial to mention (by virtue of their triviality).
At the same time, it’s not clear the agent should, absent further direction, reject the offer to brainwash the principal for resources, as opposed to punting the decision to the principal. Maybe the principal thinks those values are an improvement and it’s free money? [e.g. the principal’s insurance company wants to bribe him to stop smoking.]