That’s a challenge, and while you (hopefully) chew on it, I’ll tell an implausibly-detailed story to exemplify a deeper obstacle.
Some thoughts written down before reading the rest of the post (list is unpolished / not well communicated)
The main problems I see:
There are kinds of deception (or rather kinds of deceptive capabilities / thoughts) that only show up after a certain capability level, and training before that level just won’t affect them because they’re not there yet.
General capability implies the ability to be deceptive whenever deception is useful in a particular circumstance. So you can’t just train away the capability to be deceptive (or maybe you can, but not in a way that is robust with respect to general capability gains).
Really, you want to train against the propensity to be deceptive rather than the capability. But propensities also change with capability level; becoming more capable is all about having more ways to achieve your goals. So eliminating the propensity to be deceptive at a lower capability level does not eliminate it at a higher one.
The robust way to get rid of the propensity to be deceptive is to reach an attractor where more capability means less deception (within the capability range we care about), because the AI’s terminal goals on some level include ‘being nondeceptive’.
Before we can align the AI’s goals to human intent in this way, the AI needs to have a good understanding of human intent and good situational awareness, and to be a (more or less) unified / coherent agent. If it’s not, then its goals / propensities will shift as it becomes more capable (or more situationally aware, or more coherent, etc.).
This is a pretty harsh set of prerequisites, and probably falls outside the range of circumstances where people usually hope their method for avoiding deception will work.
Even if methods to detect deception (narrowly conceived) work, we cannot distinguish an agent that is actually nondeceptive / aligned from one that, e.g., just aims to play the training game (and will do something unspecified once it reaches a capability threshold that allows it to breach containment).
A specific (maybe too specific) problem that can still happen in this scenario: you might get an AI that is capable overall but just learns not to think long enough about scenarios that would lead it to try to be deceptive. This can happen even at the maximum capability levels at which we might hope to contain an AGI that we are trying to align (i.e. somewhere around human level, optimistically).
Honesty is an attractor in a cooperative multi-agent system, where each agent relies on the other agents having accurate information to do their part of the work (a toy illustration below).
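To make the attractor claim a bit more concrete, here is a minimal sketch of a toy two-state common-payoff signaling game (my own construction, not anything from the post): when payoffs are shared and the receiver simply acts on the sender’s message, only an honest sender maximizes the team’s reward, which is the sense in which the cooperative setup rewards honesty rather than deception.

```python
# Toy common-payoff signaling game (hypothetical illustration):
# a sender observes the world state and sends a message; a receiver that
# trusts the message acts on it; sender and receiver share the same reward.

STATES = [0, 1]  # uniform prior over two world states

# Candidate sender policies: message as a function of the observed state.
sender_policies = {
    "honest":       lambda s: s,
    "lie":          lambda s: 1 - s,
    "always_say_0": lambda s: 0,
    "always_say_1": lambda s: 1,
}

def team_reward(sender, receiver):
    """Expected shared reward when the receiver acts on the sender's message."""
    return sum(1.0 for s in STATES if receiver(sender(s)) == s) / len(STATES)

trusting_receiver = lambda m: m  # acts as if the message were true

for name, policy in sender_policies.items():
    print(f"{name:>12}: expected team reward = {team_reward(policy, trusting_receiver):.2f}")

# honest        -> 1.00
# lie           -> 0.00
# always_say_*  -> 0.50
# With shared payoffs and a receiver that relies on the message, only the
# honest sender maximizes the team's reward.
```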
I don’t think understanding intent is the hardest part. Even current LLMs are mostly able to do that.