I think the problem you mention is a real challenge, but not the main limitation of this idea.
The problem you mention actually decreases with greater intelligence and capabilities, since a smarter AI clearly understands the concept of being corrigible to its creators vs. a random guy on the street, just like a human does.
The main problem is still how reinforcement learning trains the AI behaviours which actually maximize reward, while corrigibility only trains the AI behaviours which appear corrigibile.
Discriminating on the basis of the creators vs a random guy on the street helps with many of the easiest cases, but in an adversarial context, it’s not enough to have something that works for all the easiest cases, you need something that can’t predictably made to fail by a highly motivated adversary.
Like you could easily do some sort of data augmentation to add attempts at invoking the corrigibility system from random guys on the street, and then train it not to respond to that. But there’ll still be lots of other vulnerabilities.
I still think, once the AI approaches human intelligence (and beyond), this problem should start to go away, since a human soldier can choose to be corrigible to his commander and not the enemy, even in very complex environments.
I still feel the main problem is “the AI doesn’t want to be corrigible,” rather than “making the AI corrigible enables prompt injections.” It’s like that with humans.
That said, I’m highly uncertain about all of this and I could easily be wrong.
If the AI can’t do much without coordinating with a logistics and intelligence network and collaborating with a number of other agents, and its contact to this network routes through a commanding agent that is as capable if not more capable than the AI itself, then sure, it may be relatively feasible to make the AI corrigible to said commanding agent, if that is what you want it to be.
(This is meant to be analogous to the soldier-commander example.)
But was that the AI regime you expect to find yourself working with? In particular I’d expect you expect that the commanding agent would be another AI, in which case being corrigible to them is not sufficient.
Oops I didn’t mean that analogy. It’s not necessarily a commander, but any individual that a human chooses to be corrigible/loyal to. A human is capable of being corrigible/loyal to one person (or group), without accruing the risk of listening to prompt injections, because a human has enough general intelligence/common sense to know what is a prompt injection and what is a request from the person he is corrigible/loyal to.
As AI approach human intelligence, they would be capable of this too.
Can you give 1 example of a person choosing to be corrigible to someone they are not dependent upon for resources/information and who they have much more expertise than?
Maybe someone who believes in following the will of the majority even if he/she disagrees (and could easily become a dictator)?
Do you mean “resigns from a presidential position/declines a dictatorial position because they disagree with the will of the people” or “makes policy they know will be bad because the people demand it”?
Maybe a good parent who listens to his/her child’s dreams?
Maybe someone like George Washington who was so popular he could easily stay in power, but still chose to make America democratic. Let’s hope it stays democratic :/
No human is 100% corrigible and would do anything that someone else wants. But a good parent might help his/her child get into sports and so forth but if the child says he/she wants to be a singer instead the parent helps him/her on that instead. The outcome the parent wants depends on what the child wants, and the child can change his/her mind.
I think the problem you mention is a real challenge, but not the main limitation of this idea.
The problem you mention actually decreases with greater intelligence and capabilities, since a smarter AI clearly understands the concept of being corrigible to its creators vs. a random guy on the street, just like a human does.
The main problem is still how reinforcement learning trains the AI behaviours which actually maximize reward, while corrigibility only trains the AI behaviours which appear corrigibile.
Discriminating on the basis of the creators vs a random guy on the street helps with many of the easiest cases, but in an adversarial context, it’s not enough to have something that works for all the easiest cases, you need something that can’t predictably made to fail by a highly motivated adversary.
Like you could easily do some sort of data augmentation to add attempts at invoking the corrigibility system from random guys on the street, and then train it not to respond to that. But there’ll still be lots of other vulnerabilities.
I still think, once the AI approaches human intelligence (and beyond), this problem should start to go away, since a human soldier can choose to be corrigible to his commander and not the enemy, even in very complex environments.
I still feel the main problem is “the AI doesn’t want to be corrigible,” rather than “making the AI corrigible enables prompt injections.” It’s like that with humans.
That said, I’m highly uncertain about all of this and I could easily be wrong.
If the AI can’t do much without coordinating with a logistics and intelligence network and collaborating with a number of other agents, and its contact to this network routes through a commanding agent that is as capable if not more capable than the AI itself, then sure, it may be relatively feasible to make the AI corrigible to said commanding agent, if that is what you want it to be.
(This is meant to be analogous to the soldier-commander example.)
But was that the AI regime you expect to find yourself working with? In particular I’d expect you expect that the commanding agent would be another AI, in which case being corrigible to them is not sufficient.
Oops I didn’t mean that analogy. It’s not necessarily a commander, but any individual that a human chooses to be corrigible/loyal to. A human is capable of being corrigible/loyal to one person (or group), without accruing the risk of listening to prompt injections, because a human has enough general intelligence/common sense to know what is a prompt injection and what is a request from the person he is corrigible/loyal to.
As AI approach human intelligence, they would be capable of this too.
Can you give 1 example of a person choosing to be corrigible to someone they are not dependent upon for resources/information and who they have much more expertise than?
Maybe someone who believes in following the will of the majority even if he/she disagrees (and could easily become a dictator)?
Maybe a good parent who listens to his/her child’s dreams?
Very good question though. Humans usually aren’t very corrigible, and there aren’t many examples!
Do you mean “resigns from a presidential position/declines a dictatorial position because they disagree with the will of the people” or “makes policy they know will be bad because the people demand it”?
Can you expand on this?
Maybe someone like George Washington who was so popular he could easily stay in power, but still chose to make America democratic. Let’s hope it stays democratic :/
No human is 100% corrigible and would do anything that someone else wants. But a good parent might help his/her child get into sports and so forth but if the child says he/she wants to be a singer instead the parent helps him/her on that instead. The outcome the parent wants depends on what the child wants, and the child can change his/her mind.