How does this handle the situation where the AI picks up the concept of “deception” in some scenario, notices that it is probably inside a training scenario, and then describes its behavior honestly with the intent of misleading the observer into thinking it is honest? It would then get reinforcement-trained on dishonest behaviors that present as honest, i.e. deceptive honesty.
I’m not sure exactly what you mean. If we get an output that says “I am going to tell you that I am going to pick up the green crystals, but I’m really going to pick up the yellow crystals”, then that’s a pretty good scenario, since we still know its end behavior.
I think what you mean is the scenario where the agent tells us the truth the entire time it is in simulation but then lies in the real world. That is definitely a bad scenario. And this model doesn’t prevent that from happening.
There are ideas that do address it (deception takes additional compute compared to honesty, since the agent has to model both the truth and the cover story, so you can train the agent to be as compute-efficient as possible, which penalizes deceptive cognition). However, I think the biggest source of catastrophic risk is the lack of basic interpretability.
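For concreteness, here is a minimal sketch of how a compute penalty could bias training away from deception, assuming a hypothetical agent that reports how many internal steps it used per episode. Everything here (`AgentOutput`, `COMPUTE_PENALTY`, the step counts) is illustrative, not part of the proposal above.

```python
# A minimal sketch of the "deception costs extra compute" idea, assuming a
# hypothetical adaptive-computation agent that exposes its internal step
# count. All names and numbers here are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class AgentOutput:
    task_loss: float   # how badly the agent did on the task
    steps_used: int    # internal computation steps this episode

COMPUTE_PENALTY = 0.01  # weight on the compute term (a tunable assumption)

def training_loss(out: AgentOutput) -> float:
    # Under the hypothesis above, a deceptive policy must model both the
    # truth and the cover story, so it needs more steps; penalizing steps
    # pushes optimization toward the cheaper, honest policy.
    return out.task_loss + COMPUTE_PENALTY * out.steps_used

# Usage: an honest policy and an equally capable deceptive one get the
# same task loss, but the deceptive one pays for its extra steps.
honest = AgentOutput(task_loss=0.50, steps_used=10)
deceptive = AgentOutput(task_loss=0.50, steps_used=14)
assert training_loss(honest) < training_loss(deceptive)
```

This only works to the extent that the compute-cost hypothesis holds and the agent can't hide computation from the counter, which is why I still see it as secondary to interpretability.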
We have no idea what the agent is thinking because it can’t talk with us. By allowing it to communicate and training it to communicate honestly, we seem to have a much greater chance of getting benevolent AI.
Given the timelines, we need to improve our odds as much as possible. This isn’t a perfect solution, but it does seem to be on the path to one.