I disagree. I think that there is an important point to be made here about AI safety. A lot of people have made the argument that ‘an agent which can only work by communicating via text on a computer screen, sent to a human who has been pre-warned to not let it out’ can never result in a reasonable thoughtful intelligent human choosing to let it out.
I think the fact that this experiment has been run a few times, and sometimes results in the guardian losing, presents at least some evidence that this claim of persuasion-immunity is false.
I think this claim still matters for AI safety, even in the current world.
Yes, the frontier labs have been letting their models access the internet and communicate with the wider world. But they have not been doing so without safety screening. This safety screening involves, at least in part, a human reading text produced by the model and deciding 'it is ok to approve this model for contact with the world'.
Currently, I think the frontier labs employ a lot of people who would describe themselves as not vulnerable to being persuaded to incorrectly approve a model for release. I think this is overconfidence on their part. Humans are vulnerable to being tricked and conned. I'm not saying that nobody can ever avoid being persuaded, just that we can't assume the robustness of this safety measure and should devise better safety measures.
ok sure, but this game is still terrible and is evidence against clear thinking about the problem at hand on the part of anyone who plays it as a test of anything but whether they’ll win a social deception game. perhaps revealing the transcript would fix it; I sort of doubt I’d be convinced then either. I just don’t think there’s a good way to set this up so there are any significant constraints making the situation similar.