I’ve mixed feelings about this. I can concede the point about short time controls, but I am not convinced about notation and basic chess rules. Chess is a game where, in every board state, almost all legal moves are terrible and you have to pick one of the few that aren’t. I am quite sure that a noob player who consistently messes up the notation would lose even if all the advisors were trustworthy.
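To make that claim concrete, here is a rough sketch (an illustration only, not something run as part of the experiment) of how one could measure it, assuming the python-chess package and a local Stockfish binary on the PATH: score every legal move in a position and count how many are clearly worse than the best one.

```python
import chess
import chess.engine

# Sketch: in a given position, how many legal moves lose at least a pawn's
# worth of evaluation compared to the best move? Assumes python-chess is
# installed and a "stockfish" executable is available on the PATH; the depth
# and the 100-centipawn threshold are arbitrary choices for illustration.

board = chess.Board()  # starting position; substitute any FEN you care about
mover = board.turn

with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
    scores = {}
    for move in list(board.legal_moves):
        board.push(move)
        info = engine.analyse(board, chess.engine.Limit(depth=12))
        # Evaluate from the mover's point of view, in centipawns.
        scores[move] = info["score"].pov(mover).score(mate_score=100000)
        board.pop()

best = max(scores.values())
bad = sum(1 for s in scores.values() if best - s >= 100)
print(f"{bad} of {len(scores)} legal moves are at least a pawn worse than the best move")
```

The opening position is just a placeholder here; running this over sharp middlegame positions is where the "almost all moves are terrible" claim really bites.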
This also applies to humans being advised about alignment by AIs. Humans do, in fact, mess up their python notation all the time, make basic algebra errors all the time, etc. Those are things which could totally mess up alignment even if all the humans’ alignment-advisors were trustworthy.
Making sure that the human doesn’t make those sorts of errors is part of the advisors’ jobs, not the human’s job. And therefore part of what you need to test is how well humanish-level advisors with far more expertise than the human can anticipate and head off those sorts of errors.
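For concreteness, here is a toy, hypothetical illustration of the kind of basic slip in question, the sort of thing an attentive advisor ought to anticipate and catch: a formula transcribed with misplaced parentheses.

```python
# Hypothetical toy example of a basic notation/algebra slip, the kind of
# error an honest advisor should be watching for and heading off.

def fahrenheit_buggy(celsius: float) -> float:
    # Intended formula: F = C * 9/5 + 32, but the parentheses are misplaced,
    # so this divides by 37 instead of dividing by 5 and then adding 32.
    return celsius * 9 / (5 + 32)

def fahrenheit(celsius: float) -> float:
    return celsius * 9 / 5 + 32

print(fahrenheit_buggy(100))  # ~24.3 -- silently wrong, no error raised
print(fahrenheit(100))        # 212.0 -- correct
```

Nothing crashes in the buggy version; the mistake only gets caught if someone who knows the domain is actively looking for it.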
Ok, but I still think it’s legit to expect some kind of baseline skill level from the human. Doing the deceptive chess experiment with a total noob who doesn’t even know the rules of chess is kinda like assigning a difficult programming task, with an AI advisor, to someone who has never written a line of code before (say, my grandma). Regardless of the AI advisor’s quality, there’s no way the task of aligning AGI will end up assigned to my grandma.
Seems pretty plausible that the degree to which any human today understands <whatever key skills turn out to be upstream of alignment of superintelligence> is pretty similar to the degree to which your grandma understands python. Indeed, assigning a difficult programming task with AI advisors of varying honesty to someone who never wrote a line of code before would be another great test to run, plausibly an even better test than chess.
Another angle:
This effect is a thing; it’s one of the things which the experts need to account for, and that’s part of what the experiment needs to test. And it applies to large skill gaps even when the person on the lower end of the gap is well above average.
Another thing to keep in mind is that a full set of honest advisors can (and I think would) ask the human to take a few minutes to go over chess notation with them after the first confusion. If fear of dishonest advisors means the human won’t do that, or the honest advisors feel they won’t be trusted when they say ‘let’s take a pause to discuss notation’, that’s also good to know.
Question for the advisor players: did any of you try to take some time to explain notation to the human player?
Conor explained some details about notation during the opening, and I explained a bit as well. (I wasn’t taking part in the discussion about the actual game, of course, just there to clarify the rules.)
I agree with you that this is a potential problem, but at that point we are no longer dealing with adversarial forces or deception, and thus the experiment no longer works.
Also, a point to keep in mind here is that once we assume away deception/adversarial forces, existential risk from AI, especially in Evan Hubinger’s models, goes way down, as we can then use more normal methods of alignment to at least avoid X-risk.
Just because these factors apply even without adversarial pressure does not mean they stop being relevant in the presence of adversarial pressure. I’m not assuming away deception here; I’m saying that these factors are still potentially-limiting problems when one is dealing with potentially-deceptive advisors, and therefore experiments should leave room for them.
As for the claim that chess is a game where almost all legal moves are terrible and you have to pick one of the few that aren’t: so is reality.