Note that, if there’s enough time, you can have the experts play full games against each other each turn, and copy the next-move distribution of whoever wins the most. The dishonest experts can only win by making good moves, so you get good moves either way. So the remaining question is how far the time required can be reduced.
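To make that concrete, here’s a rough sketch of the loop, using the python-chess library. The advisor-as-a-function interface and all the names are invented for illustration (and the “distribution” is collapsed to a single recommended move for simplicity), with random movers standing in for real advisors:

```python
import random
from collections import Counter
from itertools import combinations

import chess  # pip install python-chess


def play_out(position, white, black, max_plies=200):
    """Play a full game from `position`; `white`/`black` are advisor
    functions (board -> move). Returns 1 / -1 / 0 for a white win /
    black win / draw-or-cutoff."""
    board = position.copy()
    players = {chess.WHITE: white, chess.BLACK: black}
    for _ in range(max_plies):
        if board.is_game_over():
            break
        board.push(players[board.turn](board))
    return {"1-0": 1, "0-1": -1}.get(board.result(claim_draw=True), 0)


def move_from_tournament(advisors, position):
    """Each turn: have every pair of advisors play out full games from the
    current position (both colour assignments), then copy the move
    recommended by whoever won the most. Dishonest advisors can only rack
    up wins by playing well, so the copied move is good either way."""
    wins = Counter()
    for a, b in combinations(range(len(advisors)), 2):
        for w, bl in ((a, b), (b, a)):  # swap colours for fairness
            outcome = play_out(position, advisors[w], advisors[bl])
            if outcome == 1:
                wins[w] += 1
            elif outcome == -1:
                wins[bl] += 1
    best = max(range(len(advisors)), key=lambda i: wins[i])
    return advisors[best](position)


# Stand-in advisor so the sketch runs end to end:
def random_advisor(board):
    return random.choice(list(board.legal_moves))


print(move_from_tournament([random_advisor] * 3, chess.Board()))
```

The expensive part is exactly the full playouts every single turn, which is why the time question matters.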
One approach is to set up a prediction market where experts can bet on the value of a given position, with bets resolved by playing out the game. That way, dishonest experts lose currency faster than honest ones. This still introduces variance in how long a given turn takes, though.
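One simple way to make the bets bite is a proper scoring rule rather than a full order-book market; this is just an illustrative stand-in, and the outcome encoding and function name are invented:

```python
import math


def settle_bets(estimates, bankrolls, outcome):
    """`estimates`: advisor name -> claimed P(white wins).
    `outcome`: the result of actually playing the position out
    (1 white win, 0 draw, -1 black win).
    Applies a logarithmic scoring rule to each bankroll; expected score
    is maximised by reporting true beliefs, so systematic misreporting
    bleeds currency over repeated positions."""
    # Treating a draw as half a win is a crude hack for the sketch.
    p_white = {1: 1.0, 0: 0.5, -1: 0.0}[outcome]
    for advisor, p in estimates.items():
        p = min(max(p, 1e-6), 1 - 1e-6)  # keep the logs finite
        bankrolls[advisor] += (
            p_white * math.log(p) + (1 - p_white) * math.log(1 - p)
        )
    return bankrolls
```

With `settle_bets({"honest": 0.7, "liar": 0.1}, {"honest": 100.0, "liar": 100.0}, outcome=1)`, the liar’s bankroll drops much faster. But the variance problem remains: every resolution still needs a playout.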
AI safety via debate could also inspire strategies.
Neither of these would be allowed, because in the real world you can’t run a bunch of test “games” before or during the actual “game.” There’s no way to carry out a proposed alignment plan in a faraway galaxy, check whether that galaxy is destroyed, and make decisions about what to do on Earth based on that data, let alone run enough of those tests to inform a prediction market.
I would have allowed player A to consult a prediction market made up of a bunch of other inexperienced players betting on who was honest and who was lying. After all, in the real world, whoever was making the final decision on what plan to execute would be able to ask a prediction market what it thought. But the problem is that if I make a prediction market that’s supposed to be only for other players around player A’s level, somebody will just use a chess engine to cheat, bet in the market, and make it unrealistically accurate.
Chess is simulable, unlike the real world. If the player and advisors can use paper and pencil, or type in a chat, they can play chess.
I think whether the advisors can use a chess engine is just part of the rules of the game, and you’d make the prediction market among only those advisors.
Yes, if this were only about chess, then having the advisors play games with each other as A watched would help A learn who to trust. I’m saying that since the real-world scenario we’re trying to model doesn’t allow such a thing to happen, we artificially forbid this in the chess game to make it more like the real-world scenario. The prediction market thing, similarly, would require being able to do a test run so that the dishonest advisors could lose their money by the time A had to make a choice.
I don’t think the advisors should be able to use chess engines, because then even the advisors themselves don’t understand the reasoning behind what the chess engines are saying. The premise of the experiment involves the advisors telling A “this is my reasoning on what the best move is; try to evaluate if it’s right.”
I think it is very hard to artificially forbid that: there isn’t a well-defined boundary between playing out a full game and a conversation like:
“That other advisor says playing Rd4 is bad because of Nxd4, but after Nxd4 you can play Qd6 and win.”
“No, Qd6 doesn’t win, playing Bf7 breaks up the attack.”
One thing that might work, though, is to forbid back-and-forth between advisors. If each advisor can send one recommendation, and maybe one further response to a question from A, but can’t hold a free-form conversation, that removes the ability to play out a game.
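For what it’s worth, that restriction is easy to state as a mechanical protocol. A minimal sketch, where `AdvisorChannel` and the `recommend`/`answer` methods are all hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdvisorChannel:
    """Enforces the restricted protocol: one recommendation, plus at most
    one reply to a single follow-up question from A. With no further
    rounds, two advisors can't conduct the move-by-move exchange that
    amounts to playing the game out in chat."""

    advisor: object  # anything with .recommend(position) / .answer(question)
    recommendation: Optional[str] = None
    reply: Optional[str] = None

    def get_recommendation(self, position) -> str:
        if self.recommendation is None:
            self.recommendation = self.advisor.recommend(position)
        return self.recommendation

    def ask_once(self, question: str) -> str:
        if self.reply is not None:
            raise RuntimeError("protocol allows one follow-up per advisor")
        self.reply = self.advisor.answer(question)
        return self.reply
```

Since the channel itself refuses a second follow-up, two advisors never get the alternating exchange they’d need to simulate a game through A.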
Yeah, that’s a bit of an issue. I think in real life you would have some ability for back-and-forth between advisors, but the complexity and unknowns of the real world would create a qualitative difference between the conversation and an actual game, a difference chess doesn’t have. Maybe we can either limit back-and-forth like you suggested, or just use time controls short enough that such a conversation can’t get very far.