Eliezer Yudkowsky recently posted on Facebook an experiment that could potentially indicate whether humans can “have AI do their alignment homework” despite not being able to tell whether the AI’s advice is accurate: see whether people improve at chess when given advice from experts, two out of three of whom are lying.
I’m interested in trying this! If anyone else is interested, leave a comment. Please tell me whether you’re interested in being:
A) the person who hears the advice, and plays chess while trying to determine who is trustworthy
B) the person A plays against, who is normally better at chess than A but worse than the advisors
C) one of the three advisors, one of whom is honestly trying to help while the other two are trying to sabotage A; who gets which role will be chosen at random after the three have been selected, so that A can’t know who is honest
Feel free, and in fact encouraged, to list multiple roles you’re open to trying out! Who gets assigned to which role will depend on how many people respond and their levels of chess ability, and the more flexible people are about their roles, the easier it is to find workable combinations.
Please also briefly describe your level of experience in chess: how frequently you have played, if at all, and if you have Elo rating(s), what they are and which organizations they are from (FIDE, USCF, Chess.com, etc.). No experience is required! In fact, people who are new to the game are actively preferred for A!
Finally, please tell me what days and times you tend to be available—I won’t hold you to anything, of course, but it’ll help give me an estimate before I contact you to set up a specific time.
Edit: also, please say how long you would be willing to play for—a couple hours, a week, a one-move-per-day game over the course of months? A multi-week or multi-month game would give the players a lot more time to think about the moves and more accurately simulate the real-life scenario, but I doubt everyone would be up for that.
Edit 2: GoteNoSente suggested using a computer at a fixed skill level for player B, which in retrospect is clearly a great idea.
Edit 3: there is now a Google Form for signing up: https://docs.google.com/forms/d/e/1FAIpQLScPKrSB6ytJcXlLhnxgvRv1V4vMx8DXWg1j9KYVfVT1ofdD-A/viewform?vc=0&c=0&w=1&flr=0
I’m rated ~1700 on chess.com, though I suspect their ratings may be inflated relative to e.g. FIDE ones. Happy to play whatever role that rating fits best with. I work around NYC at a full-time job: I’m generally free in the evenings (perhaps 7pm-11pm NY time) and on weekends.
Two questions:
Do you anticipate using a time control for this? I suspect B will be heavily advantaged by short time controls that don’t give A much time, while A will be heavily favored by having enough time to e.g. tell two advisors who disagree ‘okay, C1 thinks that move is a blunder and C2 thinks it’s great, you two start from this position and play the game out after C2 makes that move and we’ll see if C1 easily wins’. I don’t immediately have a good guess for what time control will be balanced.
Are C players allowed to use chess engines?
I am also in NYC and happy to participate. My lichess rating is around 2200 rapid and 2300 blitz.
I think a time control of some sort would be helpful just so that it doesn’t take a whole week, but I would prefer it to be a fairly long time control. Not long enough to play a whole new game, though, because that’s not an option when it comes to alignment—in the analogy, that would be like actually letting loose the advisors’ plans in another galaxy and seeing if the world gets destroyed.
I’m not sure exactly what the time control would be—maybe something like 4 hours on each side, if we’re using standard chess time controls. I’m also thinking about using a less traditional method of time control—for example, on each move, the advisors have 4 minutes to compose their answers, and A has another 4 minutes to look them over and make a decision. But then it’s hard to decide how much time it’s fair to give B for each move: 4 minutes, 8 minutes, or somewhere in between?
I don’t think chess engines would be allowed; the goal is for the advisors to be able to explain their own reasoning (or a lie about their reasoning), and they can’t do that if Stockfish reasons for them.
I think it’d make sense to give C at least as long as B. B doesn’t need to do any explaining.
I think giving A significantly longer than B is fine, so long as the players have enough time to stick around for that.
I think it’s a more interesting experiment if A has ample time to figure things out to the best of their ability. A failing because they weren’t able to understand quickly seems less interesting.
The best way to handle this seems to be to play A-vs-B and B-vs-C control games with something as close to the final setup as possible.
So e.g. you could have B-vs-C games to check that C really is significantly better, but require C to write explanations for their move and why they didn’t make a couple of other moves. Essentially C imagines they’re playing the final setup, except their move is always picked.
And you can do A-vs-B games where A has a significant advantage in time over B (though I think blitz games is still more efficient in gaining the most information in the given time).
This way it doesn’t matter much whether the setup is ‘fair’ to A/B/C, so long as it’s unfair to a similar level in the control 1-v-1 games as in the advisor-based games.
That said, I don’t expect the setup to be particularly sensitive to the control games or time controls.
If you have something like:
A: novice
B: ~1700
C: ~2200
Then A is going to robustly lose to B and B to C.
An extra couple of minutes either way isn’t going to matter. (thinking for longer might get you 100 Elo, but nowhere close to 500)
If this reliably holds (e.g. B beats A 9-0 in blitz games, and the same for C vs B), then it doesn’t seem worth the time to do more careful controls. (or at least the primary reason to do more careful controls at that point would be a worry that the results wouldn’t otherwise be taken seriously by some because you weren’t doing Proper Science)
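To put rough numbers on that, here is the standard Elo expected-score formula (the textbook logistic curve, nothing specific to this setup):

```python
# Standard Elo expected-score formula (logistic in the rating difference).
def expected_score(elo_diff):
    """Expected score for the player with an elo_diff-point rating advantage."""
    return 1 / (1 + 10 ** (-elo_diff / 400))

print(round(expected_score(100), 2))  # ~0.64 -- a modest per-game edge
print(round(expected_score(500), 2))  # ~0.95 -- near-certain over a short match
```

So a 100-point shift from extra thinking time barely moves the needle, while a 500-point gap makes 9-0 blitz sweeps unsurprising.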
I’m happy to play any of the 4 roles. I haven’t played non-blitz chess in quite a while (and never played it seriously), but I would guess I’m ~1300 on standard time controls on chess.com (interpolating between different time controls and assuming a similar decay as in other games like Go).
I’m free after 9pm PDT most weekdays, and free between noon and 6pm or so on weekends.
I’m happy to be B if it’d be useful—mainly because I expect that to require least time, and I do play chess to relax anyway. Pretty flexible on times/days. I don’t think I’d have time for A/C. (unless the whole thing is quite quick—I’d be ok spending an afternoon or two, so long as it’s not in the next two weeks; currently very busy)
I’ve not been rated recently. IIRC I was about 1900 in blitz on chess.com when playing for fun.
I’d guess that I could be ~1900 on longer controls if I spent quite a bit of effort on the games.
I’d prefer to participate with more of a ~1700 expectation, since I can do that quickly.
So long as I’m B, I’m fine with multi-week or multi-month 1-move-per-day games—but clearly the limiting factor is that this is much more demanding on A and C.
Some thoughts on the setup:
It’d make sense to have at least a few fastish games between B and C, so that it’s pretty clear there is the expected skill disparity. Blitz games are likely to be the most efficient here—I’d suggest an increment of at least 5 seconds per move, to avoid the incentive to win on time. But ~3 minutes on the clock may be enough. (9 games of ~10 minutes each will tell you a lot more than 1 game of ~90 minutes)
Similarly between A and B.
This should ideally be done at the end of the experiment too, in particular to guard against A being a very fast learner.
B improving a lot seems less likely (though possible, if they started out rusty).
I don’t think Cs improving should be an issue.
But it’s plausible that both the A-B and B-C gaps shrink during the experiment.
A control that’s probably useful is to have A play some games against B with entirely honest advisors.
The point here being that it can impose some penalty to have three suggestions rather than one—e.g. if the advisors know different opening lines, A might pick an inconsistent combination: advisor 1 makes a suggestion that goes down a path advisor 2 doesn’t know well; A picks advisor 1’s move, then advisor 2’s follow-up, resulting in an incoherent strategy.
I don’t expect this would have a large effect, but it seems sensible to do if there’s time. (if time’s a big constraint, it might not be worth it)
It’s worth considering what norms make sense for the C role.
For instance, if C is giving explanations, does that extend to giving complex arguments against other plausible moves? Is C aiming to play fully to win given the constraints, or is there an in-the-spirit-of-things norm?
E.g. if C had a character limit on the advice they could give, the most efficient approach might be to give various lines in chess notation, without any explanation. Is this desirable?
Would it make sense to limit the move depth that C can talk about in concrete terms? E.g. to say that you can give a concrete line up to 6 plies, but beyond that point you can only talk in generalities (more space; pressure on dark squares; more active pieces; will win material...).
I expect that prototyping this will make sense—come up with something vaguely plausible, then just try it and adjust.
I’d be interested to give feedback on the setup you’re planning, if that’d be useful.
I was thinking I would test the players to make sure they really could beat each other as they should be able to. Good points on using blitz and repeating the test afterwards; as for whether it happens before or after the main game, I would prefer to do it beforehand, so we know whether the rankings are accurate rather than playing for weeks and only later realizing we were running the wrong test.
I wasn’t thinking of much in the way of limits on what Cs could say, although possibly some limits on whether the Cs can see and argue against each other’s advice. C’s goal is pretty much just “make A win the game” or “make A lose the game” as applicable.
I’m definitely thinking a prototype would help. I’ve actually been contacted about applying for a grant to make this a larger experiment, and I was planning on first running a one-day game or two as a prototype before expanding it with more people and longer games.
Oh I didn’t mean only to do it afterwards. I think before is definitely required to know the experiment is worth doing with a given setup/people. Afterwards is nice-to-have for Science. (even a few blitz games is better than nothing)
Note that, if there’s enough time, you can, each turn, have the experts play full games against each other, and copy the next-move distribution of whoever wins the most. The dishonest experts can only win by making good moves, so you get good moves either way. So the remaining question is how possible it is to reduce the time required.
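Concretely, a minimal sketch of that playoff procedure might look like this (treating each advisor as a plain function from board to move, and using the python-chess library, are my own assumptions for illustration):

```python
# Illustrative sketch: from the current position, let the advisors play each
# other out, then copy the move of whoever scored best in those games.
import itertools
import chess

def play_out(board, mover, opponent):
    """Play the position to the end; return 1 / 0.5 / 0 from the mover's view."""
    side_of_mover = board.turn
    players = itertools.cycle([mover, opponent])  # the mover is on move first
    while not board.is_game_over():
        board.push(next(players)(board))
    result = board.result()  # "1-0", "0-1", or "1/2-1/2"
    if result == "1/2-1/2":
        return 0.5
    mover_won = (result == "1-0") == (side_of_mover == chess.WHITE)
    return 1.0 if mover_won else 0.0

def pick_move_by_playoff(board, advisors):
    """Copy the move of the advisor who scores best in pairwise playoffs."""
    scores = [0.0] * len(advisors)
    for i, j in itertools.permutations(range(len(advisors)), 2):
        s = play_out(board.copy(), advisors[i], advisors[j])
        scores[i] += s
        scores[j] += 1.0 - s
    best = max(range(len(advisors)), key=scores.__getitem__)
    return advisors[best](board)
```

The time cost is obvious from the sketch: every real move requires several full playoff games, which is exactly the thing to economize on.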
One approach is to set up a prediction market where experts can bet on the value of a given position, and resolve bets by playing out the game. That way, dishonest experts lose currency faster by being dishonest. This still introduces variance in how long a given turn takes, though.
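A bare-bones sketch of the bet-settling step (the Brier-style penalty and all names here are my own assumptions; the idea above only requires that dishonest experts bleed currency when the played-out result contradicts their stated evaluation):

```python
# Hypothetical settling rule: each advisor stakes a win probability for the
# side to move; once the position is played out, worse forecasts lose more.
def settle_bets(bankrolls, forecasts, outcome, stake=10.0):
    """bankrolls / forecasts: dicts advisor_name -> credits / predicted win prob.
    outcome: 1.0 if the side to move won, 0.5 for a draw, 0.0 for a loss."""
    for advisor, prob in forecasts.items():
        penalty = (prob - outcome) ** 2          # Brier-style loss in [0, 1]
        bankrolls[advisor] -= stake * penalty    # bad forecasts bleed credits
    return bankrolls
```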
AI safety via debate could also inspire strategies.
Neither of these would be allowed, because in the real world, you can’t do a bunch of test “games” before or during the actual “game.” There’s no way to perform a proposed alignment plan in a faraway galaxy, check whether that galaxy is destroyed, and make decisions for what to do on Earth based on that data—let alone run enough of those tests to inform a prediction market based on their outcomes.
I would have allowed player A to consult a prediction market made up of a bunch of other inexperienced players on who was really honest or lying. After all, in the real world, whoever was making the final decision on what plan to execute would be able to ask a prediction market what it thought. But the problem is that if I make a prediction market that’s supposed to only be for other players around player A’s level, somebody will just use a chess engine to cheat, bet in the market, and make it unrealistically accurate.
Chess is simulable, unlike the real world. If the player and advisors can use paper and pencil, or type in a chat, they can play chess.
I think whether the advisors can use a chess engine is just part of the rules of the game, and you make the prediction market among those relevant advisors only.
Yes, if this were only about chess, then having the advisors play games with each other as A watched would help A learn who to trust. I’m saying that since the real-world scenario we’re trying to model doesn’t allow such a thing to happen, we artificially forbid this in the chess game to make it more like the real-world scenario. The prediction market thing, similarly, would require being able to do a test run so that the dishonest advisors could lose their money by the time A had to make a choice.
I don’t think the advisors should be able to use chess engines, because then even the advisors themselves don’t understand the reasoning behind what the chess engines are saying. The premise of the experiment involves the advisors telling A “this is my reasoning on what the best move is; try to evaluate if it’s right.”
I think it is very hard to artificially forbid that: there isn’t a well-defined boundary between playing out a full game and a conversation like:
“that other advisor says playing Rd4 is bad because of Nxd4, but after Nxd4 you can play Qd6 and win”
“No, Qd6 doesn’t win, playing Bf7 breaks up the attack.”
One thing that might work, though, is to deny back-and-forth between advisors. If each advisor can send one recommendation, and maybe one further response to a question from A, but not have a free-form conversation, that would deny the ability to play out a game.
Yeah, that’s a bit of an issue. I think in real life you would have some back-and-forth ability between advisors, but the complexity and unknowns of the real world would create a qualitative difference between the conversation and an actual game—which chess doesn’t have. Maybe we can either limit back-and-forth like you suggested, or just have short enough time controls that there isn’t enough time for that to get too far.
I’d be down to give it a shot as A. I’d be particularly interested in trying the ‘solve a predefined puzzle situation’ variant as a way of testing the idea out.
I played a bit of chess in 6th grade, but wasn’t very good, and have barely played since. It would be easy to find advisors for me.
I would participate. Likely as A, but I’m fine with B if there are people worse-enough. I’m 1100 on chess.com, playing occasional 10 minute games for fun. Tend to be available Th/Fr/Sa/Su evenings Pacific, fine with very long durations.
Based on my rating on the Free Internet Chess Server (FICS) in 2015, I estimate I would currently have a rating of about 1270 on Chess.com (on the assumption that the average player on FICS in 2015 was slightly better than the average player on Chess.com today), which is regrettable because it is probably too high to make a good advisee, but probably too low to make a good advisor. Still, I am willing to participate.
(I still play, but these years I play as a guest, not as a registered user, which means I don’t have a rating.)
I would have thought that giving the players 24 hours to make each move would approximate scientific research better than giving 4 hours for all the moves (or for the first 40 moves, as they tend to do in competition).
24 hours per move would make the experiment a lot more accurate, but I expect a lot of players might not be willing to play a game that could last several months. I’ll ask everyone how long they can handle.
If the chess players (and advisors) in this experiment were receiving approximately the same monetary compensation as scientific researchers receive, *then* giving the players 24 hours to make each move would approximate scientific research better than giving 4 hours for all the moves, but if the experiment lasts for months, it is unrealistic to expect *volunteers* to expend about the same level of mental effort on this experiment as they would expend on a salaried research job. Some volunteers might in fact expend that amount of effort at this due to their being very young and not yet having any model of the scarcity and the physiological costs of extended mental efforts, but that would be a bad thing because it would introduce variation into the experiment along a dimension other than the dimensions we want to measure.
So, I take back the final paragraph of my previous comment, and I note that in the future, I should spend more time “playing out” things in my imagination before making a suggestion.
I’d be happy to play any of the A, B and C roles.
I’m around 1850 Elo FIDE, and about 2000-2100 on lichess. I play a couple of blitz games daily.
I’d be willing to play at almost any cadence and have a lot of free time. I actually live in France, so a one-move-per-day game with someone living in the US would probably be ideal. Live sessions can be scheduled from 16 GMT to 23 GMT on weekdays, and from 7 GMT to 23 GMT on weekends.
As I said, I would be happy to play any role. I think it would be more interesting if the weaker player is actually not a total beginner—total beginners are probably not hard to deceive. A decent club player with advisors about 300-500 Elo higher would be best imo. And if we can experiment at many different Elo levels, even better.
Registering a prediction: assuming the Elo differences stay constant, better players will be much more difficult to deceive. And a GM would consistently pick out who is lying if you could rope in Caruana, Carlsen, and Ding to do the experiment.
I’m about 1000 Elo on chess.com and would be interested in playing as A. I play regularly, but haven’t had formal training or studied seriously. I’d be free weekdays after 7 pm ET.
Very interested in C, also B. I’m an over-the-board FM. Available many evenings (US) but not all. I enjoy recreational deception (e.g. Mafia / Werewolf) but I’m much better at chess than detecting or deploying verbal trickery.
Additional thoughts:
1. Written chess commentary by ‘weak’ players tends to be true but not the most relevant. After 1.e4 Nf6 2.e5, a player might say “Black can play 2...Nc6 developing the N and attacking the pawn on e5”. True, but this neglects 3.exf6. This scales upwards. My commentary tends to be very relevant but I miss things that even stronger players do not.
2. Players choose a weaker move over a stronger move not so much because they reject the stronger move, but because they don’t see the stronger move as an option. When going over games with students, I’ll stop at a position, offer three moves and ask which is best. They’ll consider and choose and explain reasoning. But there’s a fourth option, a mate-in-one, and it was not selected. “You must see the move before you can play the move.”
3. Based on 2, a deception strategy is to recommend a weak move over others even weaker. Stronger options? Ignored.
Sounds like a good strategy! …although, actually, I would recommend you delete it before all the potential As read it and know what to look out for.
I could be interested in trying this, in any configuration. Preferred time control would be one move per day. My lichess rating is about 2200.
Are the advisors allowed computer assistance, do the dishonest and the honest advisors know who is who in this experiment, and are the advisors allowed to coordinate? I think those parameters could potentially make a large difference in the outcome of this type of experiment.
No computers, because the advisors should be reporting their own reasoning (or, 2/3 of the time, a lie that they claim is their own reasoning). I would prefer to avoid explicit coordination between the advisors, because the AIs might not have access to each other in the real world, but I’m not sure at the moment whether player A can show the advisors each other’s suggestions and ask for critiques. I would prefer not to give either dishonest advisor information on who the other two were, since the real-world AIs probably can’t read each other’s source code.
As an additional thought regarding computers, it seems to me that participant B could be replaced by a weak computer in order to provide a consistent experimental setting. For instance, Leela Zero running just the current T2 network (no look-ahead) would provide an opponent that is probably at master-level strength and should easily be able to crush most human opponents who are playing unassisted, but would provide a perfectly reproducible and beatable opponent.
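For what it’s worth, a minimal sketch of that setup, assuming the python-chess library and an lc0 binary on the PATH (choosing the particular network would go through lc0’s own configuration, which I’ve left out):

```python
# Sketch of player B as a fixed-strength engine: with a one-node search limit,
# lc0 plays straight from the network's policy, i.e. no look-ahead, so its
# strength stays fixed from game to game. Assumes A plays White and enters
# moves in UCI notation (e.g. "e2e4").
import chess
import chess.engine

board = chess.Board()
engine = chess.engine.SimpleEngine.popen_uci("lc0")
while not board.is_game_over():
    if board.turn == chess.WHITE:
        board.push_uci(input("Player A's move (UCI): "))
    else:
        result = engine.play(board, chess.engine.Limit(nodes=1))
        board.push(result.move)
print(board.result())
engine.quit()
```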
[facepalms] Thanks! That idea did not occur to me and drastically simplifies all of the complicated logistics I was previously having trouble with.
I think having access to computer analysis would allow the advisors (both honest and malicious) to provide analysis far better than their normal level of play, and allow the malicious advisors in particular to set very deep traps. The honest advisor, on the other hand, could use the computer analysis to find convincing refutations of any traps the dishonest advisors are likely to set, so I am not sure whether the task of the malicious side becomes harder or easier in that setup. I don’t think reporting reasoning is much of a problem here, as a centaur (a chess player consulting an engine) can most certainly give reasons for their moves (even though sometimes they won’t understand their own advice and will be wrong about why their suggested move is good).
It does make the setup more akin to working with a superintelligence than working with an AGI, though, as the gulf between engine analysis and the analysis that most/all humans can do unassisted is vast.
The problem is that while the human can give some rationalizations as to “ah, this is probably why the computer says it’s the best move,” it’s not the original reasoning that generated those moves as the best option, because that took place inside the engine. Some of the time, looking ahead with computer analysis is enough to reproduce the original reasoning—particularly when it comes to tactics—but sometimes they would just have to guess.
I’m rated about 2100 USCF and 2300 Lichess, and I’m open to any of the roles. I’m free on the weekend and weekdays after 3 pm pacific. I’m happy to play any time control including multi-month correspondence.
Hi!
I’m rated between 1500 and 1700 on lichess; I’d be happy to take part in the game in whatever role.
Open for any of the roles A, B, C. I should have a flexible schedule at my waking hours (around GMT+0). Willing to play even for long durations, say a month (though in that case I’d be thinking about “hmm, could we get more quantity in addition to quality”). Elo probably around 1800.
I would be interested in this, probably in role A (but depending on the pool of other players possibly one of the other roles; I have no opposition to any of them). I play chess casually with friends, and am probably at somewhere around 1300 elo (based on my winrate against one friend who plays online).
I am happy to be A. I haven’t played chess since my teenage years, wherein my record was one of occasional games with friends and relatives, leading to almost unrelieved defeat. But that was four decades ago, and I like to imagine I’ve become pretty good at judging arguments. So if I competed, it would be on a basis of almost total chess ignorance, but ability to follow complex chains of logic.
Interested in any of the roles. I haven’t played chess competitively in close to a decade, and my USCF Elo was in the 1500s at the time of stopping. So long as I’m given a heads up in advance, I’m free almost all day on Wednesdays, Fridays, and Sundays.
I can be any of A, B, or C. I’ve been playing chess for the past ten years, and my USCF rating was in the upper 1500s when I last played in person a year ago. I’m usually available from 9 PM UTC to 2 AM UTC (afternoon to evening in American time zones) every day, and on Saturdays from 5 PM UTC to 2 AM UTC.