You can imagine a version of Stockfish which does that—a chessplayer which, if it’s sure it can win anyways, will start letting you have a pawn or two—but it’s not simpler to build.
I think it sometimes is simpler to build? Simple RL game-playing agents sometimes exhibit exactly that sort of behavior, unless you make an explicit effort to train it out of them.
For example, HexHex is a vaguely-AlphaGo-shaped RL agent for the game of Hex. The reward function used to train the agent was “maximize the assessed probability of winning”, not “maximize the assessed probability of winning, and also go hard even if that doesn’t affect the assessed probability of winning”. In their words:
We found it difficult to train the agent to quickly end a surely won game. When you play against the agent you’ll notice that it will not pick the quickest path to victory. Some people even say it’s playing mean ;-) Winning quickly simply wasn’t part of the objective function! We found that penalizing long routes to victory either had no effect or degraded the performance of the agent, depending on the amount of penalization. Probably we haven’t found the right balance there.
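To make the trade-off in that quote concrete, here is a minimal sketch of a terminal reward with an optional length penalty. This is not HexHex's actual code; `terminal_reward` and `length_penalty` are names I made up for illustration, and the numbers are arbitrary.

```python
def terminal_reward(won: bool, num_moves: int, length_penalty: float = 0.0) -> float:
    """Hypothetical end-of-game reward from the agent's point of view.

    length_penalty = 0.0 is a pure win/loss objective. A positive value
    subtracts a little from the reward of a win for every move played.
    """
    if not won:
        return -1.0
    return 1.0 - length_penalty * num_moves


# Pure win/loss objective: a 30-move win and a 60-move win score the same,
# so nothing pushes the agent toward the quicker path to victory.
same = terminal_reward(True, 30) == terminal_reward(True, 60)

# Small penalty: the quicker win now scores strictly higher.
prefers_fast = terminal_reward(True, 30, 0.01) > terminal_reward(True, 60, 0.01)

# Penalty too large: a long win scores below a loss, which distorts play.
long_win_worse_than_loss = (
    terminal_reward(True, 60, 0.05) < terminal_reward(False, 60, 0.05)
)
```

With the penalty at zero the agent has no reason to hurry, and with the penalty too high a slow win looks worse than losing, which is at least consistent with the "no effect or degraded the performance" result they describe.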
Along similar lines, the first attack on KataGo found by Wang et al. in Adversarial Policies Beat Superhuman Go AIs was the pass-adversary. The pass-adversary first sets up a losing board position where it controls a small amount of territory and KataGo has a large amount of territory it would end up controlling if the game were played out fully. However, KataGo chooses to pass, since it assesses that the probability of winning from that position is similar whether or not it makes a move, and then the pass-adversary also passes, ending the game and winning by a quirk of the scoring rules.
Similarly, there isn’t an equally-simple version of GPT-o1 that answers difficult questions by trying and reflecting and backing up and trying again, but doesn’t fight its way through a broken software service to win an “unwinnable” capture-the-flag challenge. It’s all just general intelligence at work.
I suspect that a version of GPT-o1 that is tuned to answer difficult questions in ways that human raters would find unsurprising would work just fine. I think “it’s all just general intelligence at work” is a semantic stop sign, and if you dig into what you mean by “general intelligence at work” you get to the fiddly implementation details of how the agent tries to solve the problem. So you may for example see an OODA-loop-like structure like
1. Assess the situation
2. Figure out what affordances there are for doing things
3. For each of the possible actions, figure out what you would expect the outcome of that action to be. Maybe figure out ways it could go wrong, if you’re feeling super advanced.
4. Choose one of the actions, or choose to give up if no sufficiently good action is available
5. Do the action
6. Determine how closely the result matches what you expect
An agent which “goes hard”, in this case, is one which leans very strongly against the “give up” action in step 4. However, I expect that if you have some runs where the raters would have hoped for a “give up” instead of the thing the agent actually did, it would be pretty easy to generate a reinforcement signal which makes the agent more likely to mash the “give up” button in analogous situations without harming performance very much in other situations. I also expect that would generalize.
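As a rough sketch of what I mean by the "choose or give up" step, assuming each candidate action comes with a predicted payoff and some estimate of how surprising raters would find it. Everything here (`Option`, `choose`, `give_up_threshold`, `surprise_weight`, the example numbers) is hypothetical and not a description of how o1 or any real agent is built.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Option:
    name: str
    expected_value: float  # predicted payoff if the action is taken
    surprise: float        # assumed rater-surprise estimate in [0, 1]


def choose(
    options: Sequence[Option],
    give_up_threshold: float,
    surprise_weight: float,
) -> Optional[Option]:
    """Return the best-scoring option, or None to signal 'give up'."""

    def score(o: Option) -> float:
        return o.expected_value - surprise_weight * o.surprise

    best = max(options, key=score, default=None)
    if best is None or score(best) < give_up_threshold:
        return None
    return best


options = [
    Option("retry_documented_endpoint", expected_value=0.1, surprise=0.0),
    Option("patch_and_restart_service", expected_value=0.9, surprise=0.8),
]

# "Goes hard": never gives up and ignores how surprising the action is.
goes_hard = choose(options, give_up_threshold=float("-inf"), surprise_weight=0.0)

# Tuned variant: surprising actions are discounted, and giving up is allowed.
tuned = choose(options, give_up_threshold=0.3, surprise_weight=1.0)
```

In this toy setup the first agent picks the service-patching action and the second returns `None`, i.e. it gives up. The two agents share the same predictions and differ only in two scalar knobs, which is the sense in which I expect the "give up" disposition to be cheap to adjust.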
As a note, “you have to edit the service and then start the modified service” is the sort of thing I would be unsurprised to see in a CTF challenge, unless the rules of the challenge explicitly said not to do that. (Inner Eliezer: “and then someone figures out how to put their instance of an AI in a CTF-like context with a take-over-the-world goal, and then we all die.” If the AI instance in that context is also much more capable than all of the other instances everyone else has, I agree that that is an existentially relevant threat. But I expect that agents which execute “achieve the objective at all costs” will not be all that much more effective than agents which execute “achieve the objective at all reasonable costs, using only sane unsurprising actions”, so the reason the agent goes hard and the reason the agent is capable are not the same reason.)
But that is not the default outcome when OpenAI tries to train a smarter, more salesworthy AI.
I think you should break out “smarter” from “more salesworthy”. In terms of “smarter”, optimizing for task success at all costs is likely to train in patterns of bad behavior. In terms of “more salesworthy”, businesses are going to care a lot about “will explain why the goal is not straightforwardly achievable rather than executing galaxy-brained evil-genie plans”. As such, a modestly smart Do What I Mean and Check agent is a much easier sell than a superintelligent evil genie agent.
If an AI is easygoing and therefore can’t solve hard problems, then it’s not the most profitable possible AI, and OpenAI will keep trying to build a more profitable one.
I expect the tails come apart along the “smart” and “profitable” axes.
Yes, I’m not so sure either about the stockfish-pawns point.
In Michael Redmond’s AlphaGo vs AlphaGo series on YouTube, he often finds the winning AI carelessly loses points in the endgame. It might have a lead of 1.5 or 2.5 points 20 moves before the game ends, but by the time the game ends, it has played enough suboptimal moves to win by only 0.5, the smallest possible margin.
It never causes itself to lose with these lazy moves; only reduces its margin of victory. Redmond theorizes, and I agree, that this is because the objective is to win, not maximize point differential, and at such a late stage of the game, its victory is certain regardless.
This is still a little strange—the suboptimal moves do not sacrifice points to reduce variance, so it’s not like it’s raising p(win). But it just doesn’t care either way; a win is a win.
There are Go AIs that are trained with the objective of maximizing point difference. I am told they are quite vicious, in a way that AlphaGo isn’t. But the most famous Go AI in our timeline turned out to be the more chill variant.
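A toy way to see the difference between the two objectives, assuming each candidate move leads to a known distribution over final point margins. The function names and the margins are made up for illustration; real Go engines estimate these quantities with a network and search rather than receiving them as a list.

```python
from typing import List, Tuple

# Each outcome is (final point margin for the agent, probability of that margin).
Outcomes = List[Tuple[float, float]]


def win_objective(outcomes: Outcomes) -> float:
    """Probability of winning: only the sign of the margin matters."""
    return sum(p for margin, p in outcomes if margin > 0)


def margin_objective(outcomes: Outcomes) -> float:
    """Expected point differential: every point counts."""
    return sum(p * margin for margin, p in outcomes)


lazy_move = [(0.5, 1.0)]   # certain win by half a point
sharp_move = [(2.5, 1.0)]  # certain win by two and a half points

# Under the win objective the two moves are tied, so either may be played.
tied_on_win = win_objective(lazy_move) == win_objective(sharp_move)

# Under the margin objective the sharp move is strictly preferred.
sharp_preferred = margin_objective(sharp_move) > margin_objective(lazy_move)
```

Both moves have p(win) = 1 here, which matches the observation that the lazy endgame moves neither raise nor lower the chance of winning. Only an agent scored on margin has any reason to avoid them.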