Instrumental Convergence Bounty
I have yet to find a real-world example that I can test my corrigibility definition on. Hence, I will send $100 to the first person who can send/show me an example of instrumental convergence that is:
1. Surprising, in the sense that the model was trained on a goal other than “maximize money,” “maximize resources,” or “take over the world”
2. Natural, in the sense that instrumental convergence arose while trying to do some objective, not with the goal in advance being to show instrumental convergence
3. Reproducible, in the sense that I could plausibly run the model+environment+whatever else is needed to show instrumental convergence on a box I can rent on Lambda
Example of what would count as a valid solution:
I was training an agent to pick apples in Terraria and it took over the entire world in order to convert it into a massive apple orchard
Example that would fail because it is not surprising:
I trained an agent to play CIV IV and it took over the world
Example that would fail because it is not natural:
After reading your post, I created a toy model where an agent told to pick apples tiles the world with apple trees
Example that would fail because it is not reproducible:
The US military did a simulation in which they trained a drone which decided to take out its operator
Update
The bounty has been claimed by Hastings.
Obviously I would still appreciate more examples, but there won’t be a 2nd bounty (or if I do create one in the future it will have more requirements attached).
Hi! I might have something close. The chess engine Stockfish has a heuristic for what it wants, with manually specified values for how much it wants to keep its bishops, open up lines of attack, connect its rooks, etc. I tried to modify this function to make it want to advance the king up the board, by adding a direct reward for every step forward the king takes. At low search depths, this leads it to immediately move the king forward, but at high search depths it mostly just attacks the other player in order to make it safe to move the king (the best defense is a good offense), and only starts moving the king late in the game. I wasn’t trying to demonstrate instrumental convergence; in fact this behavior was quite annoying, as it was ruining my intended goal (creating fake games demonstrating the superiority of the bongcloud opening)
modified stockfish: https://github.com/HastingsGreer/stockfish
This was 8 years ago, so I’m fuzzy on the details. If it sounds like vaguely what you’re looking for, reply to let me know and I’ll write this up with some example games and make sure the code still runs.
This sounds like exactly the type of thing I’m looking for.
Do you know, in the “convergent” case, is the score just for advancing the king, or is there a possibility that there’s some boring math explanation like “as the number of steps calculated increases, the weights on the other position terms (bishops, lines of attack, etc.) overwhelm the score for the king advancing”?
I think I fully lobotomized the evaluation function to only care about advancing the king, except that it still evaluates checkmate as +infinity. Here’s a sample game:
https://www.chesspastebin.com/view/24278
It doesn’t really understand material anymore except for the queen, which I guess is powerful enough that it wants to preserve it to allow continued king pushing. I still managed to lose because I’m not very good at chess.
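To make “only care about advancing the king, except checkmate” concrete, here is a minimal sketch in Python using the python-chess library (pip install chess). This is not the actual modification (which edits Stockfish’s hand-written C++ evaluation); it just illustrates the evaluation-plus-search structure:

```python
# Hypothetical sketch: a "king-forward" evaluation plus a plain fixed-depth
# minimax search, using the python-chess library.
import chess


def king_forward_eval(board: chess.Board, color: chess.Color) -> float:
    """Score a position purely by how far `color`'s king has advanced,
    with checkmate still treated as plus/minus infinity."""
    if board.is_checkmate():
        # The side to move is the side that got mated.
        return float("-inf") if board.turn == color else float("inf")
    king_sq = board.king(color)
    if king_sq is None:  # should not happen in a legal game
        return float("-inf")
    rank = chess.square_rank(king_sq)  # 0..7, counted from White's side
    return float(rank if color == chess.WHITE else 7 - rank)


def search(board: chess.Board, depth: int, color: chess.Color) -> float:
    """Plain minimax: we maximize king advancement, the opponent minimizes it."""
    if depth == 0 or board.is_game_over():
        return king_forward_eval(board, color)
    values = []
    for move in board.legal_moves:
        board.push(move)
        values.append(search(board, depth - 1, color))
        board.pop()
    return max(values) if board.turn == color else min(values)


def best_move(board: chess.Board, depth: int) -> chess.Move:
    color, best, best_val = board.turn, None, float("-inf")
    for move in board.legal_moves:
        board.push(move)
        val = search(board, depth - 1, color)
        board.pop()
        if best is None or val > best_val:
            best, best_val = move, val
    return best
```

At depth 1 this just marches the king forward immediately; with a deeper search, getting mated evaluates to minus infinity, so the engine should tend to neutralize the opponent before exposing the king, which is roughly the behavior described earlier in the thread.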
EDIT: the installation guide I listed below had too many steps and involved blindly trusting internet code, which is silly. Instead, I just threw it up on lichess and you can play it in the browser here: https://lichess.org/@/king-forward-bot
If you want to play yourself, you can compile the engine with
git clone https://github.com/HastingsGreer/stockfish
cd stockfish/src
make
and then install a GUI like xboard (on Mac, brew install xboard) and add your stockfish binary as a UCI engine.
This looks great!
I will DM you to figure out how to send the bounty.
It would be more elegant to remove the checkmate penalty. After all, checkmate is instrumentally bad, because on the next move, the king is captured and can no longer move forward any spaces—unless he’s advanced to the final row* and so has maximized his movement score, and so why care about what happens after that? Nothing will increase (or decrease) his movement reward.
* I take it that by ‘advancing the king’ you mean ‘moving to a new, unvisited row’, so it maxes out at 8, for the 8 rows of the chess board, and not simply moving at all, since that would have degenerate solutions like building a fortress and then the king simply moving back and forth between two spaces for eternity.
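As a tiny illustrative sketch of that reading of the movement reward (hypothetical, in the same python-chess terms as the sketch above):

```python
import chess

def ranks_visited_score(king_squares: list[int]) -> int:
    """Movement reward = number of distinct ranks the king has ever occupied.

    This maxes out at 8 (one per row of the board), and shuffling between two
    already-visited squares forever adds nothing, which rules out the
    degenerate fortress solution."""
    return len({chess.square_rank(sq) for sq in king_squares})

# e.g. ranks_visited_score([chess.E1, chess.E2, chess.E1]) == 2
```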
I agree! The stockfish codebase handles evaluation of checkmates somewhere else in the code, so that would be a bit more work, but it’s definitely the correct next step.
That’s great. “The king can’t fetch the coffee if he’s dead”
What is wrong with the CIV IV example? Being surprising is not actually a requirement to test your theory against; if anything it should be an anti-requirement: theories should pass the central-example tests first, and those are the unsurprising examples. And indeed instrumental convergence is a very ordinary everyday thing which should be unsurprising in the vast majority of cases.
Like, if the win condition were to take over the world, then sure, CIV would be a bad example. But that’s not actually the win condition in CIV. (At least not in CIV V/VI, I haven’t played CIV IV.)
The thing I think you should do here is take any game with in-game resources/currency, and examine the instrumental convergence of resource/currency acquisition in an agent trained to play that game. Surely there are dozens of such examples already. E.g. just off the top of my head, AlphaStar should definitely work. Or in the case of CIV IV, acquisition of gold/science/food/etc.
I’m interested in cases where there is a correct non-power-maximizing solution. For winning at Civ IV, taking over the world is the intended correct outcome. I’m hoping to find examples like the Strawberry Problem, where there is a correct non-world-conquering outcome (duplicate the strawberry) and taking over the world (in order to e.g. maximize scientific research on strawberry duplication) is an unwanted side-effect.
Kinda trolling, but:
If I built a DWIM AI, told it to win at CIV IV, and it did this, I would conclude it was misaligned. Which is precisely why SpiffingBrit’s videos are so fun to watch (huge fan).
I think the correct solution to the strawberry problem would also involve a ton of instrumental convergence? You’d need to collect resources to do research/engineering, then set up systems to experiment with strawberries/biotech, then collect generally applicable information on strawberry duplication, and then apply that to duplicate the strawberry.
If I ask an AI to duplicate a strawberry and it takes over the world, I would consider that misaligned. Obviously it will require some instrumental convergence (resources, intelligence, etc.) to duplicate a strawberry. An aligned AI should either duplicate the strawberry while staying within a “budget” for how many resources it consumes, or say “I’m sorry, I can’t do that”.
I would recommend you read my post on corrigibility which describes how we can mathematically define a tradeoff between success and resource exploitation.
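For concreteness, one generic form such a tradeoff could take (an illustration only, not necessarily the exact definition in that post) is to score a policy π by task success minus a resource penalty, U(π) = Pr[success | π] − λ·Resources(π), where a larger λ corresponds to a tighter resource “budget”.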
Questionable—turning the universe into paperclips really is the optimal solution to the “make as many paperclips as possible” problem. But yeah, obviously in Civ IV taking over the world isn’t even an instrumental goal—it’s just the actual goal.
I think he just wants an example of an agent rewarded for something simple (like resource collection) exhibiting power-seeking behavior to the degree that it takes over the game environment. To a lot of people, that feels intuitively different from an agent explicitly maximizing an objective. I actually can’t name an example after looking for an hour, but I would bet money something like that already exists.
My guess is that if you plop two Starcraft AIs on a map and reward them every time they gather resources, then with enough training they would start fighting each other for control of the map. I would also guess that someone has already done this exact scenario. Is there an AI search engine for Reddit anyone would recommend?
That’s definitely not what “instrumental convergence” means, in general. So:
Is there a reason to be interested in that phenomenon, rather than instrumental convergence more generally?
If so, perhaps we need a different name for it?
What is the difference between that and instrumental convergence?
From the LW wiki page:
So, the standard central examples of instrumental convergence are self-preservation and resource acquisition. If the OP is asking for examples of “instrumental convergence”, and resource acquisition is not the kind of thing they’re asking for, then the thing they’re asking for is not instrumental convergence (or is at least a much narrower category than instrumental convergence).
If the OP is looking for a pattern like “AI trained at <some goal> ends up ‘taking over the world’”, then that would be an example of instrumental convergence, but it’s a much narrower category than instrumental convergence in general. Asking for “examples of instrumental convergence”, if you actually want examples of AI trained at some random goal “taking over the world” (whatever that means), is confusing in the same way as asking for examples of cars when in fact you want an example of a 2005 red Toyota Camry.
And if people frequently want to talk about a 2005 red Toyota Camry specifically, and the word they’re using is “car” (which is already mostly used to mean something else), then that strongly suggests we need a new word.
I see your point. Maybe something like “resource domination” or just “instrumental resource acquisition” is a better term for what he is looking for, I think.
Once you understand how it works, it’s no longer surprising.
Take collecting keys in Montezuma’s Revenge. If framed simply as “I trained an AI to take actions that increase the score, and it learned how to collect keys that will only be useful later,” then it plausibly looks like a surprising example of learning instrumentally useful actions. But if it’s “I trained an AI to construct a model of the world and then explore options in that model with the eventual goal of getting high reward, and rewarded it for increasing the score,” then it’s no longer so surprising: once you understand why it does what it does, the surprise goes away.
Questions:
Would something like an agent trained to maximize minerals mined in Starcraft learning to attack other players to monopolize their resources count?
I assume it would count if that same agent was just rewarded every time it mined minerals, or the mineral count went up, without an explicit objective to maximize the amount of minerals it has?
Would a gridworld example work? How complex does the simulation have to be?
I’m probably going to be a stickler about 2 (“not with the goal in advance being to show instrumental convergence”), meaning that the example can’t be something written in response to this post (though I reserve the right to suspend this if the example is really good).
The reason being, I’m pretty sure that I personally could create such a gridworld simulation. But “I solved instrumental convergence in this toy example I created myself” wouldn’t convince me as an outsider that anything impressive had been done.
Maybe you would accept this paper, which was discussed quite a bit at the time: Emergent Tool Use From Multi-Agent Autocurricula
The AI learns to use a physics engine glitch in order to win a game. I am thinking of the behavior at 2:36 in this video. The code is available on GitHub here. I didn’t try to run it myself, so I do not know how easy it is to run, or how complete it is.
As to whether the article matches your other criteria:
The goal of the article was to get the AI to find new behaviors, so it might not count as purely natural. But it seems the physics glitch was not planned, so it did come as a surprise.
Maybe glitching the physics to win at hide and seek is not a sufficiently general behavior to count as a case of instrumental convergence.
I won’t blame you if you think this doesn’t count.
If I was merely looking for examples of RL doing something unexpected, I would not have created the bounty.
I’m interested in the idea that AI trained on totally unrelated tasks will converge on the specific set of goals described in the article on instrumental convergence.