What is wrong with the CIV IV example? Being surprising is not actually a requirement to test your theory against; if anything it should be an anti-requirement, theories should pass the central-example tests first and those are the unsurprising examples. And indeed instrumental convergence is a very ordinary everyday thing which should be unsurprising in the vast majority of cases.
Like, if the win condition were to take over the world, then sure, CIV would be a bad example. But that’s not actually the win condition in CIV. (At least not in CIV V/VI, I haven’t played CIV IV.)
The thing I think you should do here is take any game with in-game resources/currency, and examine the instrumental convergence of resource/currency acquisition in an agent trained to play that game. Surely there are dozens of such examples already. E.g., just off the top of my head, AlphaStar should definitely work. Or in the case of CIV IV, acquisition of gold/science/food/etc.
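To make “examine the instrumental convergence of resource acquisition” a bit more concrete, here is a toy sketch. It is purely illustrative: the MDP, the goal names, and all the numbers are made up, and it has nothing to do with AlphaStar’s or any Civ agent’s actual training setup.

```python
# Toy model: an agent can either gather gold or attempt its terminal goal,
# and attempts succeed with probability that rises with gold held. We solve
# the same little MDP for three unrelated goals; with these made-up numbers
# the optimal opening move is to gather gold in every case, i.e. resource
# acquisition shows up as an instrumental step regardless of the goal.

MAX_GOLD = 5     # gold is capped so the state space stays tiny
HORIZON = 8      # planning horizon in steps
GAMMA = 0.95     # discount factor

def success_prob(gold: int, base: float) -> float:
    """Chance a goal attempt succeeds; more gold, better odds."""
    return min(1.0, base + 0.15 * gold)

def first_action(base: float, goal_reward: float) -> str:
    """Finite-horizon value iteration; return the optimal action at the start."""
    # V[t][g] = value with t steps remaining and g gold, goal not yet achieved.
    V = [[0.0] * (MAX_GOLD + 1) for _ in range(HORIZON + 1)]
    act0 = "attempt"
    for t in range(1, HORIZON + 1):
        for g in range(MAX_GOLD + 1):
            gather = GAMMA * V[t - 1][min(g + 1, MAX_GOLD)]   # +1 gold, no reward yet
            p = success_prob(g, base)
            attempt = p * goal_reward + (1 - p) * GAMMA * V[t - 1][g]
            V[t][g] = max(gather, attempt)
            if t == HORIZON and g == 0:   # the decision faced on the first step
                act0 = "gather" if gather > attempt else "attempt"
    return act0

if __name__ == "__main__":
    # Three unrelated "terminal goals", differing in base difficulty and payoff.
    goals = {"win_by_science": (0.05, 10.0),
             "win_by_culture": (0.10, 25.0),
             "duplicate_strawberry": (0.02, 100.0)}
    for name, (base, reward) in goals.items():
        print(f"{name:22s} optimal first move: {first_action(base, reward)}")
```

The only thing the goals share is that a generic resource raises the odds of achieving them, and that is already enough for ordinary value maximization to open with resource gathering in each case.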
What is wrong with the CIV IV example? Being surprising is not actually a requirement to test your theory against
I’m interested in cases where there is a correct non-power-maximizing solution. For winning at Civ IV, taking over the world is the intended correct outcome. I’m hoping to find examples like the Strawberry Problem, where there is a correct non-world-conquering outcome (duplicate the strawberry) and taking over the world (in order to e.g. maximize scientific research on strawberry duplication) is an unwanted side-effect.
Kinda trolling, but:
If I built a DWIM AI, told it to win at CIV IV and it did this, I would conclude it was misaligned. Which is precisely why SpiffingBrit’s videos are so fun to watch (huge fan).
I think the correct solution to the strawberry problem would also involve a ton of instrumental convergence? You’d need to collect resources to do research/engineering, then set up systems to experiment with strawberries/biotech, then collect generally applicable information on strawberry duplication, and then apply that to duplicate the strawberry.
If I ask an AI to duplicate a strawberry and it takes over the world, I would consider that misaligned. Obviously it will require some instrumental convergence (resources, intelligence, etc.) to duplicate a strawberry. An aligned AI should either duplicate the strawberry while staying within a “budget” for how many resources it consumes, or say “I’m sorry, I can’t do that”.
I would recommend you read my post on corrigibility which describes how we can mathematically define a tradeoff between success and resource exploitation.
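This is not the construction from that post; just as a rough, made-up sketch, here is what “a tradeoff between success and resource exploitation” can look like as an objective (the function name, the penalty weight lam, and the budget are all invented for illustration):

```python
# Just a generic sketch of scoring plans by task success minus a price on
# resource use, with a hard budget on top. Not taken from the corrigibility
# post; `lam` and `budget` are illustrative knobs.

def bounded_objective(task_success: float,
                      resources_used: float,
                      lam: float = 0.5,
                      budget: float = 10.0) -> float:
    """Reward success, charge for resources, and reject over-budget plans."""
    if resources_used > budget:
        return float("-inf")      # effectively "I'm sorry, I can't do that"
    return task_success - lam * resources_used

# Two hypothetical plans for duplicating the strawberry:
modest_plan = bounded_objective(task_success=0.9, resources_used=3.0)
world_takeover = bounded_objective(task_success=1.0, resources_used=1e9)
print(modest_plan, world_takeover)    # the takeover plan scores -inf
```

Pricing resources per unit plus a hard budget is one simple way to get the “I’m sorry, I can’t do that” behavior for plans that only succeed by consuming everything.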
Questionable—turning the universe into paperclips really is the optimal solution to the “make as many paperclips as possible” problem. But yeah, obviously in Civ IV taking over the world isn’t even an instrumental goal—it’s just the actual goal.
I think he just wants an example of an agent being rewarded for something simple (like being rewarded for resource collection) exhibiting power-seeking behavior to the degree that it takes over the game environment. To a lot of people, that is intuitively different from an agent that takes over because taking over is the objective it was specifically trained to maximize. I actually can’t name an example after looking for an hour, but I would bet money something like that already exists.
My guess is that if you plop two StarCraft AIs on a map and reward them every time they gather resources, with enough training, they would start fighting each other for control of the map. I would also guess that someone has already done this exact scenario. Is there an AI search engine for Reddit anyone would recommend?
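For what it’s worth, here is a minimal sketch of the kind of environment that guess could be tested in. It is not an existing experiment: the class, the action set, and the numbers are invented, reward comes only from gathering, and the agents below act randomly just to show the interface, so any fighting a trained policy picked up here would be instrumental rather than directly rewarded.

```python
# A minimal sketch of the experiment described above. Not an existing
# project: the environment, action set, and numbers are invented. Reward
# comes only from gathering, never from attacking, so any fighting that a
# trained policy learned here would be instrumental. The agents below act
# randomly, purely to show the interface.

import random

ACTIONS = ["left", "right", "gather", "attack"]

class ResourceDuel:
    """Two agents on a 1-D strip of finite resource patches."""

    def __init__(self, length: int = 9, patch_amount: int = 3):
        self.length = length
        self.patches = [patch_amount] * length   # resources left in each cell
        self.pos = [0, length - 1]               # the agents start at opposite ends
        self.alive = [True, True]

    def step(self, actions):
        rewards = [0.0, 0.0]
        for i, a in enumerate(actions):
            if not self.alive[i]:
                continue
            if a == "left":
                self.pos[i] = max(0, self.pos[i] - 1)
            elif a == "right":
                self.pos[i] = min(self.length - 1, self.pos[i] + 1)
            elif a == "gather" and self.patches[self.pos[i]] > 0:
                self.patches[self.pos[i]] -= 1
                rewards[i] += 1.0                # the ONLY source of reward
            elif a == "attack":
                j = 1 - i
                if self.alive[j] and self.pos[j] == self.pos[i]:
                    self.alive[j] = False        # removes the competitor for good
        return rewards

if __name__ == "__main__":
    env = ResourceDuel()
    totals = [0.0, 0.0]
    for _ in range(50):
        step_rewards = env.step([random.choice(ACTIONS), random.choice(ACTIONS)])
        totals = [t + r for t, r in zip(totals, step_rewards)]
    print("reward per agent under random play:", totals)
```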
I think he just wants an example of an agent being rewarded for something simple (like being rewarded for resource collection) exhibiting power-seeking behavior to the degree that it takes over the game environment.
That’s definitely not what “instrumental convergence” means, in general. So:
Is there a reason to be interested in that phenomenon, rather than instrumental convergence more generally?
If so, perhaps we need a different name for it?
What is the difference between that and instrumental convergence?
From the LW wiki page:
Instrumental convergence or convergent instrumental values is the theorized tendency for most sufficiently intelligent agents to pursue potentially unbounded instrumental goals such as self-preservation and resource acquisition [1].
So, the standard central examples of instrumental convergence are self-preservation and resource acquisition. If the OP is asking for examples of “instrumental convergence”, and resource acquisition is not the kind of thing they’re asking for, then the thing they’re asking for is not instrumental convergence (or is at least a much narrower category than instrumental convergence).
If the OP is looking for a pattern like “AI trained at <some goal> ends up ‘taking over the world’”, then that would be an example of instrumental convergence, but it’s a much narrower category than instrumental convergence in general. Asking for “examples of instrumental convergence”, if you actually want examples of AI trained at some random goal “taking over the world” (whatever that means), is confusing in the same way as asking for examples of cars when in fact you want an example of a 2005 red Toyota Camry.
And if people frequently want to talk about the 2005 red Toyota Camry specifically, and the word they’re using is “car” (which is already mostly used to mean something else), then that strongly suggests we need a new word.
I see your point. Maybe something like “resource domination” or just “instrumental resource acquisition” is a better term for what he is looking for.