What is wrong with the CIV IV example? Being surprising is not actually a requirement to test your theory against.
I’m interested in cases where there is a correct non-power-maximizing solution. For winning at Civ IV, taking over the world is the intended correct outcome. I’m hoping to find examples like the Strawberry Problem, where there is a correct non-world-conquering outcome (duplicate the strawberry) and taking over the world (in order to e.g. maximize scientific research on strawberry duplication) is an unwanted side-effect.
If I built a DWIM AI, told it to win at CIV IV, and it did this, I would conclude it was misaligned. Which is precisely why SpiffingBrit’s videos are so fun to watch (huge fan).
I think the correct solution to the strawberry problem would also involve a ton of instrumental convergence? You’d need to collect resources to do research/engineering, then set up systems to experiment with strawberries/biotech, then collect generally applicable information on strawberry duplication, and then apply that to duplicate the strawberry.
If I ask an AI to duplicate a strawberry and it takes over the world, I would consider that misaligned. Obviously it will require some instrumental convergence (resources, intelligence, etc.) to duplicate a strawberry. An aligned AI should either duplicate the strawberry while staying within a “budget” for how many resources it consumes, or say “I’m sorry, I can’t do that”.
I would recommend you read my post on corrigibility, which describes how we can mathematically define a tradeoff between success and resource exploitation.
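To make that kind of tradeoff concrete, here is a minimal Python sketch of one way it could be scored. This is my own toy illustration, not the formulation from the corrigibility post: the `Plan` fields, the penalty weight `LAMBDA`, and the hard cap `RESOURCE_BUDGET` are all hypothetical. Plans that exceed the cap are refused outright, and the remainder are ranked by expected success minus a resource penalty.

```python
# Toy sketch of a success-vs-resource-use tradeoff (illustrative only):
#   value(plan) = success(plan) - LAMBDA * resources(plan)
# with a hard resource budget that triggers refusal if no plan fits under it.

from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    success_prob: float    # probability the strawberry gets duplicated
    resources_used: float  # abstract units of matter/energy/compute consumed

LAMBDA = 0.1           # hypothetical penalty per unit of resources
RESOURCE_BUDGET = 5.0  # hypothetical hard cap the overseer will tolerate

def choose_plan(plans: list[Plan]) -> str:
    # Refuse any plan that blows past the hard budget.
    affordable = [p for p in plans if p.resources_used <= RESOURCE_BUDGET]
    if not affordable:
        return "I'm sorry, I can't do that."
    # Among affordable plans, trade success against resource consumption.
    best = max(affordable, key=lambda p: p.success_prob - LAMBDA * p.resources_used)
    return f"Executing: {best.name}"

plans = [
    Plan("duplicate strawberry in a small lab", success_prob=0.9, resources_used=2.0),
    Plan("convert the biosphere into strawberry research", success_prob=0.999, resources_used=1e9),
]
print(choose_plan(plans))  # -> Executing: duplicate strawberry in a small lab
```

The hard cap is what produces the “I’m sorry, I can’t do that” behavior; the linear penalty is just the simplest way to express a tradeoff between task success and resource exploitation.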
Questionable—turning the universe into paperclips really is the optimal solution to the “make as many paperclips as possible” problem. But yeah, obviously in Civ IV taking over the world isn’t even an instrumental goal—it’s just the actual goal.