This is one of the most important posts ever on LW though I don’t think the implications have been fully drawn out. Specifically, this post raises serious doubts about the arguments for AI x-risk as a result of alignment mismatch and the models used to talk about that risk. It undercuts both Bostrom’s argument that an AI will have a meaningful (self-aware?) utility function and Yudkowsky’s reward button parables.
The role these two arguments play in convincing people that AI x-risk is a hard problem is to explain why, if you don’t anthropomorphize, a program that is, say, excellent at conducting and scheduling interviews to ferret out moles in the intelligence community would try to manipulate external events at all, rather than merely reason about them to better catch moles. After all, people often fail to pursue even their fervent goals outside familiar contexts, so why would AI be different? Both arguments conclude that AI will inevitably act as if it is very effectively maximizing some simple utility function in all contexts and in all ways.
Bostrom tries to convince us that as creatures get more capable they tend to act more coherently (more as if they are governed by a global utility function). This is of course true for evolved creatures, but by offering a theory of how value-like things arise, this post predicts something different: if you only train your AI in a relatively confined class of circumstances (even if that requires making very accurate predictions about the rest of the world), it isn’t going to develop that kind of simple global value. Rather, it would likely find multiple shards in tension, with no clear direction, if forced to make value choices in very different circumstances. Similarly, it explains why the AI won’t just wirehead itself by pressing its reward button.