First, there is a lot packed into “makes the world much more predictable”. The only way I can envision this is taking over the world, and once you have done that, I’m not sure there is much left to do besides wirehead.
But even if it doesn’t involve that, I can point to other aspects that are favored by the base optimizer, like curiosity and learning, which wireheading goes against.
But actually, thinking more about this, I’m not even sure it makes sense to talk about inner alignment in the brain. What is the brain being aligned with? What is the base optimizer optimizing for? It is not intelligent and has no intent or world model; it’s just performing simple, local, mechanical updates on neural connections. I’m reminded of the Blue-Minimizing Robot post.
If humans decided to cut the pleasure sensors and stimulate the brain directly, would that be aligned? If we uploaded our brains into computers and wireheaded the simulation, would that be aligned? Where do we draw the boundary around the base optimizer?
It seems this question is posed in the wrong way, and it’s more useful to ask the question this post asks: how do we get human values, and what kind of values does a system trained in a way similar to the human brain develop? If there is some general force behind value learning that favors some values over others, that could inform us about the likely values of AIs trained via RL.