I think the key insight here is that the brain is not inner aligned, not even close
You say that, but don’t elaborate further in the comment. Which learned human values go against the base optimizer’s values (pleasure, pain, learning)?
Avoiding wireheading doesn’t seem like failed inner alignment—avoiding wireheading now can allow you to get even more pleasure in the future because wireheading makes you vulnerable/less powerful. The base optimizer is also searching for brain configurations which make good predictions about the world, and wireheading goes against that.
It’s possible to construct a wireheading scenario that avoids these objections. E.g., imagine it’s a “pleasure maximizing” AI that does the wireheading and ensures that the total amount of future pleasure is very high. We can even suppose that the AI makes the world much more predictable as well.
Despite leading to a lot of pleasure and making it possible to have very good predictions about the world, that really doesn’t seem like a successfully aligned AI to me.
First, there is a lot packed into “makes the world much more predictable”. The only way I can envision this is by taking over the world. After you do that, I’m not sure there is much more left to do than wirehead.
But even if it doesn’t involve that, I can pick other aspects that the base optimizer favors, like curiosity and learning, which wireheading goes against.
But actually, thinking more about this, I’m not even sure it makes sense to talk about inner alignment in the brain. What is the brain being aligned with? What is the base optimizer optimizing for? It is not intelligent, it does not have intent or a world model—it’s doing some simple, local mechanical update on neural connections. I’m reminded of the Blue-Minimizing Robot post.
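To make “simple, local mechanical update” concrete, here is a toy sketch (my own illustration with made-up names, not a claim about how the brain actually learns): a reward-modulated local weight change, with no agent, intent, or world model anywhere in it.

```python
import numpy as np

# A deliberately dumb "base optimizer": a purely local, reward-modulated
# weight update. Nothing here has intent, a world model, or any notion of
# what it is "optimizing for" -- it is just a mechanical rule applied to
# each connection. (Toy illustration only, not a model of real neurons.)

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4))   # connection strengths between two layers
learning_rate = 0.01

def local_update(weights, pre, post, reward):
    """Strengthen connections between co-active units when the reward signal
    is positive, weaken them when it is negative. Each weight only 'sees' its
    own pre/post activity and a single scalar reward."""
    return weights + learning_rate * reward * np.outer(post, pre)

# One update step: some activity pattern plus a reward signal nudges the weights.
pre_activity = rng.random(4)
post_activity = rng.random(4)
weights = local_update(weights, pre_activity, post_activity, reward=1.0)
```

Whatever “values” end up in the trained system are a downstream consequence of many such updates, which is why asking what the rule itself “wants” feels like a category error.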
If humans decided to cut the pleasure sensors and stimulate the brain directly, would that be aligned? If we uploaded our brains into computers and wireheaded the simulation, would that be aligned? Where do we place the boundary for the base optimizer?
It seems this question is posed in the wrong way, and it’s more useful to ask the question this post asks—how do we get human values, and what kind of values does a system trained in a way similar to the human brain develop? If there is some general force behind learning values that favors some values being learned over others, that could inform us about the likely values of AIs trained via RL.
Avoiding wireheading doesn’t seem like failed inner alignment—avoiding wireheading now can allow you to get even more pleasure in the future because wireheading makes you vulnerable/less powerful.
Even if this is the case, this is not why (most) humans don’t want to wirehead, in the same way that their objection to killing an innocent person whose organs could save 10 other people is not driven by some elaborate utilitarian argument that this would be bad for society.