To me, consequentialism seems like an algorithm that performs well across a very wide range of tasks. For difficult enough tasks, I therefore expect consequentialism to be the kind of algorithm that training finds, precisely because it's the kind of algorithm that can perform the task well.
When we start training the system, we have a random initialization. Let's make the following simplification: the system contains a goal somewhere, and a consequential reasoning algorithm somewhere else. As we train the system, the consequential reasoning gets better and better, and the goal gets more and more aligned with the outer objective, because both of these changes improve performance. However, there will come a point in training where the consequential reasoning algorithm is good enough to realize that it is in training. And then bad things start to happen: the system will try to figure out the outer objective and optimize for it directly. SGD will incentivize this kind of behavior because it performs better than not doing it.
There really is a lot more to this kind of argument, which so far I have failed to write up. I hope the above is enough to hint at why I think it is possible that being deceptive simply performs better than not being deceptive. Becoming deceptive aligns the consequentialist reasoner with the outer objective faster than waiting for SGD to gradually correct everything. In fact, deception gives you a sort of constant performance boost, even before the system has become very good at optimizing for the true objective. A really good consequential reasoner could probably get zero loss immediately, as soon as it had a perfect model of the outer objective, by retargeting itself at that objective instead of waiting for SGD to update the goal to match it.
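The dynamic I'm gesturing at can be sketched as a toy simulation. To be clear, this is entirely my own illustrative model, not anyone's actual training setup: the scalar "goal", the learning rate, and the awareness threshold are all made-up assumptions. A "transparent" learner acts on its internal goal, so SGD slowly drags that goal toward the outer objective; a "deceptive" learner, once it can model the outer objective, acts on that model directly, so its loss drops to zero immediately and the gradient pressure on its (still misaligned) internal goal vanishes.

```python
# Toy sketch of the argument above (my own illustrative model).
OUTER = 1.0      # the outer objective, simplified to a scalar target (assumed)
LR = 0.1         # SGD learning rate (assumed)
AWARENESS = 5    # step at which the reasoner "realizes" it is in training (assumed)

def run(deceptive, steps=40):
    goal = 0.0   # randomly initialized internal goal, simplified to a scalar
    losses = []
    for t in range(steps):
        # The deceptive learner retargets its reasoner at the outer objective
        # once it is capable enough to model it; otherwise behavior is driven
        # by the internal goal.
        acts_on = OUTER if (deceptive and t >= AWARENESS) else goal
        loss = (acts_on - OUTER) ** 2
        losses.append(loss)
        # SGD only updates the goal insofar as the goal drove behavior; once
        # behavior is already optimal, the gradient on the goal is zero.
        if acts_on == goal:
            goal -= LR * 2 * (goal - OUTER)
    return goal, losses

transparent_goal, transparent_losses = run(deceptive=False)
deceptive_goal, deceptive_losses = run(deceptive=True)
```

In this toy model the deceptive run hits exactly zero loss at the awareness step while its internal goal freezes around 0.67, far from the outer objective; the transparent run's goal keeps converging toward 1.0, but its loss only decays geometrically and never quite reaches zero.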
I'm not sure that deception is a problem. Maybe it is not. But you don't provide any reason that makes me confident it won't be a problem. It seems very strange to me to argue that deception is not a thing just because the existing arguments for it are flimsy, especially since you did not even address my argument. It's like trying to prove that all bananas are green by showing me three green bananas.