The problem is that in order to do anything useful, the AI must be able to learn. This means that even if you deliberately initialize it with a false belief, the learning process may later update that belief once the AI finds evidence that it is false. If AI safety relies on that false belief, you have a problem.
A possible solution would be to encode the false belief in a way that can’t be updated by learning, but doing so is a non-trivial problem.
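To make the point concrete, here is a minimal sketch, not taken from the discussion itself: a toy Bayesian learner where `bayes_update` and the likelihood values are hypothetical stand-ins for "learning". It shows how a deliberately false prior gets washed out by accumulating evidence unless the belief is explicitly exempted from updating.

```python
# Toy illustration (hypothetical): an agent initialized with a deliberately
# false belief about a binary proposition. Ordinary Bayesian updating washes
# the false prior out; only a belief frozen outside the learning rule survives.

def bayes_update(p_true: float, likelihood_if_true: float, likelihood_if_false: float) -> float:
    """Posterior probability that the proposition is true after one observation."""
    numerator = p_true * likelihood_if_true
    denominator = numerator + (1.0 - p_true) * likelihood_if_false
    return numerator / denominator

# Deliberately initialize the agent to believe the proposition is almost
# certainly false (p_true = 0.01), even though the world keeps producing
# evidence that fits "true" far better than "false".
p_true = 0.01
frozen_p_true = 0.01   # same initial belief, but exempted from updating

for step in range(20):
    # Each observation is 5x more likely if the proposition is true.
    p_true = bayes_update(p_true, likelihood_if_true=0.5, likelihood_if_false=0.1)
    # The "frozen" belief ignores the evidence entirely.

print(f"learned belief after 20 observations: {p_true:.4f}")    # approaches 1.0
print(f"frozen belief after 20 observations:  {frozen_p_true}")  # still 0.01
```

The frozen variant is exactly the non-trivial part: in a realistic learner the belief is not a single named variable you can exempt, and the system may route around the frozen value by learning equivalent knowledge elsewhere.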
Isn’t that what simulations are for? By “lie” I mean lying about how reality works. The AI will make its decisions based on its best data, so we should make sure that data is initially harmless. Even if it later figures out that the data is wrong, we’ll still have the decisions it made at the start, and those are by far the most important.