I don’t really understand these solutions that are so careful to maintain our honesty when checking the AI for honesty. Why does it matter so much if we lie? An FAI would forgive us for that, being inherently friendly and all, so what is the risk in starting the AI with a set of explicitly false beliefs? Why is it so important to avoid that? Especially since it can update later to correct for those false beliefs after we’ve verified it to be friendly. An FAI would trust us enough to accept our later updates, even in the face of the very real possibility that we’re lying to it again.
I mean, the point is to start the AI off in a way that intentionally puts it at a reality disadvantage, so that even if it’s way more intelligent than us, it has to do so much work to make sense of the world that it doesn’t have the resources to be dishonest in an effective manner. At that point, it doesn’t matter what criteria we’re using to prove its honesty.
The problem is that in order to do anything useful, the AI must be able to learn. This means that even if you deliberately initialize it with a false belief, the learning process might later revise that belief once it finds evidence that it is false. If AI safety relies on that false belief, you have a problem.
A possible solution would be to encode the false belief in a way that can’t be updated by learning, but doing so is a non-trivial problem.
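To make the two cases above concrete, here is a minimal sketch that assumes the AI's learning looks like plain Bayesian updating (an assumption for illustration; nothing in this thread says a real system would work this way). The function names and the likelihood numbers are hypothetical. A planted belief that leaves the truth even a small prior probability gets overturned by repeated evidence, while a belief encoded with prior exactly zero never moves.

```python
def update(prior_true, likelihood_if_true, likelihood_if_false):
    """One Bayes step: posterior probability that the hypothesis is true."""
    numerator = prior_true * likelihood_if_true
    denominator = numerator + (1.0 - prior_true) * likelihood_if_false
    return numerator / denominator


def belief_after(prior_true, n_observations,
                 likelihood_if_true=0.8, likelihood_if_false=0.2):
    """Apply the same piece of evidence n_observations times in a row."""
    p = prior_true
    for _ in range(n_observations):
        p = update(p, likelihood_if_true, likelihood_if_false)
    return p


# "Soft" lie: the AI starts out 99% sure of the falsehood (1% on the truth).
soft_lie = belief_after(0.01, 10)

# "Hard" lie: the truth is given prior probability exactly 0.
hard_lie = belief_after(0.0, 10)

print(f"soft lie, belief in truth after 10 observations: {soft_lie:.4f}")
print(f"hard lie, belief in truth after 10 observations: {hard_lie:.4f}")
```

In this toy model the "can't be updated" version is trivial to write down, but only because the belief is a single number. The non-trivial part is doing the equivalent for a learner whose beliefs are not stored as one isolated probability.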
Isn’t that what simulations are for? By “lie” I mean lying about how reality works. It will make its decisions based on its best data, so we should make sure that data is initially harmless. Even if it figures out that that data is wrong, we’ll still have the decisions it made from the start—those are by far the most important.