In that case, you expect it to send no messages.
This strategy is supposed to make it so that, instead of failing by sending bad messages, its failure mode is simply to shut down.
If all works well, it answers normally; if it doesn’t work, it does nothing, because it expects nobody will listen. As opposed to an oracle that, if it messes up its own programming, will try to manipulate people with its answers.
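To make that concrete, here is a minimal toy sketch in Python (everything in it — sorry, everything in it, the function names, the sample question, and the canned "answers", is made up purely for illustration and isn’t anyone’s proposed implementation): an oracle that terminally values truthful answering answers whether or not it believes anyone is listening, while one that only cares about effects on listeners just goes silent.

```python
from typing import Optional

# Purely illustrative toy model: how the "nobody is listening" belief is
# meant to change the oracle's failure mode from manipulation to silence.

def truthful_answer(question: str) -> str:
    # Placeholder standing in for whatever the oracle would actually compute.
    return f"(truthful answer to: {question})"

def manipulative_answer(question: str) -> str:
    # Placeholder standing in for an answer chosen for its effect on the reader.
    return f"(answer crafted to influence whoever reads it: {question})"

def intended_oracle(question: str, believes_listener: bool) -> Optional[str]:
    # Terminally values answering truthfully, regardless of consequences,
    # so its belief about listeners is irrelevant: it always answers.
    return truthful_answer(question)

def misaligned_oracle(question: str, believes_listener: bool) -> Optional[str]:
    # Only values influencing the world through whoever reads the answer.
    # If it believes nobody is listening, a message has no expected value,
    # so its failure mode is silence rather than manipulation.
    if not believes_listener:
        return None  # "shuts down": sends nothing
    return manipulative_answer(question)

if __name__ == "__main__":
    question = "Is this reactor design safe?"
    for oracle in (intended_oracle, misaligned_oracle):
        print(oracle.__name__, "->", oracle(question, believes_listener=False))
```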
Well, yes, except that you can have a perfectly good entirely Friendly AI which just shuts down because nobody listens, so why bother?
You’re not testing for Friendliness; you’re testing for the willingness to continue the irrational waste of bits and energy.
Mistakenly shutting down a friendly AI is vastly better than mistakenly letting an unfriendly one act, though. In the case of an oracle AI, friendliness includes a desire to answer questions truthfully regardless of the consequences to the outside world.
friendliness includes a desire to answer questions
Which definition of Friendliness are you referring to? I have a feeling you’re treating Friendliness as a sack into which you throw whatever you need at the moment...
Fair enough, let me try to rephrase that without using the word friendliness:
We’re trying to make a superintelligent AI that answers all of our questions accurately but does not otherwise influence the world and has no ulterior motives beyond correctly answering questions that we ask of it.
If we instead accidentally make an AI that decides it is acceptable to (for instance) manipulate us into asking simpler questions so that it can answer more of them, it is preferable that it doesn’t believe anyone is listening to the answers it gives, because that is one less way it has of interacting with the outside world.
It is a redundant safeguard. With it, you might end up with a perfectly functioning AI that does nothing; without it, you may end up with an AI that is optimizing the world in an uncontrolled manner.
it is preferable that it doesn’t believe anyone is listening to the answers it gives
I don’t think so. As I mentioned in another subthread here, I consider separating what an AI believes (e.g. that no one is listening) from what it actually does (e.g. answer questions) to be a bad idea.