Thank you for writing this comment—it made clearer to me what you mean by the doctrine of logical infallibility, and I think there may be a more precise way to express it.
It seems to me that you’re not getting at logical infallibility, since the AGI could be perfectly willing to act humbly about its logical beliefs, but rather value infallibility or goal infallibility. An AI does not expect its goal statement Y to be fallible: any uncertainty in Y can only be represented by Y being a fuzzy object itself, not by the AI evaluating Y and somehow deciding “no, I was mistaken about Y.”
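To make that concrete, here is a minimal sketch (in Python, with entirely made-up names; this is an illustration, not any particular proposal) of what “uncertainty in Y represented by Y being a fuzzy object” might look like: the goal is a weighted mixture over candidate goal functions, and the rule for reweighting the mixture is itself part of the fixed goal, so there is no step at which the AI stands outside Y and rejects it.

```python
# Sketch only -- all names here are illustrative, not from any real system.
# The goal Y is a weighted mixture over candidate goal functions. The agent
# can shift weight between candidates as evidence comes in, but the mixture
# and the reweighting rule *are* the goal: there is no step where the agent
# evaluates Y from outside and decides it was mistaken about Y.

candidate_goals = {
    "reported_happiness": lambda world: world["reported_happiness"],
    "dopamine_levels":    lambda world: world["dopamine"],
}

weights = {"reported_happiness": 0.6, "dopamine_levels": 0.4}

def utility(world):
    """Y as a fuzzy object: an expectation over candidate goals."""
    return sum(w * candidate_goals[name](world) for name, w in weights.items())

def update_weights(likelihoods):
    """Reweighting is part of the fixed goal specification; arguments
    against the mixture itself have no entry point here."""
    for name in weights:
        weights[name] *= likelihoods.get(name, 1.0)
    total = sum(weights.values())
    for name in weights:
        weights[name] /= total
```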
In the case where the Maverick Nanny is programmed to “ensure the brain chemistry of humans resembles the state extracted from this training data as much as possible,” there is no way to convince the Maverick Nanny that it is somehow misinterpreting its goal; it knows that it is supposed to ensure perceptions about brain chemistry, and any statements you make about “true happiness” or “human rights” are irrelevant to brain chemistry, even though it might be perfectly willing to consider your advice on how best to achieve that value or manipulate the physical universe.
In the case where the AI is programmed to “do whatever your programmers tell you will make humans happy,” the AI again thinks its values are infallible: it should do what its programmers tell it to do, so long as they claim it will make humans happy. It might be uncertain about what its programmers meant, and so it would be possible to convince this AI that it misunderstood their statements, and then it would change its behavior—but it won’t be convinced by any arguments that it should listen to all of humanity, instead of its programmers.
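A similarly hedged sketch of this second architecture (again, hypothetical names throughout): the channel the AI defers to is baked into the goal, while only the interpretation of statements arriving over that channel is uncertain.

```python
# Sketch only, hypothetical names throughout. The source the AI defers to
# is baked into the goal; only the interpretation of what arrives over that
# channel is genuinely uncertain, so "listen to humanity instead" is never
# evaluated as a goal-relevant argument.

AUTHORIZED_SOURCE = "programmers"   # fixed by the goal statement, not a belief

def interpretations(statement):
    """Candidate readings with probabilities -- the one place where the AI
    can be convinced it misunderstood and change its behaviour."""
    return [(statement.strip().lower(), 1.0)]   # placeholder interpretation model

def should_act_on(statement, source, claims_it_makes_humans_happy):
    # Deference to the authorized source is part of the goal itself.
    return source == AUTHORIZED_SOURCE and claims_it_makes_humans_happy
```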
But expressed this way, it’s not clear to me where you think the inconsistency comes in. If the AI isn’t programmed to have an ‘external conscience’ in its programmers or humanity as a whole, then their dissatisfaction doesn’t matter. If it is programmed to use them as a conscience, but the way in which it does so is exploitable, then that isn’t very binding. Figuring out how to give it the right conscience / right values is the open problem that MIRI and others care about!
It seems to me that you’re not getting at logical infallibility, since the AGI could be perfectly willing to act humbly about its logical beliefs, but rather value infallibility or goal infallibility. An AI does not expect its goal statement Y to be fallible:
Which AI? As so often, an architecture-dependent issue is being treated as a universal truth.
Figuring out how to give it the right conscience / right values is the open problem that MIRI and others care about!
The others mostly aren’t thinking in terms of “giving” (i.e. hardcoding) values. There is a valid critique to be made of that assumption.
Which AI? As so often, an architecture-dependent issue is being treated as a universal truth.
This statement maps to “programs execute their code.” I would be surprised if that were controversial.
The others mostly aren’t thinking in terms of “giving” (i.e. hardcoding) values. There is a valid critique to be made of that assumption.
This was covered by the comment about “meta-values” earlier, and “Y being a fuzzy object itself,” which is probably not as clear as it could be. The goal management system grounds out somewhere, and that root algorithm is what I’m considering the “values” of the AI. If it can change its mind about what to value, the process it uses to change its mind is the actual fixed value. (If it can change its mind about how to change its mind, the fixedness goes up another level; if it can completely rewrite itself, now you have lost your ability to be confident in what it will do.)
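As an illustration of the “grounds out somewhere” point (purely a sketch under my own assumptions; none of the names correspond to a real system): the object-level goal below can be replaced, but only through a fixed revision procedure, so that procedure is the de facto root value.

```python
# Sketch only (hypothetical names). The object-level goal can be replaced,
# but only via revise_goal, and revise_goal never changes -- so it is the
# de facto root value. Push the same move up a level (a fixed rule for
# changing revise_goal) and the fixed point simply moves with it.

def current_goal(world):          # object-level goal; replaceable
    return world["paperclips"]

def acceptance_score(justification):
    """Hard-coded criteria; the AI cannot change its mind about these
    without some further, equally fixed, meta-level rule."""
    return justification.get("expected_value_under_current_goal", 0.0)

def revise_goal(proposed_goal, justification):
    """The fixed root algorithm: adopts a new object-level goal only if the
    justification scores well under the hard-coded criteria."""
    global current_goal
    if acceptance_score(justification) > 0.9:
        current_goal = proposed_goal
```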
Which AI? As so often, an architecture-dependent issue is being treated as a universal truth.
This statement maps to “programs execute their code.” I would be surprised if that were controversial.
Humans can fail to realise the implications of uncontroversial statements. Humans are failing to realise that goal stability is architecture-dependent.
This was covered by the comment about “meta-values” earlier, and “Y being a fuzzy object itself,” which is probably not as clear as it could be. The goal management system grounds out somewhere, and that root algorithm is what I’m considering the “values” of the AI.
But you shouldn’t be, at least in an un-scare-quoted sense of “values”. Goals and values aren’t descriptive labels for de facto behaviour. The goal of a paperclipper is to make paperclips; if it crashes as an inevitable result of executing its code, we don’t say, “Aha! It had the goal to crash all along.”
Goal stability doesn’t mean following code, since unstable systems follow their code too, at least using the actual meaning of “goal”.
Meta: trying to defend a claim by changing the meaning of its terms is doomed to failure.