A belief that M has no impact will generate extremely poor predictions of the future, iff it’s a UFAI; it’s interesting to have a prescriptive belief in which Friendly agents definitionally believe a true thing and Unfriendly agents definitionally believe a false thing.
It’s a solution that seems mostly geared towards problems of active utility deception: it prevents certain cases of an AI deliberately gaming a metric. To the extent that a singleton is disingenuous about its own goals, this is a neat approach. I am a little worried that this kind of deliberate deception is stretching the orthogonality thesis further than it can plausibly go; with the right kind of careful experimentation and self-analysis, a UFAI with a prescribed falsehood might derive a literally irreconcilable set of beliefs about the world. I don’t know whether that would crack it down the middle, or how ‘fundamental’ a crack that would be in its conception of reality.
We also have the problem of not being able to specify our own values with precision. If an AI would produce catastrophic results by naively following our exact instructions, then M will presumably be using the same metric, and it will give a green light to a machine that proceeds to break down all organic molecules for additional stock market construction projects or something. I suppose that this isn’t really the sort of problem you’re trying to solve, but it is an inherent limitation of M, even though M is fairly passive.
Whenever I use the colloquial phrase “the AI believes a false X,” I mean that we are using utility indifference to accomplish that goal, without actually giving the AI false beliefs.
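To spell out what I mean by that, here is a minimal sketch in roughly Armstrong’s utility-indifference formulation; the symbols U, E, and c are my own notation, not anything from the original proposal. Rather than making the agent believe a falsehood about some event E, we add a compensatory term to its utility function so that it is indifferent to whether E occurs:

$$
U'(w) =
\begin{cases}
U(w) & \text{if } E \text{ does not occur in } w,\\
U(w) + c & \text{if } E \text{ occurs in } w,
\end{cases}
\qquad
c = \mathbb{E}[U \mid \lnot E] - \mathbb{E}[U \mid E].
$$

With c chosen this way, the agent’s expected utility does not depend on whether E happens, so it has no incentive to influence the probability of E; it acts as though E were irrelevant to it without actually holding any false belief about E.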