We don’t think that C has any expected edge in identifying an attack that A does not—if C is up to something, and A can’t tell, then we think that the “being up to something” provides no evidence about whether something bad is about to happen. This is much stronger than a fixed probability of failure.
Also, can you walk through an example of how C might “trick” A[C] and how subjective dominance implies that we don’t expect C to succeed?
Simple caricatured examples:
C might propose a design for a computer that has a backdoor that an attacker can use to take over the computer. But if this backdoor will actually be effective, then A[C] will know about it.
C might propose a design that exploits a predictable flaw in A’s reasoning (e.g. overlooking consequences of a certain kind, being overly optimistic about some kinds of activities, incorrectly equating two importantly different quantities...). But then A[C] will know about it, and so if A[C] actually reasons in that way then (in some sense) it is endorsed.
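The two caricatured examples can be caricatured further in code. The sketch below is a toy simulation (all names and probabilities are my own illustrative assumptions, not part of the original argument): C only attempts a backdoor when it believes the backdoor will work, and A[C] shares everything C knows, so every attack C expects to succeed gets flagged — C has no expected edge.

```python
import random

random.seed(0)

# Toy model: C proposes designs, some with a backdoor. A[C] shares all of
# C's knowledge, so any fact C could use to predict that the backdoor works
# is also available to A[C]. (The setup and numbers are illustrative.)

def run_trial():
    # C's private "insight": a belief about whether a backdoor would work.
    c_belief_backdoor_works = random.random() < 0.5
    # Ground truth is noisy even given C's belief.
    works = c_belief_backdoor_works if random.random() < 0.9 \
        else not c_belief_backdoor_works
    # C attacks only when it believes the backdoor will work.
    attacks = c_belief_backdoor_works
    # A[C] sees everything C sees, so it flags exactly when C believes
    # the attack will work.
    flagged = c_belief_backdoor_works
    return attacks, flagged, works

trials = [run_trial() for _ in range(10_000)]
undetected_successes = sum(1 for a, f, w in trials if a and w and not f)
print(undetected_successes)  # 0: every attack C expects to work is flagged
```

The point of the toy is only that "C has no expected edge" is a structural claim about shared knowledge, not a fixed failure probability: undetected successful attacks are zero by construction, not merely rare.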
I don’t know how to make sense of this. If an impoverished perspective expects C not to be able to “trick” A, it seems kind of obvious that that’s not good enough for me to expect the same?
If an impoverished perspective expects C not to be able to trick A, then either:
In fact C won’t be able to trick A.
C will trick A, but the perspective is too weak to tell.
I think I don’t quite understand what you are saying here, what exactly is obvious?
From a suitably advanced perspective it’s obvious that C will be able to trick A sometimes—it will just get “epistemically lucky” and make an assumption that A regards as silly but turns out to be right.
I think I don’t quite understand what you are saying here, what exactly is obvious?
I think I expressed myself badly there. What I mean is that it seems a sensible default to not trust an impoverished perspective relative to oneself, and you haven’t stated a reason why we should trust the impoverished perspective. This seems to be at least a big chunk of the formalization of universality that you haven’t sketched out yet.
Suppose that I convinced you “if you didn’t know much chemistry, you would expect this AI to yield good outcomes.” I think you should be pretty happy. It may be that the AI would predictably cause a chemistry-related disaster in a way that would be obvious to you if you knew chemistry, but overall I think you should expect not to have a safety problem.
This feels like an artifact of a deficient definition; I should never end up with a lemma like “if you didn’t know much chemistry, you’d expect this AI to yield good outcomes” rather than being able to directly say what we want to say.
That said, I do see some appeal in proving things like “I expect running this AI to be good,” and if we are ever going to prove such statements they are probably going to need to be from some impoverished perspective (since it’s too hard to bring all of the facts about our actual epistemic state into such a proof), so I don’t think it’s totally insane.
If we had a system that is ascription universal from some impoverished perspective, you may or may not be OK. I’m not really worrying about it; I expect this definition to change before the point where I literally end up with a system that is ascription universal from some impoverished perspective, and this definition seems good enough to guide next research steps.
C might propose a design for a computer that has a backdoor that an attacker can use to take over the computer. But if this backdoor will actually be effective, then A[C] will know about it.
C might propose a design that exploits a predictable flaw in A’s reasoning (e.g. overlooking consequences of a certain kind, being overly optimistic about some kinds of activities, incorrectly equating two importantly different quantities...). But then A[C] will know about it, and so if A[C] actually reasons in that way then (in some sense) it is endorsed.
These remind me of Eliezer’s notions of epistemic and instrumental efficiency, where the first example (about the backdoor) would roughly correspond to A[C] being instrumentally efficient relative to C, and the second example (about predictable bias) would correspond to A[C] being epistemically efficient relative to C.