I think things are already fine for any spike outside S, e.g. a paperclip maximiser, since non-obstruction doesn’t say anything there.
I actually think saying “our goals aren’t on a spike” amounts to a stronger version of my [assume humans know what the AI knows as the baseline]. I’m now thinking that neither of these will work, for much the same reason (see below).
The way I’m imagining spikes within S is like this:
- We define a pretty broad S, presumably implicitly, hoping to give ourselves a broad range of non-obstruction.
- For any P in U, we later conclude that our actual goals are in T ⊂ U ⊂ S.
- We optimise for AU on T, overlooking some factors that are important for P in U \ T.
- We do better on T than we would have by optimising more broadly over U (we can cut corners in U \ T).
- We do worse on U \ T, since we weren’t directly optimising for that set (AU on U \ T varies quite a lot).
- We then get an AU spike within U, peaking on T.
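A throwaway numerical sketch of the same picture, just to make “spike within U, peaking on T” concrete. All the sets, names and AU numbers here are invented purely for illustration; none of them come from the post or any real setup:

```python
# Toy sketch of the spike scenario above (all numbers made up for illustration).
S = set(range(10))     # the broad set we want non-obstruction over
U = {0, 1, 2, 3, 4}    # the goals our true goal might actually be among
T = {0, 1}             # the subset we later conclude contains our true goal

# AU we end up with after optimising hard for T:
# high on T (we cut corners elsewhere), lower and more variable on U \ T.
au_spiky = {0: 0.95, 1: 0.93,                          # the spike, peaking on T
            2: 0.40, 3: 0.15, 4: 0.55,                 # U \ T pays for the cut corners
            5: 0.30, 6: 0.25, 7: 0.35, 8: 0.20, 9: 0.30}

# What optimising more broadly over all of U might have given instead: flatter, no spike.
au_broad = {P: (0.75 if P in U else 0.30) for P in S}

for P in sorted(U):
    region = "T    " if P in T else "U \\ T"
    print(f"goal {P} ({region}): spiky {au_spiky[P]:.2f} vs broad {au_broad[P]:.2f}")
```

The point of the toy numbers: the narrowly-optimised landscape beats the broad one on T, but is substantially worse for goals in U \ T.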
The reason I don’t think telling the AI something like “our goals aren’t on a spike” will help is that this would not be a statement about our goals, but about our understanding and competence. It would be to say that we never optimise for a goal set we mistakenly believe includes our true goals (and that we hit what we aim for similarly well for any target within S).
It amounts to saying something like “We don’t have blind-spots”, “We won’t aim for the wrong target”, or, in the terms above, “We will never mistake any T for any U”.
In this context, this is stronger and more general than my suggestion of “assume for the baseline that we know everything you know” (lack of that knowledge is just one way to screw up the optimisation target).
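Spelled out a little more formally (my paraphrase, using the sets above, not notation from the post), the claim we’d be asking the AI to assume is roughly:

$$\text{if our true goal } P \text{ lies in some } U \subseteq S \text{ and we optimise for a } T \text{ we believe contains } P, \text{ then in fact } P \in T,$$

together with roughly uniform competence at hitting whichever target within S we do pick.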
In either case, this is equivalent to telling the AI to assume an unrealistically proficient/well-informed pol.
The issue is that, as far as non-obstruction is concerned, the AI can then take actions which have arbitrarily bad consequences for us if we don’t perform as well as pol.
I.e. non-obstruction then doesn’t provide any AU guarantee if our policy isn’t actually that good.
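For reference, the reason the guarantee evaporates: as I understand the post’s definition (the notation here is my rough paraphrase of it), non-obstruction on S only constrains value as evaluated under the policy function pol:

$$\forall P \in S:\quad V^{\mathrm{pol}(P)}_P(s \mid \text{AI on}) \;\ge\; V^{\mathrm{pol}(P)}_P(s \mid \text{AI off}).$$

If the policy we actually follow is some less capable $\mathrm{pol}'(P)$, nothing in this inequality bounds $V^{\mathrm{pol}'(P)}_P(s \mid \text{AI on})$, which is the problem above.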
My current intuition is that assumptions of the form “assume our goals aren’t on a spike”, “assume we know everything you know”, etc. only avoid creating other serious problems if they’re actually true, since then the AI’s prediction of pol’s performance isn’t unrealistically high.
Even for “we know everything you know”, that’s a high bar if it has to apply when the AI is off.
For “our goals aren’t on a spike”, it’s an even higher bar.
If we could actually make it true that our goals weren’t on a spike in this sense, that’d be great.
I don’t see any easy way to do that.
[Perhaps it would hold if the ability to successfully optimise for S already put such high demands on our understanding that distinguishing Ts from Us became comparatively easy... That seems unlikely to me.]