I mentioned two seemingly valid approaches that would lead to different beliefs for the human, and asked how the AI could choose between them. You then went up a level of meta, to preferences over the deliberative process itself.
The AI was choosing what text to show Petrov. I suggested the AI choose the text based on the features that would lead Petrov (or an appropriate idealization) to say that one text or the other is better, e.g. informativeness, concision, etc. I wouldn’t describe that as “going up a level of meta.”
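To make that concrete, here is a minimal sketch of that kind of selection rule; the feature proxies and the weights are hypothetical placeholders for whatever Petrov (or an appropriate idealization of him) would actually endorse, not a claim about how he weighs these features.

```python
# Minimal sketch: pick among candidate texts by scoring the features Petrov
# would cite when judging one text better than another. The feature proxies
# and weights are illustrative placeholders, not Petrov's actual judgements.

def informativeness(text: str) -> float:
    # Placeholder proxy: count distinct words.
    return float(len(set(text.lower().split())))

def concision(text: str) -> float:
    # Placeholder proxy: shorter texts score higher.
    return 1.0 / (1.0 + len(text.split()))

# Hypothetical weights for how strongly each feature counts.
FEATURES = [(informativeness, 1.0), (concision, 20.0)]

def endorsement_score(text: str) -> float:
    """Weighted sum standing in for 'how strongly Petrov would endorse this text'."""
    return sum(weight * feature(text) for feature, weight in FEATURES)

def choose_text(candidates: list[str]) -> str:
    # Show the candidate that the proxy says Petrov would judge best.
    return max(candidates, key=endorsement_score)

options = [
    "The incoming track is probably a sensor glitch.",
    "Analysis of the radar return, launch-site activity, and historical "
    "false-alarm rates suggests the incoming track is probably a sensor glitch.",
]
print(choose_text(options))
```

The point is only that the selection criterion lives at the level of features Petrov would cite, not at the level of which belief the text will leave him with.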
But I don’t think the meta preferences are more likely to be consistent—if anything, probably less so. And the meta-meta-preferences are likely to be completely underdefined, except in a few philosophers.
It seems to me like Petrov does have preferences about descriptions that the AI could provide, e.g. views about which are accurate, useful, and non-manipulative. And he probably has views about what ways of thinking about things are going to improve accuracy. If you want to call those “meta preferences” then you can do that, but then why think that those are undefined?
Also it’s not like we are passing to the meta level to avoid inconsistencies at the object level. It’s that Petrov’s object-level preference looks like “option #1 is better than option #2, but ‘whichever option I’d pick after thinking for a while’ is better than either of them”.
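One way to picture that preference structure (purely illustrative; the option labels and the ranking are placeholders):

```python
# Illustrative sketch of a preference ordering in which "whichever option I'd
# pick after thinking for a while" outranks both concrete options, even though
# the concrete options are themselves comparable.

from dataclasses import dataclass

@dataclass(frozen=True)
class Option:
    name: str

OPTION_1 = Option("option #1")
OPTION_2 = Option("option #2")
DELIBERATE = Option("whichever option I'd pick after thinking for a while")

# Petrov's current ranking, best first.
PETROV_RANKING = [DELIBERATE, OPTION_1, OPTION_2]

def prefers(a: Option, b: Option) -> bool:
    return PETROV_RANKING.index(a) < PETROV_RANKING.index(b)

assert prefers(OPTION_1, OPTION_2)
assert prefers(DELIBERATE, OPTION_1) and prefers(DELIBERATE, OPTION_2)
```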
Doing corrigibility without keeping an eye on the outcome seems, to me, similar to many failed AI safety approaches: focusing on the local “this sounds good”, rather than on the global “but it may cause extinction of sentient life”.
This doesn’t seem right to me.
Though we are assuming that neither the AI nor the human is supposed to look at the conclusion, this may just result in either a random walk or in optimisation pressure from hidden processes inside the definition.
Thinking about a problem without knowing the answer in advance is quite common. The fact that you don’t know the answer doesn’t mean that it’s a random walk. And the optimization pressure isn’t hidden—when I try to answer a question by thinking harder about it, there is a huge amount of optimization pressure to get to the right answer, it’s just that it doesn’t take the form of knowing which answer is correct and then backwards chaining from that to figure out what deliberative process would lead to the correct answer.
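A mundane illustration of that kind of optimisation pressure, offered as a rough analogy only: bisection converges on the correct answer without ever being told what that answer is.

```python
# Illustration: bisection exerts strong optimisation pressure toward the
# correct answer without starting from knowledge of that answer.

def bisect_root(f, lo: float, hi: float, iterations: int = 50) -> float:
    """Find x with f(x) ~ 0, assuming f(lo) < 0 < f(hi)."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# The procedure is never told that the answer is sqrt(2); it converges anyway.
print(bisect_root(lambda x: x * x - 2, 0.0, 2.0))  # ~1.41421356
```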
I think we have a strong intuitive disagreement here, which explains our differing judgements.
I think we both agree that a) there is a sense of corrigibility for humans interacting with humans in typical situations, and b) there are thought experiments (e.g. a human given more time to reflect) that extend this beyond typical situations.
We possibly also agree on c): corrigibility is not uniquely defined.
I intuitively feel that there is not a well-defined version of corrigibility that works for arbitrary agents interacting with arbitrary agents, or even for arbitrary agents interacting with humans (except for one example; see below).
One of the reasons for this is my experience of how hard intuitive human concepts are to scale up, at least without considering human preferences. See this comment for an example in the “low impact” setting.
So corrigibility feels like it’s in the same informal category as low impact. It also admits a lot of possible contradictions in how it is applied, depending on which corrigibility preferences and meta-preferences the AI chooses to use. Contradictions are opportunities for the AI to choose an outcome, randomly or with some optimisation pressure.
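A toy illustration of that last point (the preference cycle and the AI’s side-objective below are made up): when the stated preferences are contradictory, an optimiser can pick whichever outcome best serves its own criterion while still pointing to a supporting preference.

```python
# Toy illustration: cyclic (contradictory) preferences leave the outcome
# underdetermined, so an optimiser with its own side-objective can choose
# whichever outcome suits it while still citing a preference that supports it.

# Hypothetical stated preferences: A > B, B > C, C > A (a cycle).
stated_preferences = [("A", "B"), ("B", "C"), ("C", "A")]

# Hypothetical side-objective the AI happens to be optimising.
side_objective = {"A": 0.2, "B": 0.9, "C": 0.5}

def defensible_outcomes(prefs):
    # Every outcome that is preferred to something can be "justified" somehow.
    return {better for better, _ in prefs}

def ai_choice(prefs, objective):
    # Among the defensible outcomes, pick the one the side-objective likes best.
    return max(defensible_outcomes(prefs), key=objective.get)

print(ai_choice(stated_preferences, side_objective))  # -> "B", justified by "B > C"
```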
But haven’t I argued that human preferences themselves are full of contradictions? Indeed; and resolving these contradictions is an important part of the challenge. But I’m much more optimistic about getting to a good place by explicitly resolving the contradictions in humans’ overall preferences than by resolving the contradictions in their corrigibility preferences (and if “human corrigibility preferences” include enough general human preferences to make them safe, is this really corrigibility we’re talking about?).
To develop that point slightly, I see optimising for anything that doesn’t include safety or alignment as likely to sacrifice safety or alignment; so either optimising for corrigibility will sacrifice them, or the concepts of safety (and most of alignment) are already present in “corrigibility”.
I do know one version of corrigibility that makes sense: it explicitly looks at what the human’s preferences will end up being, and attempts to minimise the rigging of that process. That’s one of the reasons I keep coming back to the outcome.
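As a very rough sketch of what “minimising the rigging” could mean (the hands-off baseline, the divergence penalty, and the numbers below are my own illustrative assumptions, not the worked-out proposal): penalise the AI in proportion to how much its policy shifts the distribution over the human’s eventual preferences, relative to a default policy that leaves the human alone.

```python
# Rough sketch: penalise the AI for shifting the distribution over the human's
# eventual preferences away from what a hands-off default policy would give.
# The distributions and the penalty weight are hypothetical.

import math

def kl_divergence(p: dict, q: dict) -> float:
    """KL(p || q) over a shared finite set of possible final preferences."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

# Distribution over which preferences the human ends up endorsing...
default_policy_outcome = {"pref_A": 0.5, "pref_B": 0.5}  # ...if the AI stays hands-off
ai_policy_outcome = {"pref_A": 0.9, "pref_B": 0.1}       # ...under the AI's actual policy

task_reward = 1.0               # hypothetical reward for doing its job well
rigging_penalty_weight = 5.0    # hypothetical weight on the rigging penalty

score = task_reward - rigging_penalty_weight * kl_divergence(
    ai_policy_outcome, default_policy_outcome
)
print(score)  # heavily penalised: this policy rigs the outcome toward pref_A
```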
I would prefer it if you presented an example of a setup, maybe one with some corrigibility-like features, rather than giving a general setup and saying “corrigibility will solve the problems with this”.