Hey, David here!
Just writing to give some context… The point of this session was to discuss an issue I see with “super-human feedback (SHF)” schemes (e.g. debate, amplification, recursive reward modelling) that use helper AIs to inform human judgments. I guess there was more of an inferential gap going into the session than I expected, so for background: let’s consider the complexity theory viewpoint on feedback (as discussed in section 2.2 of “AI safety via debate”). This implicitly assumes that we have access to a trusted (e.g. human) decision-making process (TDMP), sweeping the issues that Stuart mentions under the rug.
Under this view, the goal of SHF is to efficiently emulate the TDMP, accelerating the decision-making. For example, we’d like an agent trained with SHF to be able to quickly (e.g. in a matter of seconds) make decisions that would take the TDMP billions of years to decide. But we don’t aim to change the decisions.
Now, the issue I mentioned is: there doesn’t seem to be any way to evaluate whether the SHF-trained agent is faithfully emulating the TDMP’s decisions on such problems. It seems like, naively, the best we can do is train on problems where the TDMP can make decisions quickly, so that we can use its decisions as ground truth; then we just hope that it generalizes appropriately to the decisions that take the TDMP billions of years. And the point of the session was to see if people have ideas for how to do less naive experiments that would allow us to increase our confidence that an SHF-scheme would yield safe generalization to these more difficult decisions.
Imagine there are 2 copies of me, A and B. A makes a decision with some helper AIs, and independently, B makes a decision without their help. A and B make different decisions. Who do we trust? I’m more ready to trust B, since I’m worried about the helper AIs having an undesirable influence on A’s decision-making.
--------------------------------------------------------------------
...So questions of how to define human preferences or values seem mostly orthogonal to this question, which is why I want to assume them away. However, our discussion did make me consider more that I was making an implicit assumption (and this seems hard to avoid), that there was some idealized decision-making process that is assumed to be “what we want”. I’m relatively comfortable with trusting idealized versions of “behavioral cloning/imitation/supervised learning” (P) or “(myopic) reinforcement learning/preference learning” (NP), compared with the SHF-schemes (PSPACE).
One insight I gleaned from our discussion is the usefulness of disentangling:
1. an idealized process for *defining* “what we want”. (HCH was mentioned as potentially a better model of this than “a single human given as long as they want to think about the decision”, which is what I proposed using for the purposes of the discussion.)
2. a means of *approximating* that definition.
From this perspective, the discussion topic was how we can gain empirical evidence on the following question: “Assuming that the output of a human’s indefinite deliberation is a good definition of ‘what they want’, do SHF-schemes do a good/safe job of approximating that?”
Did anyone have ideas for this (i.e. for less naive experiments that would increase our confidence in safe generalization)? My thinking is that you have to understand or make some assumptions about the nature of the TDMP in order to have confidence about safe generalization, because if you just treat it as a black box, then it might be that for some class of queries it will do something that can’t be approximated by SHF-schemes. No matter how you test, you can only conclude that if such queries exist, they are not in the test sets you used.
Or was the discussion more about, assuming we have theoretical reasons to think that SHF-schemes can approximate TDMP, how to test it empirically?
Regarding the question of how to do empirical work on this topic: I remember there being one thing which seemed potentially interesting, but I couldn’t find it in my notes (yet).
RE the rest of your comment: I guess you are taking issue with the complexity theory analogy; is that correct? An example hypothetical TDMP I used is “arbitrarily long deliberation” (ALD), i.e. a single human is allowed as long as they want to make the decision (I don’t think that’s a perfect “target” for alignment, but it seems like a reasonable starting point). I don’t see why ALD would (even potentially) “do something that can’t be approximated by SHF-schemes”, since those schemes still have the human making a decision.
“Or was the discussion more about, assuming we have theoretical reasons to think that SHF-schemes can approximate TDMP, how to test it empirically?” <-- yes, IIUC.
Suppose there’s a cryptographic hash function H inside a human brain whose algorithm is not introspectively accessible, and some secret state S which is also not introspectively accessible. In each period, the human can choose to run S|Output := H(S|Input) and observe/report Output, so we can ask ALD: what’s Output if you iterate H n times with X as the initial Input, updating S each time? (I can try to clarify if it’s not clear what I mean.) I think this can’t be approximated by SHF-schemes, because there’s no way to train ML to approximate H to serve as the baseline agent.
So what is this an analogy for? I think H could stand for human philosophical deliberation, and S for any introspectively inaccessible information in our brain that might go into and be changed by such deliberation.
Yes, please try to clarify. In particular, I don’t understand your “|” notation (as in “S|Output”).
I realized that I was a bit confused in what I said earlier. I think it’s clear that (proposed) SHF schemes should be able to do at least as well as a human, given the same amount of time, because they have a human “on top” (as “CEO”) who can simply ignore all the AI helpers(/underlings).
But now I can also see an argument for why SHF couldn’t do ALD, if it doesn’t have arbitrarily long to deliberate: there would need to be some parallelism/decomposition in SHF, and that might not work well/perfectly for all problems.
“|” meant concatenation, so “S|Output := H(S|Input)” means you set S to the first half of H(S|Input), and Output to the second half of H(S|Input).
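To make the construction concrete, here is a minimal sketch of the iteration as I read it. Several things in it are assumptions for illustration only: SHA-256 stands in for the hypothetical H, the names `step` and `iterate` are made up, and I assume each period’s Output is fed back in as the next period’s Input (the description above only fixes X as the *initial* Input).

```python
import hashlib
from typing import Tuple


def step(state: bytes, inp: bytes) -> Tuple[bytes, bytes]:
    """One period: S|Output := H(S|Input).

    SHA-256 is only a stand-in for the hypothetical H.
    Returns (new S, Output) as the two halves of the digest.
    """
    digest = hashlib.sha256(state + inp).digest()  # H(S|Input), 32 bytes
    half = len(digest) // 2
    return digest[:half], digest[half:]


def iterate(secret_state: bytes, x: bytes, n: int) -> bytes:
    """Run n periods starting from Input = x, updating S each time.

    Assumption: the Output of one period becomes the Input of the next.
    With n == 0 nothing is run and x is returned unchanged.
    """
    state, value = secret_state, x
    for _ in range(n):
        state, value = step(state, value)
    return value


if __name__ == "__main__":
    # The caller never sees secret_state directly, only the final Output.
    print(iterate(b"introspectively-inaccessible", b"X", 3).hex())
```

The point of the sketch is that the final Output depends on every intermediate S, so an outside party who sees only (X, n, Output) pairs and cannot read S or H has nothing simpler than H itself to fit.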
OK, so it sounds like your argument why SHF can’t do ALD is (a specific, technical version of) the same argument that I mentioned in my last response. Can you confirm?
I’m not sure. It seems like my argument applies even if SHF did have arbitrarily long to deliberate?
Aha, OK. So I either misunderstand or disagree with that.
I think SHF schemes (at least most examples) have the human as “CEO” with AIs as “advisers”, and thus the human can choose to ignore all of the advice and make the decision unaided.
Agree that it’s useful to disentangle them, but it’s also useful to realise that they can’t be fully disentangled… yet.
I actually don’t understand why you say they can’t be fully disentangled.
IIRC, it seemed to me during the discussion that your main objection was around whether (e.g.) “arbitrarily long deliberation (ALD)” was (or could be) fully specified in a way that accounts properly for things like deception, manipulation, etc. More concretely, I think you mentioned the possibility of an AI affecting the deliberation process in an undesirable way.
But I think it’s reasonable to assume (within the bounds of a discussion) that there is a non-terrible way (in principle) to specify things like “manipulation”. So do you disagree? Or is your objection something else entirely?
Hey there!
I gave a longer answer here: https://www.lesswrong.com/posts/Q7WiHdSSShkNsgDpa/how-much-can-value-learning-be-disentangled