Yeah, I agree debate seems less obvious. I guess I’m more interested in the iterated amplification claim, since it seems like you do see iterated amplification as depending on “avoiding manipulation” or “making a clean distinction between good and bad reasoning”, and that feels confusing to me. (Whereas with debate I can see the arguments for debate incentivizing manipulation, and I don’t think they’re obviously wrong, or obviously correct.)
I still feel like there is some weird counterfactual incentive to manipulate the process
Yeah, this argument makes sense to me, though I question how much such incentives matter in practice. If we include incentives like this, then I’m saying “I think the incentives a) arise in any situation and b) don’t matter in practice, since they never get invoked during training”. (Not just for the automated decomposition example; I think similar arguments apply, though less strongly, to situations involving actual humans.)
there isn’t even an incentive to predict humans in strong generality, much less manipulate them, but that is because the examples are simple and aren’t designed to have information in common with how humans work.
Agreed.
1) being convinced that there is not an incentive to predict humans in generality (predicting humans only when they are very strictly following a non-humanlike algorithm doesn’t count as predicting humans in generality), or 2) being convinced that this incentive to predict the humans is sufficiently far from incentive to manipulate.
I’m not claiming (1) in full generality. I’m claiming that there’s a spectrum of how much incentive there is to predict humans in generality. On one end we have the automated examples I mentioned above, and on the other end we have sales and marketing. It seems like where we are on this spectrum is primarily dependent on the task and the way you structure your reasoning. If you’re just training your AI system on making better transistors, then it seems like even if there’s a human in the loop your AI system is primarily going to be learning about transistors (or possibly about how to think about transistors in the way that humans think about transistors). Fwiw, I think you can make a similar claim about debate.
If we use iterated amplification to aim for corrigibility, that will probably require the system to learn about agency, though I don’t think it obviously has to learn about humans.
I might also be claiming (2), except I don’t know what you mean by “sufficiently far”. I can understand how prediction behavior is “close” to manipulation behavior (in that many of the skills needed for the first are relevant to the second and vice versa); if that’s what you mean then I’m not claiming (2).
If humans have some innate ability to imitate some non-human process, and use that ability to answer the questions, and thinking about humans does not aid in thinking about that non-human process, I agree that you are not providing any incentive to think about the humans. However, it feels like a lot has to go right for that to work.
I’m definitely not claiming that we can do this. But I don’t think any approach could possibly meet the standard of “thinking about humans does not aid in the goal”; at the very least there is probably some useful information to be gained in updating on “humans decided to build me”, which requires thinking about humans. Which is part of why I prefer thinking about the spectrum.