I am having a hard time generating any ontology that says:
I don’t see [let’s try to avoid giving models strong incentives to learn how to manipulate humans] as particularly opposed to methods like iterated amplification or debate.
Here are some guesses:
You are distinguishing between an incentive to manipulate real-life humans and an incentive to manipulate human models?
You are claiming that the point of e.g. debate is that when you do it right there is no incentive to manipulate?
You are focusing on the task/output of the system, and internal incentives to learn how to manipulate don’t count?
These are just guesses.
You are focusing on the task/output of the system, and internal incentives to learn how to manipulate don’t count?
This seems closest, though I’m not saying that internal incentives don’t count; rather, I don’t see what these incentives even are (or maybe I see them in the superintelligent utility maximizer model, but not in other models).
Do you agree that the agents in Supervising strong learners by amplifying weak experts don’t have an incentive to manipulate the automated decomposition strategies?
If yes, then if we now change to a human giving literally identical feedback, do you agree that nothing would then change (i.e. the resulting agent would not have an incentive to manipulate the human)?
If yes, then what’s the difference between that scenario and one where there are internal incentives to manipulate the human?
Possibly you say no to the first question because of wireheading-style concerns; if so, my follow-up question would probably be something like “why doesn’t this apply to any system trained from a feedback signal, whether human-generated or automated?” (Though that’s from a curious-about-your-beliefs perspective. On my beliefs, I mostly reject wireheading as a specific thing to be worried about, and think of it as a non-special instance of a broader class of failures.)
Unedited stream of thought:
Before trying to answer the question, I’m just gonna say a bunch of things that might not make sense (either because I am being unclear or being stupid).
So, I think the debate example is much more *about* manipulation than the iterated amplification example, and I was largely replying to the class that includes both IA and debate. I can imagine saying that iterated amplification done right does not provide an incentive to manipulate the human.
I think that a process optimizing directly for finding a fixed point of X = Amplify_H(X) does have an incentive to manipulate the human. However, this is not exactly what IA is doing, because IA only has the gradients pass through the first X in the fixed-point equation, and I can imagine arguing that the incentive to manipulate comes from having the gradient pass through the second X. If you iterate enough times, I think you might effectively have some optimization juice passing through modifying the second X, but it might be much less. I am confused about how to think about optimization towards a moving target being different from optimization towards finding a fixed point.
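To make the distinction I have in mind concrete, here is a minimal toy sketch (my own construction in PyTorch, not the actual IA training setup; M, H, and amplify are illustrative stand-ins, and H is just a differentiable placeholder for the human). The only difference between the two losses is whether the amplified target is detached, i.e. whether the gradient is allowed to pass through the second X:

```python
# Toy sketch: distillation toward an amplified target, with and without
# letting gradients flow through the target (the "second X").
import torch

torch.manual_seed(0)
M = torch.nn.Linear(4, 4)            # stand-in for the learned agent X
H = torch.nn.Linear(4, 4)            # stand-in for a differentiable model of the human
opt = torch.optim.SGD(M.parameters(), lr=1e-2)

def amplify(h, m, x):
    # Toy "Amplify_H(X)": the human(-model) combines two calls to the agent.
    return h(m(x) + m(0.5 * x))

for step in range(200):
    x = torch.randn(32, 4)
    target = amplify(H, M, x)

    # Variant A (how I understand IA): the target is treated as fixed data,
    # so the gradient only passes through the first X, i.e. M(x).
    loss_iterate = ((M(x) - target.detach()) ** 2).mean()

    # Variant B (direct fixed-point optimization): the gradient also passes
    # through the second X, i.e. back through amplify and hence through H.
    loss_fixed_point = ((M(x) - target) ** 2).mean()

    opt.zero_grad()
    loss_iterate.backward()          # swap in loss_fixed_point for variant B
    opt.step()
```

In variant B nothing about H is being trained, but the gradient signal reaching M now depends on H’s internals rather than only on H’s outputs, which is the sense in which I’d say more optimization juice flows through the human side of the fixed-point equation.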
I think that even if you only look at the effect of following the gradients that come from changing the first X, you are at least providing an incentive to predict the human on a wide range of inputs. In some cases, your range of inputs might be such that there isn’t actually information about the human in the answers, which I think is what you are trying to get at with the automated decomposition strategies. If humans have some innate ability to imitate some non-human process, and use that ability to answer the questions, and thinking about humans does not aid in thinking about that non-human process, I agree that you are not providing any incentive to think about the humans. However, it feels like a lot has to go right for that to work.
On the other hand, maybe we just think it is okay to predict, but not manipulate, the humans, while they are answering questions whose answers have a lot of common information with how humans work, which is what I think IA is supposed to be doing. In this case, even if I were to say that there is no incentive to “manipulate the human”, I still argue that there is an “incentive to learn how to manipulate the human”, because predicting the human (on a wide range of inputs) is a very similar task to manipulating the human.
Okay, now I’ll try to answer the question. I don’t fully understand the question, but I assume you are talking about the incentive to manipulate in the simple examples with permutations etc. in the experiments. I think there is no ability to manipulate those processes, and thus no gradient signal towards manipulation of the automated process. I still feel like there is some weird counterfactual incentive to manipulate the process, but I don’t know how to say what that means, and I agree that it does not affect what actually happens in the system.
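(To gesture at what I mean by those automated processes: below is a toy sketch of a permutation-style decomposition, roughly in the spirit of the experiments in that paper; the specific names and setup here are mine, not the paper’s. The point is just that the overseer is a fixed piece of code with no human anywhere in it, so there is nothing human-shaped to predict, let alone manipulate.)

```python
# Toy automated decomposition for "where does x end up after applying sigma 2^k times?".
# The decomposition itself is fixed, non-human code: it splits the question into
# two half-size subquestions and composes the answers.
from typing import Callable, List

def decompose(sigma: List[int], x: int, k: int,
              subagent: Callable[[List[int], int, int], int]) -> int:
    """Answer sigma^(2^k)(x) by asking a subagent two questions about sigma^(2^(k-1))."""
    if k == 0:
        return sigma[x]                     # base case: apply sigma once
    mid = subagent(sigma, x, k - 1)         # where does x land after 2^(k-1) steps?
    return subagent(sigma, mid, k - 1)      # and where does that land after 2^(k-1) more?

def exact_subagent(sigma: List[int], x: int, k: int) -> int:
    # Ground-truth stand-in for the learned subagent, used to sanity-check the scheme.
    for _ in range(2 ** k):
        x = sigma[x]
    return x

sigma = [2, 0, 3, 1]                        # a permutation of {0, 1, 2, 3}
assert decompose(sigma, 0, 3, exact_subagent) == exact_subagent(sigma, 0, 3)
```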
I agree that changing to a human will not change anything (except insofar as the system is also told, or can deduce, that it is interacting with a human, and thus ignores the gradient signal in order to execute some treacherous turn). Anyway, in those worlds we have likely already lost, and I am not focusing on them. I think the short answer to your question is that in practice, no, there is no difference, and there isn’t even incentive to predict humans in strong generality, much less manipulate them, but that is because the examples are simple and not trying to have common information with how humans work.
I think there are two paths of crux opportunities for me to go down here, and I’m sure we could find more: 1) being convinced that there is not an incentive to predict humans in generality (predicting humans only when they are very strictly following a non-humanlike algorithm doesn’t count as predicting humans in generality), or 2) being convinced that this incentive to predict the humans is sufficiently far from incentive to manipulate.
Yeah, I agree debate seems less obvious. I guess I’m more interested in the iterated amplification claim since it seems like you do see iterated amplification as opposed to “avoiding manipulation” or “making a clean distinction between good and bad reasoning”, and that feels confusing to me. (Whereas with debate I can see the arguments for debate incentivizing manipulation, and I don’t think they’re obviously wrong, or obviously correct.)
I still feel like there is some weird counterfactual incentive to manipulate the process
Yeah, this argument makes sense to me, though I question how much such incentives matter in practice. If we include incentives like this, then I’m saying “I think the incentives a) arise for any situation and b) don’t matter in practice, since they never get invoked during training”. (Not just for the automated decomposition example; I think similar arguments apply less strongly to situations involving actual humans.)
there isn’t even incentive to predict humans in strong generality, much less manipulate them, but that is because the examples are simple and not trying to have common information with how humans work.
Agreed.
1) being convinced that there is not an incentive to predict humans in generality (predicting humans only when they are very strictly following a non-humanlike algorithm doesn’t count as predicting humans in generality), or 2) being convinced that this incentive to predict the humans is sufficiently far from incentive to manipulate.
I’m not claiming (1) in full generality. I’m claiming that there’s a spectrum of how much incentive there is to predict humans in generality. On one end we have the automated examples I mentioned above, and on the other end we have sales and marketing. It seems like where we are on this spectrum is primarily dependent on the task and the way you structure your reasoning. If you’re just training your AI system on making better transistors, then it seems like even if there’s a human in the loop your AI system is primarily going to be learning about transistors (or possibly about how to think about transistors in the way that humans think about transistors). Fwiw, I think you can make a similar claim about debate.
If we use iterated amplification to aim for corrigibility, that will probably require the system to learn about agency, though I don’t think it obviously has to learn about humans.
I might also be claiming (2), except I don’t know what you mean by “sufficiently far”. I can understand how prediction behavior is “close” to manipulation behavior (in that many of the skills needed for the first are relevant to the second and vice versa); if that’s what you mean then I’m not claiming (2).
If humans have some innate ability to imitate some non-human process, and use that ability to answer the questions, and thinking about humans does not aid in thinking about that non-human process, I agree that you are not providing any incentive to think about the humans. However, it feels like a lot has to go right for that to work.
I’m definitely not claiming that we can do this. But I don’t think any approach could possibly meet the standard of “thinking about humans does not aid in the goal”; at the very least there is probably some useful information to be gained in updating on “humans decided to build me”, which requires thinking about humans. Which is part of why I prefer thinking about the spectrum.