I want to consider models that learn to predict both “how a human will answer question Q” (the instrumental model) and “the real answer to question Q” (the intended model). These two models share almost all of their computation — which is dedicated to figuring out what actually happens in the world. They differ only when it comes time to actually extract the answer. I’ll describe the resulting model as having a “world model,” an “instrumental head,” and an “intended head.”
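For concreteness, here is a minimal sketch of the kind of shared-trunk, two-headed architecture described above, assuming a PyTorch-style transformer; the specific layers, sizes, and names are illustrative assumptions rather than anything the post specifies. The point of the sketch is only that the two heads share all of the world-model computation and differ only at the final extraction step.

```python
# Minimal sketch (illustrative only): a shared "world model" trunk with two
# output heads, one meant to predict how a human would answer ("instrumental")
# and one meant to report the real answer ("intended"). All layer choices and
# sizes are hypothetical; the post does not specify an implementation.
import torch
import torch.nn as nn

class TwoHeadedQA(nn.Module):
    def __init__(self, hidden_dim: int = 512, vocab_size: int = 32000):
        super().__init__()
        # Shared computation: figuring out what actually happens in the world.
        self.world_model = nn.Sequential(
            nn.Embedding(vocab_size, hidden_dim),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
                num_layers=6,
            ),
        )
        # The two heads differ only in how they extract an answer from the shared state.
        self.instrumental_head = nn.Linear(hidden_dim, vocab_size)  # "how a human would answer Q"
        self.intended_head = nn.Linear(hidden_dim, vocab_size)      # "the real answer to Q"

    def forward(self, question_tokens: torch.Tensor):
        state = self.world_model(question_tokens)   # (batch, seq, hidden)
        pooled = state.mean(dim=1)                  # crude pooling, just for the sketch
        return self.instrumental_head(pooled), self.intended_head(pooled)
```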
This seems massively underspecified in that it’s really unclear to me what’s actually different between the instrumental and intended models.
I say this because you posit that the intended model gives “the real answer”, but I don’t see any means offered by which to tell “real” answers from “fake” ones. Further, for somewhat deep philosophical reasons, I also don’t expect there is any such thing as a “real” answer anyway, only one that is more or less useful to some purpose; and since it’s ultimately humans setting all this up, any “real” answer is ultimately a human answer.
The only difference I can find seems to be a subtle one about whether you’re directly or indirectly imitating human answers. That’s probably relevant for dealing with a class of failure modes like overindexing on what humans actually do vs. what we would do if we were smarter, knew more, etc., but it still leaves you with human imitation, since there’s still imitation of human concerns taking place.
Now, that actually sounds kinda good to me, but it’s not what you seem to be explicitly saying when you talk about the instrumental and intended models.
I don’t think anyone has a precise general definition of “answer questions honestly” (though I often consider simple examples in which the meaning is clear). But we do all understand how “imitate what a human would say” is completely different (since we all grant the possibility of humans being mistaken or manipulated), and so a strong inductive bias towards “imitate what a human would say” is clearly a problem to be solved even if other concepts are philosophically ambiguous.
Sometimes a model might say something like “No one entered the datacenter” when what they really mean is “Someone entered the datacenter, got control of the hard drives with surveillance logs, and modified them to show no trace of their presence.” In this case I’d say the answer is “wrong;” when such wrong answers appear as a critical part of a story about catastrophic failure, I’m tempted to look at why they were wrong to try to find a root cause of failure, and to try to look for algorithms that avoid the failure by not being “wrong” in the same intuitive sense. The mechanism in this post is one way that you can get this kind of wrong answer, namely by imitating human answers, and so that’s something we can try to fix.
On my perspective, the only things that are really fundamental are:
Algorithms to train ML systems. These are programs you can run.
Stories about how those algorithms lead to bad consequences. These are predictions about what could/would happen in the world. Even if they aren’t predictions about what observations a human would see, they are the kind of thing that we can all recognize as a prediction (unless we are taking a fairly radical skeptical perspective which I don’t really care about engaging with).
Everything else is just a heuristic to help us understand why an algorithm might work or where we might look for a possible failure story.
I think this is one of the upsides of my research methodology—although it requires people to get on the same page about algorithms and about predictions (of the form “X could happen”), we don’t need to start on the same page about all the other vague concepts. Instead we can develop shared senses of those concepts over time by grounding them out in concrete algorithms and failure stories. I think this is how shared concepts are developed in most functional fields (e.g. in mathematics you start with a shared sense of what constitutes a valid proof, and then build shared mathematical intuitions on top of that by seeing what successfully predicts your ability to write a proof).
Stories about how those algorithms lead to bad consequences. These are predictions about what could/would happen in the world. Even if they aren’t predictions about what observations a human would see, they are the kind of thing that we can all recognize as a prediction (unless we are taking a fairly radical skeptical perspective which I don’t really care about engaging with).
In the spirit, then, of caring about stories about how algorithms lead to bad consequences, here’s a story about how I see the lack of a clear distinction between the instrumental and intended models coming back to bite you.
Let’s use your example of a model that reports “no one entered the datacenter”. I might think the right answer is “no one entered the datacenter” even when I in fact know that someone was physically in the datacenter, because they were an authorized person. If I’m reporting this in the context of being asked about a security breach, saying “no one entered the datacenter” when I more precisely mean “no unauthorized person entered the datacenter” might be totally reasonable.
In this case there’s some ambiguity about what reasonably counts as “no one”. This is perhaps somewhat contrived, but category ambiguity is a cornerstone of linguistic confusion, and it’s where I see the division between instrumental and intended models breaking down. I think there is probably some chunk of things we could screen off by making this distinction that are obviously wrong (e.g. the model that tries to tell me “no one entered the datacenter” when in fact, even given my context of a security breach, some unauthorized person did enter the datacenter), and that seems useful, so I’m mainly pushing on the idea that your approach seems insufficient for addressing alignment concerns on its own.
Not that you necessarily thought it was, but this seems like the relevant kind of issue to want to consider here.
Reading this thread, I wonder if the apparent disagreement doesn’t come from the use of the word “honestly”. The way I understand Paul’s statement of the problem is that “answer questions honestly” could be replaced by “answer questions appropriately to the best of your knowledge”. And his point is that “answer what a human would have answered” is not a good proxy for that (yet still an incentivized one due to how we train neural nets).
From my reading of it, this post’s proposal does provide some plausible ways to incentivize the model to actually search for appropriate answers instead of the ones a human would have given, and I don’t think it assumes the existence of true categories and/or essences.
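To make the “incentivized due to how we train neural nets” point concrete, here is a minimal sketch of the kind of training step being assumed, reusing the hypothetical two-headed model sketched earlier; the loss and interface here are illustrative assumptions, not something specified in the post.

```python
# Illustrative (assumed) training step: the only supervised signal is agreement
# with human-provided answers, so a head that imitates "what a human would say"
# fits this objective exactly as well as a head that answers to the best of the
# model's knowledge.
import torch.nn.functional as F

def training_step(model, question_tokens, human_answer_ids, optimizer):
    instrumental_logits, intended_logits = model(question_tokens)
    # Both heads are scored against the same human labels; nothing in the loss
    # itself distinguishes "the real answer" from "what a human would answer".
    loss = (F.cross_entropy(instrumental_logits, human_answer_ids)
            + F.cross_entropy(intended_logits, human_answer_ids))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```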