Stories about how those algorithms lead to bad consequences. These are predictions about what could/would happen in the world. Even if they aren’t predictions about what observations a human would see, they are the kind of thing that we can all recognize as a prediction (unless we are taking a fairly radical skeptical perspective which I don’t really care about engaging with).
In the spirit, then, of caring about stories about how algorithms lead to bad consequences, here's a story about how I see not making a clear distinction between instrumental and intended models coming back to bite you.
Let’s use your example of a model that reports “no one entered the data center”. I might think the right answer is “no one entered the data center” even when I know that someone was physically in the data center, because they were an authorized person. If I’m reporting this in the context of a question about a security breach, saying “no one entered the data center” when I more precisely mean “no unauthorized person entered the data center” might be totally reasonable.
In this case there’s some ambiguity about what reasonably counts as “no one”. This is perhaps somewhat contrived, but category ambiguity is a cornerstone of linguistic confusion, and it’s where I see the division between instrumental and intended models breaking down. I think there’s probably some chunk of things we could screen off by making this distinction because they’re obviously wrong (e.g. the model that tries to tell me “no one entered the data center” when in fact, even given my context of a security breach, some unauthorized person did enter the data center), and that seems useful, so I’m mainly pushing on the idea that your approach seems insufficient for addressing alignment concerns on its own.
Not that you necessarily thought it was, but this seems like the relevant kind of issue to want to consider here.
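To make the ambiguity concrete, here is a toy sketch (the log format and names are my own, purely illustrative, not anything from the post) of how the same facts can make “no one entered the data center” come out true or false depending on which reading of the question is intended:

```python
# Toy illustration: the same access log makes "no one entered the data center"
# true or false depending on which predicate the question is really about --
# any physical entry vs. any unauthorized entry.

access_log = [
    {"person": "alice", "authorized": True},  # an authorized maintenance visit
]

def no_one_entered(log, literal=True):
    """Literal reading counts every physical entry; the security-breach
    reading only counts unauthorized entries."""
    if literal:
        return len(log) == 0
    return not any(not entry["authorized"] for entry in log)

print(no_one_entered(access_log, literal=True))   # False: someone was physically inside
print(no_one_entered(access_log, literal=False))  # True: no unauthorized entry
```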
Reading this thread, I wonder whether the apparent disagreement comes from the use of the word “honestly”. The way I understand Paul’s statement of the problem, “answer questions honestly” could be replaced by “answer questions appropriately to the best of your knowledge”. And his point is that “answer what a human would have answered” is not a good proxy for that (yet it is still an incentivized one, due to how we train neural nets).
From my reading of it, this post’s proposal does provide some plausible ways to incentivize the model to actually search for appropriate answers instead of the ones a human would have given, and I don’t think it assumes the existence of true categories and/or essences.