It seems to me that there are a bunch of standard arguments which you are ignoring because they’re formulated in an old frame that you’re trying to avoid. And those arguments in fact carry over just fine to your new frame if you put a little effort into thinking about the translation, but you’ve instead thrown the baby out with the bathwater without actually trying to make the arguments work in your new frame.
I agree that we may need to be quite skillful in providing “good”/carefully considered reward signals on the data distribution actually fed to the AI. (I also think it’s possible we have substantial degrees of freedom there.) In this sense, we might need to give “robustly” good feedback.
However, one intuition which I hadn’t properly communicated was: to make OP’s story go well, we don’t need e.g. an outer objective which robustly grades every plan or sequence of events the AI could imagine, such that optimizing that objective globally produces good results. That isn’t just good reward signals on the data distribution (e.g. real vs. fake diamonds); it’s non-upwards-error reward signals in all AI-imaginable situations, which seems thoroughly doomed to me. And this story avoids at least that problem, which I am relieved by. (And my current guess is that this “robust grading” problem doesn’t just reappear elsewhere, although I think there are still a range of other difficult problems remaining. See also my post Alignment allows “nonrobust” decision-influences and doesn’t require robust grading.)
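To make that distinction concrete, here’s a toy sketch of my own (the setup, the `true_value`/`grader` names, and the numbers are all hypothetical illustrations, not anything from the actual training story): a proxy grader with ordinary per-plan errors looks fine when we only grade sampled, on-distribution plans, but argmaxing that same grader over every plan the agent can imagine selects exactly the plans where its error is most upward.

```python
# Toy illustration (my own, hypothetical): on-distribution feedback vs.
# globally argmaxed grading of a noisy proxy.
import random

random.seed(0)

N_PLANS = 10_000
true_value = [random.uniform(0, 10) for _ in range(N_PLANS)]   # pretend ground-truth value of each plan
grader = [v + random.gauss(0, 3) for v in true_value]          # proxy grader with per-plan error

# Regime 1: grade plans the overseer actually samples (on-distribution).
# Errors exist but aren't adversarially selected, so they roughly average out.
sample = random.sample(range(N_PLANS), 20)
avg_error = sum(grader[i] - true_value[i] for i in sample) / len(sample)
print(f"average grading error on sampled plans: {avg_error:+.2f}")

# Regime 2: the agent argmaxes the grader over every plan it can imagine.
# The selected plan is precisely where the grader's error is most upward
# (Goodhart via adversarial selection against the grader's mistakes).
best = max(range(N_PLANS), key=lambda i: grader[i])
print(f"argmax plan: grade {grader[best]:.1f} vs. true value {true_value[best]:.1f} "
      f"(error {grader[best] - true_value[best]:+.1f})")
```

The point is just that “give carefully considered feedback on the data you actually train on” and “have no exploitable upward errors anywhere in plan-space” are demands of very different strengths.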
And so I might have been saying “Hey, isn’t this cool, we can avoid the worst parts of Goodhart by exiting outer/inner as a frame” while thinking of the above intuition (but not communicating it explicitly, because I didn’t yet have that clarity). But maybe you reacted “??? How does this avoid the need to reliably grade on-distribution situations? It’s totally nontrivial to do that, and it seems quite probable that we have to.” Both seem true to me!
(I’m not saying this was the whole of our disagreement, but it seems like a relevant guess.)