In this argument, you’ve implicitly assumed that there is only one function/structure which suffices for getting high enough training performance to be selected while also not being a long-term objective (aka a deceptive objective).
I could imagine this being basically right, but it certainly seems non-obvious to me.
E.g., there might be many things represented in the world model that are extremely highly correlated with reward. Or, more generally, there are in principle many objective computations that would result in trying as hard to get reward as the deceptive model would.
(The potential for “multiple” such objectives only makes a constant-factor difference, but the same is true of deceptive objectives.)
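To make the constant-factor point concrete, here is a toy version of the counting ratio (purely illustrative, assuming a roughly uniform prior over the objective computations that fit training, with $N_{\text{deceptive}}$ and $N_{\text{non-deceptive}}$ as notional counts of each type):

$$\frac{P(\text{deceptive} \mid \text{fits training})}{P(\text{non-deceptive} \mid \text{fits training})} \approx \frac{N_{\text{deceptive}}}{N_{\text{non-deceptive}}}$$

Allowing $k$ non-deceptive objectives instead of one just multiplies the denominator by a constant, so the shape of the argument changes only if $k$ grows comparably to $N_{\text{deceptive}}$.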
The fact that these objectives generalize differently maybe implies they aren’t “aligned”, but in that case there is another key category of objectives: non-exactly-aligned and non-deceptive objectives. And obviously our AI isn’t going to be literally exactly aligned.
Note that non-exactly-aligned and non-deceptive objectives could suffice for safety in practice even if not perfectly aligned (e.g. due to myopia).
Yep, that’s exactly right. As always, once you start making more complex assumptions, things get more complicated and it gets harder to model them in nice, concrete mathematical terms. I would still defend the value of having actual concrete mathematical models here: I think it’s super easy to confuse yourself in this domain if you aren’t doing that (e.g., as I think the confused reasoning about counting arguments in this post demonstrates). So I like having really concrete models, but only in the “all models are wrong, but some are useful” sense, as I talk about in “In defense of probably wrong mechanistic models.”
Also, the main point I was trying to make is that the counting argument is both sound and consistent with known generalization properties of machine learning (and in fact predicts them), and for that purpose I went with the simplest possible formalization of the counting argument.