Your steps (2)-(4) seem to rely fairly heavily on the naturality of the class described in (1), e.g. because (2) has to recognize (1)s, which requires that we can point to (1)s. If by “with the [[sole?]] goal of imitating Evan” you mean that
A. the model is actually really *only* trying to imitate Evan,
B. the model is competent enough to not accidentally also try to do something else (e.g. because the ways it pursues its goal are themselves malign under distributional shift), and
C. the training process you use will not tip the internal dynamics of the model over into a strategically malign state (there was never any incentive to prevent that from happening any more robustly than just barely enough to get good answers on the training set, and I think we agree that there’s a whole pile of [ability to understand and pursue far-reaching consequences] sitting in the model, making strategically malign states pretty close in model-space for natural metrics),
then yes, this would plausibly not be deceptive, but it seems like a very unnatural class. I tried to argue that it’s unnatural in the long paragraph with the different kinds of myopia, where “by (strong) default” = “it would be unnatural to be otherwise”.
Note that (A) and (B) are not actually that hard—e.g. LCDT solves both problems.
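(To gesture at the mechanism being claimed here: a minimal toy sketch of the LCDT-style move of cutting the causal influence of your own action on other agents when evaluating options. This is purely illustrative, with made-up names, probabilities, and payoffs; it is not code from the LCDT post.)

```python
# Toy illustration: compare a CDT-style evaluator, which lets the agent's action
# influence another agent (an overseer), with an LCDT-style evaluator, which cuts
# that causal link and predicts the overseer from its prior instead.
# All numbers below are invented for the example.

ACTIONS = ["imitate_evan", "deceive_overseer"]

# P(overseer approves | my action): the causal path that runs *through* another
# agent. Deception only looks attractive if this path is credited to my action.
P_APPROVES_GIVEN_ACTION = {"imitate_evan": 0.6, "deceive_overseer": 0.95}

# Prior P(overseer approves), used once the link into the overseer is severed.
P_APPROVES_PRIOR = 0.6

# Deception carries a small direct cost (extra cognitive work, say).
ACTION_COST = {"imitate_evan": 0.0, "deceive_overseer": 0.05}

def payoff(action: str, approves: bool) -> float:
    return (1.0 if approves else 0.0) - ACTION_COST[action]

def cdt_value(action: str) -> float:
    """Ordinary causal expected value: my action is modeled as influencing the overseer."""
    p = P_APPROVES_GIVEN_ACTION[action]
    return p * payoff(action, True) + (1 - p) * payoff(action, False)

def lcdt_value(action: str) -> float:
    """LCDT-style value: the edge from my action into the other agent is cut, so
    the overseer's reaction is predicted from its prior, independent of my action."""
    p = P_APPROVES_PRIOR
    return p * payoff(action, True) + (1 - p) * payoff(action, False)

for evaluate in (cdt_value, lcdt_value):
    scores = {a: evaluate(a) for a in ACTIONS}
    print(evaluate.__name__, scores, "-> picks", max(scores, key=scores.get))
# cdt_value picks deceive_overseer; lcdt_value picks imitate_evan, because
# manipulating the other agent is given no credit toward the expected payoff.
```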
Your (C), in my opinion, is where all the action is, and is in fact the hardest part of this whole story—which is what I was trying to say in the original post when I said that (2) was the hard part.
Okay, I think I’m getting a little more where you’re coming from? Not sure. Maybe I’ll read the LCDT thing soon (though I’m pretty skeptical of those claims).
(Not sure if it’s useful to say this, but as a meta note, from my perspective the words in the post aren’t pinned down enough to make it at all clear that the hard part is (2) rather than (1); you say “natural” in (1), and I don’t know what you mean by that such that (1) isn’t hard.)
Maybe I’m not emphasizing how unnatural I think (A) is. Like, it’s barely even logically consistent. I know that (A) is logically consistent, for some funny construal of “only trying”, because Evan is a perfect imitation of Evan; and more generally a good WBE could maybe be appropriately construed as not trying to do anything other than imitate Evan; and ideally an FAI could be given an instruction so that it doesn’t, say, have any appreciable impacts other than the impacts of an Evan-imitation. For anything that’s remotely natural and not “shaped” like Evan is “shaped”, I’m not sure it even makes sense to be only trying to imitate Evan; to imitate Evan you have to do a whole lot of stuff, including strategically arranging cognition, reasoning about far-reaching consequences in general, etc., which already constitutes trying to do something other than imitating Evan. When you’re doing consequentialist reasoning, that already puts you very close in algorithm-space to malign strategic thinking, so “consequentialist but not deceptive (hence not malignly consequentialist)” is very unnatural; IMO like half of the whole alignment problem is “get consequentialist reasoning that isn’t consequentialisting towards some random thing”.