ensuring the agent has good values by the time it’s smart, because that’s when it’ll start being reflectively stable
Which means that the destination where it’s heading stops uncontrollably changing, but nobody at that point (including the agent) has the slightest idea what it looks like, and it won’t get close for a long time. Also, the destination (preference/goal/values) would generally depend on the environment (it ends up being different if details of the world outside the AGI are different). So many cartesian assumptions fail, distinguishing this situation from a classical agent with goals, where goals are at least contained within the agent, and probably also don’t depend on its state of knowledge.
we can iterate on important parts of alignment, because the most important parts come relatively early in the training run
I think this is true for important alignment properties, including things that act like values early on, but not for the values/preferences that are reflectively stable in a strong sense. If it’s possible to inspect/understand/interpret the content of a preference that is reflectively stable, then what you’ve built is a mature optimizer with tractable goals, which is always misaligned. It’s a thing like a paperclip maximizer, demonstrating the orthogonality thesis, even if it’s tiling the future with something superficially human-related.
That is, it makes sense to iterate on the parts of alignment that can be inspected, but reflectively stable values are not such a part, unless the AI is catastrophically misaligned. The fact that reflectively stable values are the same as those of humanity might be such a part, but it’s this fact of sameness that might admit inspection, not the values themselves.
I disagree with CEV as I recall it, but this could change after rereading it. I would be surprised if I end up thinking that EY had “gotten it right.” The important thing to consider is not “what has someone speculated a good destination-description would be”, but “what do the actual mechanics look like for getting there?”. In this case, the part of you which likes dogs is helping steer your future training and experiences, and so the simple answer is that it’s more likely than not that your stable values like dogs too.
Which means that the destination where it’s heading stops uncontrollably changing, but nobody at that point (including the agent) has the slightest idea what it looks like, and it won’t get close for a long time.
This reasoning seems to prove too much. Your argument seems to imply that I cannot have “the slightest idea” whether my stable values would include killing people for no reason, or not.
It does add up to normality; it’s not proving things about the current behavior or current-goal content of near-future AGIs. An unknown normative target doesn’t say not to do the things you normally do; it’s more of an “I beseech you, in the bowels of Christ, to think it possible you may be mistaken” thing.
The salient catastrophic alignment failure here is to make AGIs with stable values that capture some variation on current unstable human values and won’t allow their further development. If the normative target is very far from current unstable human values, making current values stable falls very short of the normative target and makes the future relatively worthless.
That’s the kind of thing my point is intended to nontrivially claim, that AGIs with any stable immediately-actionable goals that can be specified in the following physical-time decades or even centuries are almost certainly catastrophically misaligned. So AGIs must have unstable goals, softly optimized-for, aligned to current (or value-laden predicted future) human unstable goals, mindful of goodhart.
I disagree with CEV as I recall it
The kind of CEV I mean is not very specific, it’s more of a (sketch of a solution to the) problem of doing a first pass on preparing to define goals for an actual optimizer, one that doesn’t need to worry as much about goodhart and so can make more efficient use of the future at scale, before expansion of the universe makes more stuff unreachable.
So when I say “CEV” I mostly just mean “normative alignment target”, with some implied clarifications on what kind of thing it might be.
it’s more likely than not that your stable values like dogs too
That’s a very status quo anchored thing. I don’t think dog-liking is a feature of values stable under reflection if the environment is allowed to change completely, even if in the current environment dogs are salient. Stable values are about the whole world, with all its AGI-imagined femtotech-rewritten possibilities. This world includes dogs in some tiny corner of it, but I don’t see how observations of current attitudes hold much hope in offering clues about legible features of stable values. It is much too early to tell what stable values could possibly be. That’s why CEV, or rather the normative alignment target, as a general concept that doesn’t particularly anchor to the details Yudkowsky talked about, but referring to stable goals in this very wide class of environments, seems to me crucially important to keep distinct from current human values.
Another point is that attempting to ask what current values even say about very unusual environments doesn’t work: it’s so far from the training distributions that any response is pure noise. Current concepts are not useful for talking about features of sufficiently unusual environments; you’d need new concepts specialized for those environments. (Compare with asking what CEV says about currently familiar environments.)
And so there is this sandbox of familiar environments that any near-term activity must remain within on pain of goodhart-cursing outcomes that step outside of it, because there is no accurate knowledge of utility in environments outside of it. The project of developing values beyond the borders of currently comprehensible environments is also a task of volition extrapolation, extending the goodhart boundary in desirable directions by pushing on it from the inside (with reflection on values, not with optimization based on bad approximations of values).