Summary
This article claims that:
Unsupervised learning systems will likely learn many “natural abstractions” of concepts like “trees” or “human values”. These abstractions may even end up being represented as simple “feature directions” in the model’s activation space.
One reason to expect this is that, to make good predictions, a system only needs to preserve information that is useful at a distance, and this information can be thought of as a “natural abstraction”.
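To make the “feature direction” picture concrete, here is a minimal sketch (my own illustration, not from the article). It assumes we already have activations from a frozen unsupervised model for inputs where a concept is present versus absent; the toy data below merely simulates such activations. A simple difference-of-means probe then yields a single candidate direction for the concept.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy dimensionality of the model's activation space

# Simulated activations: inputs where the concept (say, "tree") is present
# are shifted along an unknown direction. In practice, these would be real
# activations from a frozen pretrained model, not synthetic data.
true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)
pos = rng.normal(size=(200, d)) + 3.0 * true_direction  # concept present
neg = rng.normal(size=(200, d))                         # concept absent

# Difference-of-means probe: one candidate "feature direction" for the concept.
direction = pos.mean(axis=0) - neg.mean(axis=0)
direction /= np.linalg.norm(direction)

# Projecting activations onto the direction separates the two classes.
threshold = ((pos @ direction).mean() + (neg @ direction).mean()) / 2
accuracy = ((pos @ direction > threshold).mean()
            + (neg @ direction <= threshold).mean()) / 2
print(f"cosine(recovered, true): {direction @ true_direction:.2f}")
print(f"probe accuracy: {accuracy:.2f}")
```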
If an RL system or supervised learner can then use the unsupervised activations to solve a problem, it can directly behave in such a way as to “satisfy the natural abstraction” of, e.g., human values.
This would be a quick way for the model to behave well on such a task. Later on, the model might find an unnatural proxy goal and maximize that instead, but that wouldn’t be the first thing found.
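A hedged sketch of that reuse, under the same toy assumptions: the frozen “unsupervised” features below already contain a direction encoding the target concept, and the downstream learner only fits a linear head on top of them. Reading off the pre-existing abstraction is then the easiest solution for it to find. All data and names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, d = 400, 64

# Assumption: a frozen unsupervised model already encodes the concept that
# the supervised task depends on, along a single direction in feature space.
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)
labels = rng.integers(0, 2, size=n)
features = rng.normal(size=(n, d)) + np.outer(3.0 * labels, concept)

# The downstream learner fits only a linear head on the frozen activations;
# the quickest way to solve the task is to read off the existing abstraction.
head = LogisticRegression(max_iter=1000).fit(features, labels)
weights = head.coef_[0] / np.linalg.norm(head.coef_[0])
print(f"train accuracy: {head.score(features, labels):.2f}")
print(f"cosine(head weights, concept): {weights @ concept:.2f}")
```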
Thus, we may end up with alignment by default. Such an aligned AGI could then be used to align successor AIs, and it might be better at that than humans are, since it’s smarter.
Note that if we have “alignment by default”, then aligning a successor system might work better than with competitive HCH or inverse reinforcement learning (IRL). The reason: humans plus bureaucracies may not themselves be aligned, and IRL produces a utility function, which has the wrong type signature for human values. Either of these may lead to compounding alignment errors when building successor AIs.
Variations of “alignment by default” would, e.g., find the “human values abstraction” in one AI and then “plant it” into the search process of an AGI. I.e., alignment would not be solved by default for all AIs, but it would hold by default often enough that we still win.
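As a toy illustration of the “planting” variation (everything here is a hypothetical stand-in: the extracted direction, the embedding, and the candidate plans), one could take a “human values” direction found in one model, embed an AGI’s candidate plans into the same activation space, and let the search prefer plans that score highly along that direction:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64

# Assumption: this direction was extracted from one AI's activations,
# e.g., with a probe like the one sketched earlier.
value_direction = rng.normal(size=d)
value_direction /= np.linalg.norm(value_direction)

def embed(plan: np.ndarray) -> np.ndarray:
    """Stand-in for mapping a candidate plan into the shared activation space."""
    return plan

# Stand-in for plans proposed by the AGI's search process.
candidate_plans = rng.normal(size=(1000, d))

# "Planting" the abstraction: the search ranks candidates by how well
# their embeddings align with the extracted human-values direction.
scores = np.array([embed(p) @ value_direction for p in candidate_plans])
chosen = candidate_plans[scores.argmax()]
print(f"chosen plan's value score: {scores.max():.2f}")
```

Of course, the hard part in the summarized proposal is extracting the right direction, not this ranking step.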
My Opinion:
Human values are less of “one thing” than trees or humans are. Knowing that “human values are active” in a situation conveys little information, since there are so many different types of human values. Admittedly, something similar applies to trees, but this still feels like it points to a real difficulty.
Overall, this makes me think that a proxy goal may often be a more natural abstraction than human values. My main hope is that proxy goals are specific enough to the RL/supervised task that they did not already appear in the unsupervised training phase.
But there is a reason to doubt this: humans put lots of information about ML training processes into the training data of any unsupervised system, so to make good predictions, the system should probably represent these proxy goals quite well. Only if it does would it make accurate predictions about, e.g., contemporary alignment errors.