In the context of Deceptive Alignment, would the ultimate goal of an AI system appear random and uncorrelated with the objectives of the training distribution from a human perspective? Or would humans be able to recognize the goal as somewhat correlated with those training objectives?
For instance, the article below says that “the model just has some random proxies that were picked up early on, and that’s the thing that it cares about.” To what extent would the model actually learn random proxies?
https://www.lesswrong.com/posts/A9NxPTwbw6r6Awuwt/how-likely-is-deceptive-alignment
If an AI system ends up pursuing ultimate goals such as power or curiosity, those would seem at least loosely correlated with the base objective, regardless of what that objective is.
On the other hand, could it instead learn to pursue a goal completely unrelated to the training distribution, such as mass-producing objects of some peculiar shape?