In the context of Deceptive Alignment, would the ultimate goal of an AI system appear random and uncorrelated with the training objective from a human perspective? Or would humans be able to recognize that the goal is at least somewhat correlated with the objectives of the training distribution?