This post implies that we may be able to extrapolate the log-prob likelihood of deception, situational awareness, or power-seeking behavior in future models.
Do you expect that models’ stated desires for these behaviors are the metrics that we should measure? Or that when models become more powerful, the metric to use will be obvious? I generally agree with the observation that capabilities tend to improve more smoothly than early alignment work predicted, but I am skeptical that we will know what is relevant to measure.
This post implies that we may be able to extrapolate the log-prob likelihood of deception, situational awareness, or power-seeking behavior in future models.
Do you expect that models’ stated desires for these behaviors are the metrics that we should measure? Or that when models become more powerful, the metric to use will be obvious? I generally agree with the observation that capabilities tend to improve more smoothly than early alignment work predicted, but I am skeptical that we will know what is relevant to measure.