#5 appears to be evidence for the hypothesis that, because pretrained foundation models understand human values before they become goal-directed, they are more likely to optimize for human values and less likely to be deceptively aligned.
A conceptual argument for the hypothesis is here: https://forum.effectivealtruism.org/posts/4MTwLjzPeaNyXomnx/deceptive-alignment-is-less-than-1-likely-by-default