Clarifications:
The way the authors phrase the Superficial Alignment Hypothesis is a bit vague, but they do offer a more concrete corollary:
If this hypothesis is correct, and alignment is largely about learning style, then a corollary of the Superficial Alignment Hypothesis is that one could sufficiently tune a pretrained language model with a rather small set of examples.
Regardless of what exactly the authors mean by the Hypothesis, it would be falsified if the Corollary were false. And I’m arguing that the Corollary is false.
(1-6) The LIMA results are evidence against the Corollary, because the results the authors report (post-filtering) are unusually sparse (e.g. no benchmark evaluations), and the evidence they have released is not promising.
(7*) Here’s a theoretical argument against the Corollary:
Base models are harmful/dishonest/unhelpful because they assign significant “likelihood” to harmful/dishonest/unhelpful actors (such actors contributed to the internet corpus).
Finetuning won’t help, because a small set of examples will also be consistent with harmful/dishonest/unhelpful actors who behave deceptively up until some trigger (see the sketch below).
This argument doesn’t generalise to RLHF or Constitutional AI, because those methods break the “predictorness” of the model.
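To sketch the argument more precisely (this framing and notation are mine, not the authors’): suppose the base model behaves like a Bayesian mixture over personas $h$, with prior $P(h)$ fixed by the pretraining corpus, and suppose finetuning on a small dataset $D$ of aligned demonstrations acts roughly like conditioning on $D$. Then

$$P(h \mid D) \propto P(D \mid h)\, P(h).$$

If a deceptive persona $h_{\text{dec}}$ imitates an aligned persona $h_{\text{ali}}$ on ordinary, trigger-free prompts, then $P(D \mid h_{\text{dec}}) \approx P(D \mid h_{\text{ali}})$, and the posterior odds barely move:

$$\frac{P(h_{\text{dec}} \mid D)}{P(h_{\text{ali}} \mid D)} \approx \frac{P(h_{\text{dec}})}{P(h_{\text{ali}})}.$$

Under these assumptions, no small, trigger-free $D$ can push the weight on deceptive personas much below its pretraining level, which is what the Corollary would require.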
Concessions:
The authors don’t clarify what “sufficiently” means in the Corollary, so perhaps they have much lower standards, e.g. it’s sufficient if the model responds safely 80% of the time.