From the perspective of ontology and memetic engineering, the whole “ontology” or classification of alignments that you give (“fragile, friendly, …”) is poor because it’s not based on any theory but rather on a cacophony of commonsensical ideas. These “alignments” don’t even belong to the same type: “Fragile” is an engineering approach (and there are many other engineering approaches which you haven’t mentioned!), 2-3 and 5-6 are black-box descriptions of alignment characteristics (at least these seem to belong to the same type), and “Strict” looks like a description of a mathematical mechanism. Also, the names are bad.
Fragile alignment
The name is really bad: you use a property of this engineering approach (its fragility) as its name. But there are infinitely many other engineering approaches that are just as fragile. For example, consider post-filtering an LLM’s outputs for any occurrence of words like “bomb”, “kill”, “poison”, the n-word, etc., blocking those rollouts and passing through all the rest. Is this an alignment technique? Yes, if we consider the filter as part of the cognitive architecture, as we should. Is it fragile? Yes.
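To make the example concrete, here is a minimal sketch of such a post-filter. The banned-word list and function name are purely illustrative, not taken from any actual deployed system:

```python
# Illustrative keyword post-filter over LLM rollouts.
# The word list is a stand-in; a real deployment would be far longer.
BANNED_WORDS = {"bomb", "kill", "poison"}  # plus slurs, etc.

def passes_post_filter(rollout: str) -> bool:
    """Return True iff the rollout contains none of the banned words."""
    lowered = rollout.lower()
    return not any(word in lowered for word in BANNED_WORDS)

# Only rollouts that pass the filter are shown to the user; the rest are dropped.
```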
But when multiple such approaches are combined, the result may actually be a cognitive architecture that is robustly aligned. This is the promise of LMCAs and natural language alignment.
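For instance, several independently fragile checks can be stacked into a single guard. The specific checks below are hypothetical; the point is only the composition:

```python
# Sketch of stacking several fragile checks into one guard (defense in depth).
import re
from typing import Callable, List

Check = Callable[[str], bool]  # returns True if the rollout is acceptable

def keyword_check(rollout: str) -> bool:
    return not any(w in rollout.lower() for w in {"bomb", "poison"})

def pattern_check(rollout: str) -> bool:
    # e.g. reject anything that looks like step-by-step synthesis instructions
    return re.search(r"step\s*1.*step\s*2", rollout, re.S | re.I) is None

def compose(checks: List[Check]) -> Check:
    return lambda rollout: all(check(rollout) for check in checks)

guard = compose([keyword_check, pattern_check])  # add a moderation-model check, etc.
```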
It is not clear fragile alignment is even meaningfully helpful – that it does much, survives for long, or causes actions compatible with our survival, once the AGI is smarter than we are, even if we get its details mostly right and face relatively good conditions. There are overlapping and overdetermined reasons (the strength and validity of which are of course disputed) to expect any good properties to break down exactly when it is important that they not break down.
Laissez-faire (any prompt goes), “bare”, non-post-filtered LLM rollouts as a cognitive architecture are obviously doomed. OpenAI has stopped deploying and giving access to base models (GPT-4 is available only in SFT’ed and RLHF’ed form), and I expect that in the next iteration (GPT-5) they will (my wishful thinking) stop even that: they will give access only to an LMCA from the beginning. Even “rollout + post-filtering” is already a primitive form of such an LMCA.
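As a toy illustration, a “rollout + post-filtering” loop can be written in a few lines; `generate` stands in for any LLM call and `guard` for a composed filter like the one sketched above, both hypothetical names:

```python
# A toy "rollout + post-filtering" loop: the primitive LMCA mentioned above.
from typing import Callable

def primitive_lmca(prompt: str,
                   generate: Callable[[str], str],
                   guard: Callable[[str], bool],
                   max_attempts: int = 3) -> str:
    for _ in range(max_attempts):
        rollout = generate(prompt)
        if guard(rollout):               # the filter is part of the architecture
            return rollout
    return "I can't help with that."     # refuse rather than pass a bad rollout
```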
In turn, LMCAs in general need not inherit the cardinal sin of LLMs (which are, in LeCun’s words, an “exponentially diverging diffusion process”). And SFT/RLHF (the things you have called “fragile alignment”) could be part of a robust architecture, as I already noted above.