I disagree with the core of this objection.
There are two basic approaches to AI existential safety:
1. Targeting AI systems at values we'd be "happy" (were we fully informed) for them to optimise for [RLHF, IRL, value learning more generally, etc.]
2. Safeguarding systems that may not necessarily be ideally targeted [quantilisation, impact regularisation, boxing, myopia, designing non-agentic systems, corrigibility, etc.]
“Hodge-podge alignment” provides a highly scalable and viable “defense in depth” research agenda for #2.
Your objection seems to assume that #1 is the only viable approach to AI existential safety, but that is far from obvious. Corrigibility is itself an approach under #2, and if I'm not mistaken, Yudkowsky considers it one way of tackling alignment.
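For concreteness, here is a minimal sketch of quantilisation, one of the #2 techniques listed above: instead of taking the argmax of a possibly mis-specified proxy utility, the agent samples from the top-q fraction (by base-distribution mass) of a trusted base policy over actions. The function names, the uniform base policy, and the toy proxy reward are all illustrative assumptions, not taken from the post or from any particular implementation.

```python
import numpy as np

def quantilize(actions, base_probs, proxy_utility, q=0.1, rng=None):
    """Sample an action from the top-q slice (ranked by proxy utility) of a
    trusted base distribution, rather than taking the argmax of the proxy.
    Limiting optimisation pressure this way is the safeguarding idea."""
    rng = rng if rng is not None else np.random.default_rng()
    base_probs = np.asarray(base_probs, dtype=float)
    base_probs = base_probs / base_probs.sum()

    utilities = np.array([proxy_utility(a) for a in actions])
    order = np.argsort(-utilities)  # indices, best proxy score first

    # Walk down the ranking until q of the base probability mass is covered.
    mass, selected = 0.0, []
    for i in order:
        selected.append(i)
        mass += base_probs[i]
        if mass >= q:
            break

    # Sample from the base distribution renormalised over that top-q set.
    top_probs = base_probs[selected] / base_probs[selected].sum()
    idx = rng.choice(selected, p=top_probs)
    return actions[idx]

# Illustrative usage: uniform base policy over 100 discrete actions,
# with a possibly mis-specified proxy reward peaked at action 42.
actions = np.arange(100)
base_probs = np.ones(100) / 100
proxy = lambda a: -(a - 42) ** 2
print(quantilize(actions, base_probs, proxy, q=0.05))
```

The point of the sketch is that even if the proxy is subtly wrong, the chosen action never strays far from what the base policy would plausibly do, which is what makes it a safeguarding (#2) technique rather than a value-targeting (#1) one.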
To Cleo Nardo: you should pick a better name for this strategy. "Hodge-podge alignment" is such terrible marketing.