I disagree with the core of this objection.
There are two basic approaches to AI existential safety:
1. Targeting AI systems at values we'd be "happy" (were we fully informed) for them to optimise for [RLHF, IRL, value learning more generally, etc.]
2. Safeguarding systems that may not necessarily be ideally targeted [quantilisation, impact regularisation, boxing, myopia, designing non-agentic systems, corrigibility, etc.]
“Hodge-podge alignment” provides a highly scalable and viable “defense in depth” research agenda for #2.
Your objection seems to assume that #1 is the only viable approach to AI existential safety, but that is far from obvious. Corrigibility is itself an approach under #2, and if I'm not mistaken, Yudkowsky considers it one way of tackling alignment.
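For concreteness, here is a minimal sketch of quantilisation, one of the #2 techniques listed above: instead of taking the argmax of a possibly mis-specified proxy utility, the agent samples from the top-q fraction (by base-distribution mass) of a trusted base policy over actions. The function names, the uniform base policy, and the toy proxy reward are all illustrative assumptions, not taken from the post or from any particular implementation.

```python
import numpy as np

def quantilize(actions, base_probs, proxy_utility, q=0.1, rng=None):
    """Sample an action from the top-q slice (ranked by proxy utility) of a
    trusted base distribution, rather than taking the argmax of the proxy.
    Limiting optimisation pressure this way is the safeguarding idea."""
    rng = rng if rng is not None else np.random.default_rng()
    base_probs = np.asarray(base_probs, dtype=float)
    base_probs = base_probs / base_probs.sum()

    utilities = np.array([proxy_utility(a) for a in actions])
    order = np.argsort(-utilities)  # indices, best proxy score first

    # Walk down the ranking until q of the base probability mass is covered.
    mass, selected = 0.0, []
    for i in order:
        selected.append(i)
        mass += base_probs[i]
        if mass >= q:
            break

    # Sample from the base distribution renormalised over that top-q set.
    top_probs = base_probs[selected] / base_probs[selected].sum()
    idx = rng.choice(selected, p=top_probs)
    return actions[idx]

# Illustrative usage: uniform base policy over 100 discrete actions,
# with a possibly mis-specified proxy reward peaked at action 42.
actions = np.arange(100)
base_probs = np.ones(100) / 100
proxy = lambda a: -(a - 42) ** 2
print(quantilize(actions, base_probs, proxy, q=0.05))
```

The point of the sketch is that even if the proxy is subtly wrong, the chosen action never strays far from what the base policy would plausibly do, which is what makes it a safeguarding (#2) technique rather than a value-targeting (#1) one.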
To Cleo Nardo: you should pick a better name for this strategy. "Hodge-podge alignment" is such terrible marketing.