> The main problem I see with hodge-podge-style strategies is that most alignment ideas fail in roughly-the-same cases, for roughly-the-same reasons. It’s the same hard cases/hard subproblems which kill most plans. In particular, section B.2 (and to a lesser extent B.1 - B.3) of List of Lethalities covers “core problems” which strategies usually fail to handle.
This may be the case, but well-constructed assemblages should have all the safety guarantees of any of their primitives.
Insofar as assemblages are non-viable, no strategy is viable: if any strategy were viable, it could be included in the assemblage, and the assemblage would then be viable.
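To make the inheritance claim concrete, here is a minimal sketch (my framing, not the post’s) that models each primitive as a veto over actions; under that assumption, the assemblage trivially inherits every “primitive P never permits actions in class C” guarantee:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# A safety primitive, modelled here as a veto: it returns False for any
# action it would never permit. (This modelling choice is mine; primitives
# that reshape a policy rather than veto actions compose differently.)
SafetyPrimitive = Callable[[object], bool]

@dataclass
class Assemblage:
    """Conjunction of safety primitives: an action is permitted only if
    every primitive permits it, so each primitive's veto-style guarantee
    carries over to the assemblage as a whole."""
    primitives: Sequence[SafetyPrimitive]

    def permits(self, action: object) -> bool:
        return all(p(action) for p in self.primitives)
```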
More concretely, this research proposal is a Pareto improvement to the research landscape, and it seems to me it could be a significant one.
In particular:
- I expect that many alignment primitives would be synergistic, e.g.:
  - Quantilisation and impact regularisation (see the first sketch after this list)
  - Boxing and ~everything
- A Swiss-cheese-style approach to safety seems intuitively sensible from an alignment-as-security mindset
  - I.e. assemblages naturally provide security in depth
- It’s a highly scalable agenda in a way most others are not
  - It can exploit the theory overhang: the backlog of proposed alignment primitives that have never been implemented or empirically stress-tested
  - It’s easy to onboard new software engineers and have them produce useful alignment work
- It’s robust to epistemic uncertainty about what the solution is
- Modular alignment primitives are an ideal framework for alignment as a service (AaaS); see the second sketch after this list
  - They are probably going to be easier to incorporate into extant systems
    - This will probably boost adoption compared to bespoke solutions or approaches that require the system to be built from scratch for safety
  - If there’s a convenient alignment-as-a-service offering, then even organisations unwilling to pay the time/development cost of alignment may adopt alignment offerings
    - No one genuinely wants to build misaligned systems, so if we eliminate or minimise the alignment tax, more people will build aligned systems
  - AaaS could solve, or considerably mitigate, the coordination problems confronting the field
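As a first sketch of the synergy point, here is one way (my construction, with assumed parameter names `lam` and `q`) to stack quantilisation on top of impact regularisation in a single decision rule:

```python
import random

def impact_regularised_quantiliser(candidate_actions, utility, impact,
                                   lam=1.0, q=0.1):
    """Combine two primitives in one decision rule:
    - impact regularisation: subtract lam * impact(a) from each action's
      utility, discouraging high-impact plans;
    - quantilisation: rather than argmax (which tends to select extreme,
      Goodharted actions), act randomly from the top q-fraction.
    candidate_actions should be draws from a trusted base distribution,
    so uniform choice over the top fraction approximates a quantiliser.
    """
    scored = sorted(candidate_actions,
                    key=lambda a: utility(a) - lam * impact(a),
                    reverse=True)
    k = max(1, int(q * len(scored)))
    return random.choice(scored[:k])

# Toy usage: reward larger actions, but penalise straying far from zero.
action = impact_regularised_quantiliser(
    candidate_actions=[random.gauss(0, 1) for _ in range(1000)],
    utility=lambda a: a,
    impact=lambda a: a * a,
)
```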
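And a second sketch, for the AaaS point: modular primitives can wrap an extant system at its boundary rather than requiring a rebuild (names here are hypothetical; `Assemblage` is from the earlier sketch):

```python
def as_a_service(model_call, assemblage, fallback="[refused by safety layer]"):
    """Hypothetical AaaS-style wrapper: screen an existing system's outputs
    through an assemblage of primitives. The host keeps its own model_call;
    alignment is bolted on at the boundary, which is what keeps the
    alignment tax low."""
    def wrapped(prompt):
        out = model_call(prompt)
        return out if assemblage.permits(out) else fallback
    return wrapped
```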
I don’t like the hodge-podge hypothesis and think it’s just a distraction, so maybe I’m coming at it from a different angle. As I see it, the case for this research proposal is:
There are two basic approaches to AI existential safety:
1. Targeting AI systems at values we’d be “happy” (were we fully informed) for them to optimise for [RLHF, IRL, value learning more generally, etc.]
2. Safeguarding systems that may not necessarily be ideally targeted [quantilisation, impact regularisation, boxing, myopia, designing non-agentic systems, corrigibility, etc.]
“Hodge-podge alignment” provides a highly scalable and viable “defense in depth” research agenda for #2.