Fair point, I should’ve mentioned alignment by default. That said, even the original post introducing it considers it ~10% likely at best.
I think Wentworth is too pessimistic:
Wentworth’s scheme for alignment by default is not the only route to it
We might get partial alignment by default and strengthen it
There are two approaches to solving alignment:
1. Targeting AI systems at values we’d be “happy” (were we fully informed) for powerful systems to optimise for [AKA intent alignment] [RLHF, IRL, value learning more generally, etc.]
2. Safeguarding systems that are not necessarily robustly intent aligned [Corrigibility, impact regularisation, boxing, myopia, non-agentic systems, mild optimisation, etc.]
We might solve alignment by applying the techniques of 2 to a system that is already somewhat aligned. Such an approach becomes more viable if we get partial alignment by default.
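To make the safeguarding idea a bit more concrete, here is a minimal, purely illustrative sketch (my example, not anything from Wentworth or the post above) of one safeguard from the list, mild optimisation via quantilization: instead of taking the argmax action under a possibly mis-specified utility estimate, the system samples from the top q-fraction of candidate actions, which caps how hard the proxy gets optimised.

```python
import random

def quantilize(actions, utility, q=0.1, rng=random):
    # Rank candidate actions by the (possibly mis-specified) proxy utility.
    ranked = sorted(actions, key=utility, reverse=True)
    # Sample uniformly from the top q-fraction rather than taking the argmax,
    # limiting how far the system pushes into regions where the proxy breaks down.
    top_k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:top_k])

# Toy usage: 100 candidate actions scored by a proxy utility.
actions = list(range(100))
proxy_utility = lambda a: a  # pretend the proxy is only trustworthy for moderate scores
print(quantilize(actions, proxy_utility, q=0.05))
```

The point of the sketch is only that safeguards of this shape do not require the underlying system to be robustly intent aligned; they just limit how hard it optimises.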
More concretely, I currently actually believe (not just pretend to believe) that:
Self-supervised learning on human-generated/curated data will get to AGI first (see the toy sketch below)
Systems trained in such a way may be very powerful while still being reasonably safe from misalignment risks (especially when enhanced with safeguarding techniques), without us mastering intent alignment/being able to target arbitrary AI systems at arbitrary goals
I really do not think this is some edge case, but a way the world can be with significant probability mass.
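To unpack the first point above, here is a toy, purely illustrative sketch of what “self-supervised learning on human-generated data” means: the only training signal is the human-written text itself, with the model learning to predict what comes next. (Real systems do this with large transformers over enormous corpora; this shows the supervision structure, not an implementation.)

```python
from collections import Counter, defaultdict

def train_next_char_model(text):
    # "Self-supervision": the labels are just the next characters of the
    # human-written text itself -- no separate reward signal or annotation.
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    # Predict the most frequent continuation seen for each character.
    return {prev: c.most_common(1)[0][0] for prev, c in counts.items()}

corpus = "humans wrote this text and the model simply learns to predict what comes next"
model = train_next_char_model(corpus)
print(model["t"])  # the most common character following 't' in the toy corpus
```

Because everything such a model learns comes from imitating human-generated data, the argument above is that its learned objectives may land reasonably close to human-compatible ones by default, which is exactly the kind of partial alignment the safeguarding techniques could then shore up.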