Let us suppose that we’ve solved the technical problem of AI Alignment — i.e., the problem of AI control. We have some method of reliably pointing our AGIs towards the tasks or goals we want, such as the universal flourishing of all sapient life. As per the Orthogonality Thesis, no such method would allow us to only point it at universal flourishing — any such method would allow us to point the AGI at anything whatsoever.
Or any similar scheme where we solve the alignment problem via training the AI (e.g. via self-supervised learning) on human-generated/curated data (perhaps with additional safety features).
[Janus’ Simulators covers important differences of self-supervised models in more detail.]
It may be the case that such models are the only generally intelligent systems, but systems trained in such a way do not exhibit strong orthogonality.
And it does not follow that we can, in full generality, point such systems at arbitrary other targets.
I disagree with the first claim, primarily due to the “only” part. I do believe that language modeling might be enough, though I also think certain other paths, like RL, could be enough. It’s just that SSL took off first here.
Well, the argument holds if there’s a meaningful time period in which all general AI systems were trained via self-supervised learning.
This does not hold if we get alignment by default.
Fair point, I should’ve mentioned alignment by default. That said, even the original post introducing it considers it ~10% likely at best.
I think Wentworth is too pessimistic:
Wentworth’s scheme for alignment by default is not the only route to it.
We might get partial alignment by default and strengthen it.
There are two approaches to solving alignment:
1. Targeting AI systems at values we’d be “happy” (were we fully informed) for powerful systems to optimise for [AKA intent alignment] [RLHF, IRL, value learning more generally, etc.]
2. Safeguarding systems that are not necessarily robustly intent aligned [Corrigibility, impact regularisation, boxing, myopia, non-agentic systems, mild optimisation, etc.]
We might solve alignment by applying the techniques of 2 to a system that is somewhat aligned (one such safeguarding technique is sketched below). Such an approach becomes more likely if we get partial alignment by default.
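To make the safeguarding flavour of 2 a bit more concrete, here is a minimal toy sketch of one such technique: mild optimisation in the quantilizer style. Everything in it (the function name mild_optimise, the top_fraction parameter, the stand-in plans and scores) is an illustrative assumption, not anyone’s actual proposal; the idea is simply that, rather than taking the argmax of a proxy utility, the safeguard samples from the top fraction of candidate actions.

```python
import random

def mild_optimise(candidate_actions, proxy_utility, top_fraction=0.1, rng=random):
    """Toy, quantilizer-flavoured 'mild optimisation' (illustrative only).

    Rather than returning the argmax of a possibly mis-specified proxy
    utility (which invites Goodharting), keep only the top `top_fraction`
    of candidates and sample one of them at random.
    """
    ranked = sorted(candidate_actions, key=proxy_utility, reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))  # size of the kept quantile
    return rng.choice(ranked[:keep])                # sample, don't argmax

# Hypothetical usage: choose among the better plans proposed by a
# somewhat-aligned model, without maximising the proxy score outright.
plans = [f"plan_{i}" for i in range(100)]
scores = {p: random.random() for p in plans}        # stand-in proxy scores
print(mild_optimise(plans, scores.get, top_fraction=0.1))
```

The design point is that the wrapper never has to trust the proxy at its optimum; it only has to be roughly right about which candidates are decent, which is the kind of guarantee a somewhat-aligned system seems more likely to give us.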
More concretely, I currently actually believe (not just pretend to believe) that:
1. Self-supervised learning on human-generated/curated data will get to AGI first.
2. Systems trained in such a way may be very powerful while still being reasonably safe from misalignment risks (when enhanced with safeguarding techniques), without us mastering intent alignment or being able to target arbitrary AI systems at arbitrary goals.
I really do not think this is some edge case, but a way the world can be with significant probability mass.