The good news I’ll share is that some of the most important insights about the safety/alignment work done on LLMs do transfer over pretty well to a lot of plausible AGI architectures, so while there’s a little safety loss each time you go from 1 to 4, a lot of the theoretical ways to achieve alignment of these new systems remain intact, though the danger here is that the implementation difficulty pushes the safety tax too high, which is a pretty real concern.
Specifically, the insights I’m talking about are the controllability of AI with data, combined with their feedback on RL being way denser than human RL from evolution, meaning that instrumental convergence is affected significantly.
I think that IF we could solve alignment for level 2 systems, many of the lessons (maybe most? Maybe if we are lucky all?) would transfer to level 4 systems. However, time is running out to do that before the industry moves on beyond level 2.
Let me explain.
Faithful CoT is not a way to make your AI agent share your values, or truly strive towards the goals and principles you dictate in the Spec. Instead, it’s a way to read your AI’s mind, so that you can monitor its thoughts and notice when your alignment techniques failed and its values/goals/principles/etc. ended up different from what you hoped.
So faithful CoT is not by itself a solution, but it’s a property which allows us to iterate towards a solution.
Which is great! But we better start iterating fast because time is running out. When the corporations move on to e.g. level 3 and 4 systems, they will be loathe to spend much compute tinkering with legacy level 2 systems, much less train new advanced level 2 systems.
I mostly just agree with your comment here, and IMO the things I don’t exactly get aren’t worth disagreeing too much about, since it’s more in the domain of a reframing than an actual disagreement.
The good news I’ll share is that some of the most important insights about the safety/alignment work done on LLMs do transfer over pretty well to a lot of plausible AGI architectures, so while there’s a little safety loss each time you go from 1 to 4, a lot of the theoretical ways to achieve alignment of these new systems remain intact, though the danger here is that the implementation difficulty pushes the safety tax too high, which is a pretty real concern.
Specifically, the insights I’m talking about are the controllability of AI with data, combined with their feedback on RL being way denser than human RL from evolution, meaning that instrumental convergence is affected significantly.
I think that IF we could solve alignment for level 2 systems, many of the lessons (maybe most? Maybe if we are lucky all?) would transfer to level 4 systems. However, time is running out to do that before the industry moves on beyond level 2.
Let me explain.
Faithful CoT is not a way to make your AI agent share your values, or truly strive towards the goals and principles you dictate in the Spec. Instead, it’s a way to read your AI’s mind, so that you can monitor its thoughts and notice when your alignment techniques failed and its values/goals/principles/etc. ended up different from what you hoped.
So faithful CoT is not by itself a solution, but it’s a property which allows us to iterate towards a solution.
Which is great! But we better start iterating fast because time is running out. When the corporations move on to e.g. level 3 and 4 systems, they will be loathe to spend much compute tinkering with legacy level 2 systems, much less train new advanced level 2 systems.
I mostly just agree with your comment here, and IMO the things I don’t exactly get aren’t worth disagreeing too much about, since it’s more in the domain of a reframing than an actual disagreement.