Ah, now it makes sense. I was wondering how world model interpretability leads to alignment rather than control. After all, I don’t think you will get far controlling something smarter than you against its will. But alignment of values could scale even with large gaps in intelligence.
In that 2nd phase, there are a few things you can do. E.g. the 2nd-phase reward function could include world model concepts like “virtue”, or you could modify the world model itself before that training.
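To make the first option concrete, here’s a minimal sketch. It assumes a PyTorch setup with a frozen world model whose latent state you can read off, and a hypothetical linear `virtue_probe` that was fit to that latent via interpretability work; the names and the weighting `beta` are illustrative, not a reference implementation.

```python
import torch

def shaped_reward(task_reward: torch.Tensor,
                  latent_state: torch.Tensor,
                  virtue_probe: torch.nn.Linear,
                  beta: float = 0.1) -> torch.Tensor:
    """Second-phase reward: the ordinary task reward plus a bonus
    proportional to the probed 'virtue' concept in the world model's
    latent state (hypothetical probe, assumed already trained)."""
    virtue_score = torch.sigmoid(virtue_probe(latent_state)).squeeze(-1)
    return task_reward + beta * virtue_score

# Usage sketch (world_model is assumed frozen from the 1st phase):
#   latent = world_model.encode(observation)
#   reward = shaped_reward(env_reward, latent, virtue_probe)
```

The point is only that the concept enters the reward through the world model’s own representation, rather than through a hand-written proxy.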