Ah, now it makes sense. I was wondering how world model interpretability leads to alignment rather than control. After all, I don’t think you will get far controlling something smarter than you against its will. But alignment of values could scale even with large gaps in intelligence.
In that 2nd phase, there are a few things you can do. E.g. the 2nd-phase reward function could include world model concepts like “virtue”, or you could modify the world model itself before that training.
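To make the first option concrete, here’s a minimal sketch. It assumes a PyTorch setup with a frozen world model whose latent state you can read off, and a hypothetical linear `virtue_probe` that was fit to that latent via interpretability work; the names and the weighting `beta` are illustrative, not a reference implementation.

```python
import torch

def shaped_reward(task_reward: torch.Tensor,
                  latent_state: torch.Tensor,
                  virtue_probe: torch.nn.Linear,
                  beta: float = 0.1) -> torch.Tensor:
    """Second-phase reward: the ordinary task reward plus a bonus
    proportional to the probed 'virtue' concept in the world model's
    latent state (hypothetical probe, assumed already trained)."""
    virtue_score = torch.sigmoid(virtue_probe(latent_state)).squeeze(-1)
    return task_reward + beta * virtue_score

# Usage sketch (world_model is assumed frozen from the 1st phase):
#   latent = world_model.encode(observation)
#   reward = shaped_reward(env_reward, latent, virtue_probe)
```

The point is only that the concept enters the reward through the world model’s own representation, rather than through a hand-written proxy.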