In my plan, interpretable world-modeling is a key component of Step 1, but my idea there is to build (possibly just by fine-tuning, but still) a bunch of AI modules specifically for the task of assisting in the construction of interpretable world models. In Step 2 we'd throw those AI modules away and construct a completely new AI policy that has no knowledge of the world except via that human-understood world model (no direct access to data, just simulations). This is pretty well covered by your routes 2 and 3 in section 1A, but I worry those points didn't get enough emphasis and people focused more on route 1 there, which seems much more hopeless.
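To make the Step 1 / Step 2 separation concrete, here's a minimal sketch. Everything in it (`InterpretableWorldModel`, `ToyPolicy`, the toy dynamics) is a hypothetical stand-in I'm inventing for illustration, not anything from an existing codebase:

```python
import random

class InterpretableWorldModel:
    """Stands in for the human-understood world model produced in Step 1."""

    def reset(self):
        return {"resources": 5, "done": False}

    def step(self, state, action):
        # Toy dynamics: "gather" adds a resource, anything else spends one.
        resources = state["resources"] + (1 if action == "gather" else -1)
        return {"resources": resources,
                "done": resources <= 0 or resources >= 10}

class ToyPolicy:
    def act(self, state):
        return random.choice(["gather", "spend"])

    def update(self, reward):
        pass  # a real learning update would go here

def train_step2_policy(world_model, policy, reward_fn, episodes=100):
    """Step 2: the policy's only contact with 'the world' is simulated
    rollouts through the interpretable model -- no direct data access."""
    for _ in range(episodes):
        state = world_model.reset()
        while not state["done"]:
            action = policy.act(state)
            state = world_model.step(state, action)
            policy.update(reward_fn(state, action))

train_step2_policy(InterpretableWorldModel(), ToyPolicy(),
                   reward_fn=lambda s, a: s["resources"])
```

The point of the sketch is just the interface boundary: `train_step2_policy` never touches raw data, so everything the policy "knows" about the world is mediated by the human-understood model.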
Ah, now it makes sense. I was wondering how world-model interpretability leads to alignment rather than control. After all, I don't think you will get far controlling something smarter than you against its will. But alignment of values could scale even across large gaps in intelligence.
In that 2nd phase, there are a few things you can do. E.g., the 2nd-phase reward function could include world-model concepts like "virtue", or you could modify the world model before training.
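As a hedged illustration of the first option: assuming the Step 1 world model exposes human-understood concepts through something like an `evaluate_concept` method (a hypothetical interface, not a real API), the 2nd-phase reward could reference those concepts directly:

```python
# Hypothetical: evaluate_concept is an assumed interface by which the
# interpretable world model exposes human-understood concepts to Step 2.
def concept_reward(world_model, state, action):
    """Reward the policy via an interpretable concept ("virtue")
    rather than via raw observations."""
    next_state = world_model.step(state, action)
    return 1.0 if world_model.evaluate_concept("virtue", next_state) else 0.0
```

The second option, modifying the world model before training, would instead edit the model itself (e.g., removing or altering concepts) so the policy never learns about them at all.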