I think I agree. To the extent that a ‘world model’ is an appropriate abstraction, I think the levers to pull for resolving world model mismatches seem to be:
Post-facto: take an already capable (prosaic?) AI and, via some clever training mechanism, train it to explain itself in a way that accounts for world model mismatches, and hope that only accessible consequences matter for preserving human option value; or
Ex-ante: build AI systems in an architecturally transparent manner such that properties of their world model can be inspected and tuned, and hope that the training process makes these AI systems competitive.
I think you are advocating for the latter, or have I misrepresented the levers?
Maybe I don’t see a bright line between these things. Adding an “explaining module” to an existing AI and then doing more training is not so different from designing an AI that has an “explaining module” from the start. And training an AI with an “explaining module” isn’t so different from training an AI with a “making sure internal states are somewhat interpretable” module.
I’m probably advocating something close to “Ex-ante,” but with lots of learning, including learning that informs the AI what features of the world we want it to make interpretable to us.
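(As a rough, purely illustrative sketch of why the line is blurry, assuming a PyTorch-style setup: BaseModel, ExplainerHead, and the "report" targets below are hypothetical stand-ins, not anyone's actual proposal. Whether the explaining module is bolted onto an existing AI or present from the start, the training code collapses to much the same joint objective; the main difference is whether the explanation loss is allowed to shape the base model's internal states.)

```python
# Hypothetical sketch: an "explaining module" attached to a base model.
# Everything here (names, losses, shapes) is illustrative, not a real proposal.

import torch
import torch.nn as nn

class BaseModel(nn.Module):
    """Stands in for an already capable (prosaic) AI."""
    def __init__(self, d_in=32, d_hidden=64, d_out=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.head = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        z = self.encoder(x)          # internal state we might want to be interpretable
        return self.head(z), z

class ExplainerHead(nn.Module):
    """The 'explaining module': maps internal state to a human-legible report."""
    def __init__(self, d_hidden=64, d_report=4):
        super().__init__()
        self.report = nn.Linear(d_hidden, d_report)

    def forward(self, z):
        return self.report(z)

def train_step(model, explainer, x, y, report_target, alpha=0.1, ex_ante=True):
    """Joint objective: task loss plus an explanation loss.

    ex_ante=False ~ 'post-facto': the explanation loss is detached, so it
    trains only the bolted-on explainer, not the base model's internals.
    ex_ante=True  ~ 'ex-ante': the explanation loss also shapes the internal
    states themselves. In code, the difference is a single flag.
    """
    pred, z = model(x)
    task_loss = nn.functional.mse_loss(pred, y)
    z_for_explainer = z if ex_ante else z.detach()
    expl_loss = nn.functional.mse_loss(explainer(z_for_explainer), report_target)
    return task_loss + alpha * expl_loss

# Toy usage with random data, just to show the pieces fit together.
model, explainer = BaseModel(), ExplainerHead()
opt = torch.optim.Adam(list(model.parameters()) + list(explainer.parameters()), lr=1e-3)
x, y, report = torch.randn(16, 32), torch.randn(16, 8), torch.randn(16, 4)
loss = train_step(model, explainer, x, y, report, ex_ante=True)
opt.zero_grad(); loss.backward(); opt.step()
```

Under this framing, "post-facto" versus "ex-ante" shows up as a choice of training flag rather than as two distinct architectures, which is the sense in which there may be no bright line between them.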