Gordon Seidoh Worley comments on World-Model Interpretability Is All We Need

Gordon Seidoh Worley 17 Jan 2023 3:36 UTC
LW: 2 AF: 1
Isn’t the set of goals we would want it to have a special case of “aiming at any target we want”? And wouldn’t whatever goals we’d want it to have be informed by our ontology? So I think there’s a case where the generality of your claim breaks down.
- Thane Ruthenis 17 Jan 2023 6:02 UTC
  LW: 8 AF: 4
  Goals are functions over the concepts in one’s internal ontology, yes. But having a concept for something doesn’t mean caring about it: your knowing what a “paperclip” is doesn’t make you a paperclip-maximizer.
  The idea here isn’t to train an AI with the goals we want from scratch. It’s to train an advanced world-model that would instrumentally represent the concepts we care about, interpret that world-model, and then use it as a foundation to train/build a different agent that would care about these concepts.
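  To make the "goals are functions over concepts" point concrete, here is a toy sketch. Everything in it (the `ontology` dict, `conceptualize`, `choose`, the two goal functions, the numbers) is a hypothetical illustration, not anything from the post: two agents share the same concepts but score them differently, so they pick different outcomes.

```python
from typing import Callable, Dict, List

# Hypothetical toy types: a world state is a dict of raw features,
# a concept maps a state to a number, and a goal scores concept values.
State = Dict[str, float]
Concepts = Dict[str, float]
Goal = Callable[[Concepts], float]

# A shared "ontology": any agent using it can recognize both concepts.
ontology: Dict[str, Callable[[State], float]] = {
    "paperclips": lambda s: s["paperclips"],
    "happy_humans": lambda s: s["happy_humans"],
}

def conceptualize(state: State) -> Concepts:
    """Describe a raw state in terms of the shared concepts."""
    return {name: detect(state) for name, detect in ontology.items()}

# Two different goals defined over the SAME concepts.
def paperclip_goal(c: Concepts) -> float:
    return c["paperclips"]

def human_goal(c: Concepts) -> float:
    return c["happy_humans"]

def choose(goal: Goal, options: List[State]) -> State:
    """Pick the option that the given goal scores highest."""
    return max(options, key=lambda s: goal(conceptualize(s)))

options: List[State] = [
    {"paperclips": 100.0, "happy_humans": 1.0},
    {"paperclips": 2.0, "happy_humans": 50.0},
]
```

  Both `choose(paperclip_goal, options)` and `choose(human_goal, options)` go through the same `conceptualize` step, yet they select different options. Knowing the concept and caring about it are separate pieces, which is roughly why the goal could in principle be swapped out once the world-model is interpreted.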