Yes, I think that learning the user’s instrumental preferences is a good way to get corrigible behavior. I’m hoping to explore the idea of learning an ontology in which instrumental preferences can be represented. There seems to be a spectrum between learning a user’s terminal preferences and learning their actions, with learning instrumental preferences falling between the two.
I’m planning to write up some posts about models for goal-directed value learning. I like your suggestion of presenting the problem so it’s understandable to mainstream researchers; I’ll think about how to do that after writing up the posts.