Yes, I think that learning the user’s instrumental preferences is a good way to get corrigible behavior. I’m hoping to explore the idea of learning an ontology in which instrumental preferences can be represented. There seems to be a spectrum between learning a user’s terminal preferences and learning their actions, with learning instrumental preferences falling between the two.
I’m planning to write up some posts about models for goal-directed value learning. I like your suggestion of presenting the problem so it’s understandable to mainstream researchers; I’ll think about how to do that after writing up the posts.