Re Project 4, you might find my research agenda for deconfusing human values useful. It is semi-abandoned, mostly because I wasn’t, and still am not, in a position to make further progress on it.
This work by Michael Aird and Justin Shovelain might also be relevant: “Using vector fields to visualise preferences and make them consistent”
And I have a post where I demonstrate that reward modeling can extract utility functions from non-transitive preference orderings: “Inferring utility functions from locally non-transitive preferences”
(Extremely cool project ideas btw)