For this approach, like the others, it seems important to make the most progress toward learning human values in a way that doesn’t require a very good model of the world.
I suspect this is impossible in principle, because human values are dependent on our models of the world.
The key is to develop methods that scale, so that values become aligned as the world model approaches human-level capability.
But then there is scope, apparently unexplored so far, for finding morally relevant subsets of value. You don’t have to see everything through the lens of utilitarianism.
Good summary. But concerning your final point:
But then there is scope, apparently unexplored so far, for finding morally relevant subsets of value. You don’t have to see everything through the lens of utilitarianism.