The most obvious weakness is that such an algorithm could easily detect optimization processes that are acting on us, rather than us ourselves (or, if you believe such things exist, you should expect the algorithm might mistakenly locate them instead).
I’ve been thinking about this, and I haven’t found any immediately useful application of your idea, but I’ll keep it in the back of my mind… We haven’t found a good way of identifying agency in the abstract sense (“was cosmic phenomenon X caused by an agent, and if so, which one?” kind of stuff), so this might be a useful simpler problem...
Upon further research, it turns out that preference learning is a field within machine learning, so we can actually try to address this at a much more formal level. That would also get us another benefit: supervised learning algorithms don’t wirehead.
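To make that concrete, here’s a minimal sketch of preference learning as supervised learning: fit a utility function from labelled pairwise comparisons via the Bradley-Terry reduction to logistic regression. The features, weights, and data below are entirely made up for illustration; this is just one standard formulation, not a claim about how any particular system does it.

```python
# Minimal sketch: learn a linear utility u(x) = w . x from pairwise preferences.
# All data here is synthetic and the feature set is invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical "true" utility weights we want to recover from labelled choices.
true_w = np.array([2.0, -1.0, 0.5])

# Pairs of outcomes (feature vectors) plus a label saying which one was preferred.
X_a = rng.normal(size=(500, 3))
X_b = rng.normal(size=(500, 3))
prefer_a = (X_a @ true_w > X_b @ true_w).astype(int)

# Bradley-Terry reduction: P(a preferred to b) = sigmoid(w . (x_a - x_b)),
# so logistic regression on the feature difference recovers w up to scale.
model = LogisticRegression(fit_intercept=False)
model.fit(X_a - X_b, prefer_a)

print("recovered utility direction:", model.coef_[0] / np.linalg.norm(model.coef_[0]))
```

The point of the sketch is just that the learner is scored against fixed labels: it has no way to improve its loss by tampering with the labelling process, which is the sense in which supervised learners don’t wirehead.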
Notably, this fits with our intuition that morality must be “taught” (i.e., via labelled data) to actual human children, lest they simply decide that the Good and the Right consist of eating a whole lot of marshmallows.
And if we put that together with a conservatism heuristic for acting under moral uncertainty (say: optimize for expectedly moral expected utility, thus requiring higher moral certainty before taking more extreme actions), we might just start to make some headway on constructing utility functions that mathematically reflect what their operators actually intend for them to do.
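Here’s a toy sketch of what I mean, purely my own illustration: average each action’s utility over a credence distribution across moral theories, then penalize actions whose utility swings wildly between theories, so that extreme actions need more moral certainty to come out ahead. The theories, utilities, and penalty weight are all invented numbers.

```python
# Toy sketch of "expectedly moral expected utility" with a conservatism penalty.
import numpy as np

# Credence over candidate moral theories (sums to 1).
moral_credence = np.array([0.6, 0.3, 0.1])

# Utility of each action under each theory: rows = actions, cols = theories.
utilities = np.array([
    [ 1.0,  0.8,  0.9],   # mild action: theories roughly agree
    [10.0, -8.0,  2.0],   # extreme action: theories sharply disagree
])

# Expected utility averaged over moral theories.
expected_u = utilities @ moral_credence

# Credence-weighted spread across theories: a rough measure of moral riskiness.
disagreement = np.sqrt(((utilities - expected_u[:, None]) ** 2) @ moral_credence)

risk_aversion = 1.0
score = expected_u - risk_aversion * disagreement

print("expected utility:", expected_u)      # [0.93, 3.8]
print("conservative score:", score)         # mild action wins despite lower raw expectation
```

With the penalty switched on, the extreme action only wins if the credence distribution concentrates on theories that favor it, which is the “higher moral certainty for more extreme actions” behaviour in miniature.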
I also have an idea written down in my notebook, which I’ve been refining, that sort of extends from what Luke had written down here. Would it be worth a post?