If this (or a variant) works out, it would be pretty cool: one could look at an inconsistent agent, make its preferences consistent in the minimally modifying way, and then know what the agent would ideally want.
One further idea that could build on this: we try to learn human preferences, but might worry that we're not learning them at the right level of abstraction. If we know of some preferential inconsistencies that humans reliably exhibit (e.g. the Allais paradox), we can check whether a proposed learned preference reproduces them. If it doesn't, we reject it; if it does, we can then just apply the `make_consistent` algorithm to the learned preference.
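A minimal sketch of what that check might look like, assuming the learned preference is exposed as a scoring function over lotteries; the specific representation, the `exhibits_allais_pattern` helper, and the `make_consistent` stub are hypothetical illustrations, not a definitive implementation:

```python
from typing import Callable, List, Tuple

Lottery = List[Tuple[float, float]]       # (probability, payoff in $M)
Preference = Callable[[Lottery], float]   # higher score = more preferred

# The two choice pairs of the classic Allais setup.
A1: Lottery = [(1.00, 1.0)]                             # $1M for sure
A2: Lottery = [(0.10, 5.0), (0.89, 1.0), (0.01, 0.0)]   # gamble
B1: Lottery = [(0.11, 1.0), (0.89, 0.0)]
B2: Lottery = [(0.10, 5.0), (0.90, 0.0)]

def make_consistent(pref: Preference) -> Preference:
    """Placeholder for the minimally modifying repair step discussed above."""
    raise NotImplementedError

def exhibits_allais_pattern(pref: Preference) -> bool:
    """True if the preference picks A1 over A2 but B2 over B1,
    the pattern that violates the independence axiom."""
    return pref(A1) > pref(A2) and pref(B2) > pref(B1)

def validate_and_repair(pref: Preference) -> Preference:
    """Reject a learned preference that fails to show the known human
    inconsistency; otherwise hand it to the consistency-repair step."""
    if not exhibits_allais_pattern(pref):
        raise ValueError("Learned preference lacks the Allais pattern; "
                         "it may be at the wrong level of abstraction.")
    return make_consistent(pref)
```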