As an aside: you appear to be looking at this from the perspective of an ideal agent[1].
My concern is mostly from the perspective of an (initially at least) non-ideal agent getting attracted to a local optimum.
Do you at least agree that the behavior I'm concerned about is indeed likely a local optimum?
[1] ...which has other problems. Notably, an ideal agent inherently pushes up into the high extreme of the reward distribution, and the tails come apart: for any metric (imperfectly) correlated with what you're actually trying to reward, there comes a point past which the metric is no longer a good guide to the thing you're actually trying to reward.
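As a minimal sketch of the tails-coming-apart point (an illustration added here, not from the post, with all numbers chosen arbitrarily): even if a proxy metric correlates with the true target at roughly r ≈ 0.8 overall, only a small fraction of the very top scorers on the proxy are also top scorers on the true target.

```python
import numpy as np

# Illustrative only: proxy metric and true target are correlated Gaussians
# (r ~= 0.8 overall), yet their extreme tails mostly disagree.
rng = np.random.default_rng(0)
n = 1_000_000
true_target = rng.standard_normal(n)
proxy = 0.8 * true_target + 0.6 * rng.standard_normal(n)  # corr(proxy, true) ~= 0.8

k = n // 10_000  # top 0.01% by each measure
top_by_proxy = set(np.argsort(proxy)[-k:])
top_by_true = set(np.argsort(true_target)[-k:])

print(f"overall correlation: {np.corrcoef(proxy, true_target)[0, 1]:.2f}")
print(f"overlap in the top {k}: {len(top_by_proxy & top_by_true)} of {k}")
```

The point is that a high overall correlation doesn't stop the metric and the target from diverging exactly where an optimizer spends its time: at the extreme.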
Yes, it is absolutely possible that the trust maximizer as described here would end up in a local optimum. This is certainly tricky to avoid. This post is far from a feasible solution to the alignment problem. We’re just trying to point out some interesting features of trust as a goal, which might be helpful in combination with other measures/ideas.