“Total expected trust” is supposed to mean the sum of total trust over time (the area below the curve in fig. 1). This area keeps growing over time and stops growing once everyone is dead (assuming that any useful definition of “trust” excludes dead people), so the AGI would be incentivized to keep humanity alive and even to maximize the number of humans over time. Discounting future trust gives short-term trust a higher weight, so whether deception is optimal depends on this discounting factor, among other things.
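For concreteness, here is one way the objective sketched above could be written down. The notation below (the per-person trust levels T_i(t), the set of living humans H(t), and the discount factor γ) is my own illustration, not something specified in the post:

```latex
% One possible formalization of "total expected trust":
% a discounted sum of trust over time, counting only people who are alive.
%   T_i(t)  : trust that person i places in the AGI at time t (defined as 0 if i is dead)
%   H(t)    : the set of humans alive at time t
%   \gamma  : discount factor, 0 < \gamma <= 1 (\gamma = 1 means no discounting)
U = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{\,t} \sum_{i \in H(t)} T_i(t) \right]
```

With γ close to 1, a long future full of living, trusting humans dominates the sum; with a small γ, a short-term spike in trust can outweigh it, which is where the dependence on the discounting factor comes in.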
As an aside: you appear to be looking at this from the perspective of an ideal agent[1].
My concern is mostly from the perspective of an (initially at least) non-ideal agent getting attracted to a local optimum.
Do you at least agree that the behavior I’m concerned about is indeed likely a local optimum?
[1] ...which has other problems. Notably, an ideal agent inherently pushes up into the high extreme of the reward distribution, and the tails come apart: for any metric (imperfectly) correlated with what you’re actually trying to reward, there comes a point where the metric no longer describes it well.
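One quick way to see this effect is a toy simulation: draw a true target and a proxy metric that are strongly but imperfectly correlated, then look only at the extreme upper tail of the metric (the sample size and the 0.9 correlation are arbitrary choices, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
rho = 0.9  # correlation between the true target and the proxy metric

# A true target and a proxy metric that is strongly but imperfectly correlated with it.
true_value = rng.standard_normal(n)
metric = rho * true_value + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Over the whole population the proxy looks excellent...
print("overall correlation:", np.corrcoef(metric, true_value)[0, 1])

# ...but the sample that maximizes the metric is usually not the one
# that maximizes the true target.
print("same argmax:", np.argmax(metric) == np.argmax(true_value))

# And within the top 0.01% of the metric, the correlation is much weaker:
# the tails have come apart.
top = metric > np.quantile(metric, 0.9999)
print("correlation in the extreme tail:",
      np.corrcoef(metric[top], true_value[top])[0, 1])
```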
Yes, it is absolutely possible that the trust maximizer as described here would end up in a local optimum. This is certainly tricky to avoid. This post is far from a feasible solution to the alignment problem. We’re just trying to point out some interesting features of trust as a goal, which might be helpful in combination with other measures/ideas.
Consider, for instance, the case where the AGI believes that the long-term average change in trust over time is inherently negative.
I don’t think that’s very likely. It is within the trust maximizer’s power to influence the shape of the “trust curve”, in both the honest and the dishonest versions. So in principle it should be able to increase trust over time, or at least prevent a significant decrease (if it stays honest). Even if trust decreases over time, total expected trust would still keep increasing as long as at least a small fraction of people still trusts the machine. So the problem here is not so much that the AI would have an incentive to kill all humans, but that it may have an incentive to switch to deception if that becomes the more effective strategy at some point.
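As a toy illustration of that trade-off (the trust curves below are made up, not derived from anything in the post): a slowly eroding but still positive trust level keeps adding to total expected trust indefinitely, while a deceptive spike stops contributing once trust collapses, so which strategy wins hinges on how strongly future trust is discounted.

```python
import numpy as np

t = np.arange(200)  # time steps over an arbitrary horizon

# Hypothetical honest strategy: trust erodes slowly, but a fraction of people keep trusting.
honest = 0.2 + 0.8 * np.exp(-0.02 * t)

# Hypothetical deceptive strategy: a short-lived trust spike, then collapse once discovered.
deceptive = np.where(t < 20, 1.5, 0.0)

for gamma in (1.0, 0.99, 0.9):  # no, mild, and strong discounting
    w = gamma ** t
    print(f"gamma={gamma}: honest={np.sum(w * honest):6.1f}  "
          f"deceptive={np.sum(w * deceptive):6.1f}")
```

With no or mild discounting the honest curve comes out ahead; only under strong discounting does the deceptive spike win, which matches the point above that the answer depends on the discounting factor.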