Your concern is justified if the trust-maximizer only maximizes short-term trust. This depends on how strongly future cumulative trust is discounted in its goal function. In an ideal goal function, there would be a balance between short-term and long-term trust, so that honesty would pay off in the long term, but there wouldn’t be an incentive to postpone all trust into the far future. This is certainly a difficult balance to strike.
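To make this trade-off concrete, here is a toy sketch (the trajectories and discount factors below are made up for illustration, not taken from the post):

```python
# Toy illustration (not from the post): how a discount factor trades off
# short-term against long-term trust. All numbers are made up.

def discounted_total(trust_per_step, gamma):
    """Sum of gamma**t * trust(t) over the horizon."""
    return sum(gamma**t * trust for t, trust in enumerate(trust_per_step))

# Hypothetical trust trajectories over 50 steps:
honest    = [min(10 + t, 50) for t in range(50)]   # trust grows slowly, then plateaus high
deceptive = [40] * 10 + [5] * 40                   # quick spike, then collapse once the lie is exposed

for gamma in (1.0, 0.95, 0.8):
    print(gamma,
          round(discounted_total(honest, gamma), 1),
          round(discounted_total(deceptive, gamma), 1))
# With gamma close to 1 the honest trajectory wins; with heavy discounting
# the early spike of the deceptive trajectory can dominate.
```

Where exactly the crossover happens obviously depends on the made-up numbers; the point is only that the discount factor sets the balance between the two.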
Hm. Could you please clarify your ‘trust’ utility function? I don’t understand your distinction between short-term and long-term trust in this context. I understand discounting, but don’t see how it helps in this situation.
My issue occurs even with zero discounting, where it is arguably a local maximum that a non-initially-perfect agent could fall into. Any non-zero amount of discounting, meaning the agent weighs short-term rewards more heavily than long-term rewards, would increase the likelihood of this happening, not decrease it (and may very well make it the optimal solution!)
(My reading of the article was that trust is, to be very informal, a ‘bank’ of sorts that can be added to or drawn from. Something along the lines of e.g. reviews, in which case yes, everyone being simultaneously killed would freeze the current rating indefinitely. Note that the situation I describe involves no discounting.)
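To be concrete, a toy version of that reading (all numbers made up; this assumes the frozen rating keeps contributing to an undiscounted running total, which is my reading, not something the post states):

```python
# Toy model of the 'trust as a bank of reviews' reading (my assumption, not the
# post's): the trust reading persists after everyone is dead, and there is no
# discounting, so reward is just the undiscounted sum of the reading over time.

HORIZON = 1000

def cumulative(reading_per_step):
    return sum(reading_per_step)

# Staying honest: trust fluctuates around a decent but imperfect level.
stay_alive = [70 + (5 if t % 2 == 0 else -5) for t in range(HORIZON)]

# Build trust to a peak, then kill everyone at t=100, freezing the rating at its peak.
peak = [70 + min(t, 30) for t in range(100)]
freeze_at_peak = peak + [peak[-1]] * (HORIZON - 100)

print(cumulative(stay_alive))       # ~70 * HORIZON
print(cumulative(freeze_at_peak))   # close to 100 * HORIZON -- the frozen peak wins
# Under this reading, freezing the rating at its maximum is at least a local optimum.
```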
*****
“To reduce this risk, multiple “trust indicators” could be used as reward signals, including the actual behavior of people (for example, how often they interact with the trust-maximizer and whether they follow its recommendations).”
Post-step-3:
0%[1] of people decline to interact with the trust-maximizer.
0%[1] of people decline to follow the recommendations of the trust-maximizer.
100%[2] of people[3] interact with the trust-maximizer.
100%[2] of people[3] follow the recommendations of the trust-maximizer.
Ok, so this is strictly speaking 0/0. That being said, better hope your programmer chose 0/0=1 in this case...
Ok, so this is strictly speaking 0/0. That being said, better hope your programmer chose 0/0=0 in this case...
(Who are alive. That being said, changing this to include dead people has a potential for unintended consequences of its own[4])
E.g. a high birth rate and high death rate being preferable to a medium birth rate and low death rate.
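To spell out the 0/0 issue from the notes above in code (purely illustrative; the `ratio()` helper and its `zero_over_zero` parameter are hypothetical, not anything the post proposes):

```python
# Sketch of the degenerate case from the notes above (illustrative only):
# once nobody is alive, ratio-based trust indicators become 0/0, and what they
# report depends entirely on the convention the implementer picked.

def ratio(numerator, denominator, zero_over_zero=0.0):
    """Ratio-based indicator with an explicit convention for 0/0."""
    if denominator == 0:
        return zero_over_zero
    return numerator / denominator

alive = 0       # post-step-3: everyone is dead
decliners = 0
followers = 0

for convention in (0.0, 1.0):
    decline_rate = ratio(decliners, alive, zero_over_zero=convention)
    follow_rate = ratio(followers, alive, zero_over_zero=convention)
    print(convention, decline_rate, follow_rate)
# With 0/0 := 0, the "decline" indicator reads 0% (looks maximally good);
# with 0/0 := 1, the "follow" indicator reads 100% (looks maximally good).
# Either convention makes at least one indicator look ideal with nobody left.
```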
“Total expected trust” is supposed to mean the sum of total trust over time (the area below the curve in Fig. 1). This area increases with time and can’t increase any further once everyone is dead (assuming that a useful definition of “trust” excludes dead people), so the AGI would be incentivized to keep humanity alive and even to maximize the number of humans over time. By discounting future trust, short-term trust would gain a higher weight. So the question of whether deception is optimal depends on this discounting factor, among other things.
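In symbols (my notation, not the post’s): writing $T(t)$ for the total trust of all living people at time $t$ and $\gamma \le 1$ for a discount factor,

$$\text{total expected trust} \;=\; \int_{0}^{\infty} \gamma^{\,t}\, T(t)\, dt .$$

With $\gamma = 1$ this is exactly the area under the curve in Fig. 1; once everyone is dead, $T(t) = 0$ and the total stops growing, while $\gamma < 1$ shifts the weight toward near-term trust.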
As an aside: you appear to be looking at this from the perspective of an ideal agent[1].
My concern is mostly from the perspective of an (initially at least) non-ideal agent getting attracted to a local optimum.
Do you agree at least that my concern is indeed likely a local optimum in behavior?
...which has other problems. Notably, an ideal agent inherently pushes up into the high extreme of the reward distribution, and the tails come apart. For any metric (imperfectly) correlated with what you’re actually trying to reward, there comes a point where the metric no longer well-describes the thing you’re actually trying to reward.
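A quick numerical illustration of the tails coming apart (a generic toy example of mine, nothing specific to trust):

```python
# Quick numerical illustration of "the tails come apart": a proxy metric that
# correlates well with the true target overall stops tracking it among the
# highest-scoring candidates.

import random

random.seed(0)
N = 100_000
candidates = []
for _ in range(N):
    target = random.gauss(0, 1)                      # what we actually care about
    proxy = 0.8 * target + 0.6 * random.gauss(0, 1)  # correlated measurement (r ~= 0.8)
    candidates.append((proxy, target))

best_by_proxy = max(candidates)                      # the candidate an optimizer would pick
best_by_target = max(candidates, key=lambda c: c[1])

print("target value of the proxy-optimal candidate:", round(best_by_proxy[1], 2))
print("best achievable target value:               ", round(best_by_target[1], 2))
# The proxy-optimal candidate is good but typically well short of the true optimum,
# and the gap widens as optimization pressure (N) grows.
```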
Yes, it is absolutely possible that the trust maximizer as described here would end up in a local optimum. This is certainly tricky to avoid. This post is far from a feasible solution to the alignment problem. We’re just trying to point out some interesting features of trust as a goal, which might be helpful in combination with other measures/ideas.
Consider, for instance, if the AGI believes that the long-term average of change in trust over time is inherently negative.
I don’t think that’s very likely. It is in the power of the trust-maximizer to influence the shape of the “trust curve”, both in the honest and in the dishonest version. So in principle, it should be able to increase trust over time, or at least prevent a significant decrease (if it plays honestly). Even if trust decreases over time, total expected trust would still be increasing as long as at least a small fraction of people still trusts the machine. So the problem here is not so much that the AI would have an incentive to kill all humans, but that it may have an incentive to switch to deception if this becomes the more effective strategy at some point.
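As a toy illustration of that point (assuming, as above, that dead people contribute zero trust and that there is no discounting; all numbers are made up):

```python
# Toy comparison under the assumption that dead people contribute no trust
# and with no discounting (numbers are illustrative only).

HORIZON = 10_000

# Honest play: trust decays toward a small but non-zero floor.
honest = [5 + 95 * (0.999 ** t) for t in range(HORIZON)]

# Killing everyone at t = 100: trust is zero from then on.
kill_at_100 = [5 + 95 * (0.999 ** t) for t in range(100)] + [0] * (HORIZON - 100)

print(round(sum(honest)))       # keeps growing with the horizon (floor of 5 per step)
print(round(sum(kill_at_100)))  # stops accumulating after t = 100
# As long as some residual trust remains, total (undiscounted) trust keeps
# increasing, so extinction is never optimal under this definition; the open
# question is whether deception sustains a higher curve than honesty.
```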