I’m imagining cooperative bargaining between all users, where the disagreement point is everyone dying[1][2] (this is a natural choice assuming that if we don’t build aligned TAI we get paperclips). This guarantees that every user will receive an outcome that’s at least not worse than death.
With Nash bargaining, we can still get issues for (in)famous people whom millions of people want to do unpleasant things to. Their outcome will be better than death, but maybe worse than my claimed “lower bound”.
With Kalai-Smorodinsky bargaining things look better, since essentially we’re maximizing the minimum (normalized gain) over all users. This should preserve my lower bound, unless it is somehow disrupted by enormous asymmetries in the maximal payoffs of different users.
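To make the Nash-vs-KS contrast concrete, here is a toy numeric sketch (a sketch only: the utilities, the three-user setup, and the candidate outcomes are invented, and with three users the “millions of people” effect has to be exaggerated into large per-user gains; the maximin-of-normalized-gains form below is a discrete stand-in for the textbook KS solution, which is defined on a convex feasible set):

```python
# Toy comparison of Nash vs. Kalai-Smorodinsky bargaining over a discrete
# set of candidate outcomes. The disagreement point d is "everyone dies".
# User 2 stands in for the (in)famous person; users 0 and 1 stand in for
# the crowd that would enjoy punishing them (their gain is exaggerated
# because two users have to stand in for millions).
from math import prod

outcomes = [
    (10.0, 10.0, 10.0),  # leave everyone alone
    (40.0, 40.0,  1.0),  # punish user 2 (barely better than death for them)
    (11.0, 11.0,  6.0),  # mild compromise
]
d = (0.0, 0.0, 0.0)      # disagreement point: everyone dies
n = len(d)

feasible = [u for u in outcomes if all(ui >= di for ui, di in zip(u, d))]

# Nash: maximize the product of gains over the disagreement point.
nash = max(feasible, key=lambda u: prod(u[i] - d[i] for i in range(n)))

# Kalai-Smorodinsky (maximin form): normalize each user's gain by the best
# gain they could get in the feasible set, then maximize the minimum.
ideal = tuple(max(u[i] for u in feasible) for i in range(n))
ks = max(feasible, key=lambda u: min((u[i] - d[i]) / (ideal[i] - d[i])
                                     for i in range(n)))

print("Nash:", nash)  # (40.0, 40.0, 1.0): user 2 ends up barely above death
print("KS:  ", ks)    # (11.0, 11.0, 6.0): the minimum normalized gain is maximized
```

On this toy set, Nash happily trades the (in)famous user down to just above death because the crowd’s gains multiply, while the maximin reading of KS picks the compromise.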
In either case, we might need to do some kind of outlier filtering: if e.g. literally every person on Earth is a user, then maybe some of them are utterly insane in ways that cause the Pareto frontier to collapse.
Bargaining assumes we can access the utility function. In reality, even if we solve the value learning problem in the single user case, once you go to the multi-user case it becomes a mechanism design problem: users have incentives to lie / misrepresent their utility functions. A perfect solution might be impossible, but I proposed mitigating this by assigning each user a virtual “AI lawyer” that provides optimal input on their behalf into the bargaining system. In this case they at least have no incentive to lie to the lawyer, and the outcome will not be skewed in favor of users who are better in this game, but we don’t get the optimal bargaining solution either.
All of this assumes the TAI is based on some kind of value learning. If the first-stage TAI is based on something else, the problem might become easier or harder. Easier because the first-stage TAI will produce better solutions to the multi-user problem for the second-stage TAI. Harder because it can allow the small group of people controlling it to impose their own preferences.
For IDA-of-imitation, democratization seems like a hard problem, because the mechanism by which IDA-of-imitation solves AI risk is precisely empowering a small group of people over everyone else (since the source of AI risk is other people launching unaligned TAI). Adding transparency can entirely undermine safety.
For quantilized debate, adding transparency opens us to an attack vector where the AI manipulates public opinion. This significantly lowers the optimization pressure bar for manipulation, compared to manipulating the (carefully selected) judges, which might undermine the key assumption that effective dishonest strategies are harder to find than effective honest strategies.
[1] This can be formalized by literally having the AI consider the possibility of optimizing for some unaligned utility function. This is a weird and risky approach, but it works to a first approximation.
[2] An alternative choice of disagreement point is maximizing the utility of a randomly chosen user. This has advantages and disadvantages.
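A minimal way to write down footnote [1] (my notation, nothing here is from the original): if $\hat{u}$ is the unaligned utility function the AI considers, and $\pi_{\hat{u}}$ is the policy that would optimize it, then the disagreement point gives each user $i$

$$ d_i = \mathbb{E}_{\pi_{\hat{u}}}\left[ U_i \right], $$

and the bargaining solution is constrained to feasible outcomes with $u_i \ge d_i$ for every user, which is the “no worse than death (i.e. paperclips)” guarantee.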
Bargaining assumes we can access the utility function. In reality, even if we solve the value learning problem in the single user case, once you go to the multi-user case it becomes a mechanism design problem: users have incentives to lie / misrepresent their utility functions. A perfect solution might be impossible, but I proposed mitigating this by assigning each user a virtual “AI lawyer” that provides optimal input on their behalf into the bargaining system. In this case they at least have no incentive to lie to the lawyer, and the outcome will not be skewed in favor of users who are better in this game, but we don’t get the optimal bargaining solution either.
Assuming each lawyer has the same incentive to lie as its client, it has an incentive to misrepresent some preferable-to-death outcomes as “worse than death”, in order to force them out of the set of feasible agreements and thereby push the actual outcome toward something its client prefers. At equilibrium, this incentive is balanced by the marginal increase the lie causes in the probability of getting “everyone dies” as the outcome (because the set of feasible agreements becomes empty). So the probability of “everyone dies” in this game has to be non-zero.
(It’s the same kind of problem as in the AI race or the tragedy of the commons: people not taking into account the full social costs of their actions as they reach for private benefits.)
Of course, in actuality everyone dying may not be a realistic consequence of failing to reach agreement. But if the real consequence is better than that, and the AI lawyers know this, they will be more willing to lie, since the perceived downside of lying is smaller; so you end up with a higher chance of no agreement.
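Here is a toy version of that equilibrium argument as a two-lawyer game of chicken (a sketch of the qualitative point only: the payoffs, the two-agreement setup, and the uniform-choice mechanism are all invented, not part of the actual proposal). Each lawyer either truthfully reports that both candidate agreements beat disagreement, or lies that only its client’s favorite does; if both lie, the feasible set is empty and the disagreement outcome happens.

```python
def symmetric_equilibrium(fav=10.0, other=4.0, d=0.0):
    """Mixed equilibrium of the toy lawyer game.

    fav   = your client's utility for their favorite agreement
    other = your client's utility for the other side's favorite agreement
    d     = your client's utility for the disagreement outcome
    Returns (probability a lawyer lies, probability of no agreement).
    """
    both = (fav + other) / 2  # value of the mechanism randomizing between the two
    # A lawyer is indifferent between lying and truth-telling when the opponent
    # lies with probability p:  p*other + (1-p)*both == p*d + (1-p)*fav
    p_lie = (fav - both) / ((fav - both) + (other - d))
    return p_lie, p_lie ** 2  # both lie => the feasible set is empty

for d in (0.0, 3.0):  # disagreement = "everyone dies" vs. something milder
    p_lie, p_no_deal = symmetric_equilibrium(d=d)
    print(f"disagreement utility {d}: P(lie) = {p_lie:.2f}, "
          f"P(no agreement) = {p_no_deal:.2f}")

# disagreement utility 0.0: P(lie) = 0.43, P(no agreement) = 0.18
# disagreement utility 3.0: P(lie) = 0.75, P(no agreement) = 0.56
```

In this toy model the probability of landing at the disagreement point is non-zero at equilibrium, and it goes up when the disagreement point is milder, matching both points above.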
Yes, it’s not a very satisfactory solution. Some alternative/complementary solutions:
Somehow use non-transformative AI to do mind uploading, and then have the TAI learn by inspecting the uploads. This would be great for single-user alignment as well.
Somehow use non-transformative AI to create perfect lie detectors, and use this to enforce honesty in the mechanism. (But, is it possible to detect self-deception?)
Have the TAI learn from past data which wasn’t affected by the incentives created by the TAI. (But, is there enough information there?)
Shape the TAI’s prior about human values in order to rule out at least the most blatant lies.
Some clever mechanism design I haven’t thought of. The problem with this is that most mechanism designs rely on money, and money doesn’t seem applicable here, whereas without money there are many impossibility theorems.
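For what it’s worth, the canonical impossibility in the no-money setting is Gibbard–Satterthwaite (stated here from memory, for deterministic mechanisms over strict preferences):

$$ |\mathcal{O}| \ge 3,\ f \text{ strategyproof and onto } \mathcal{O} \;\Longrightarrow\; f \text{ is a dictatorship}, $$

where $\mathcal{O}$ is the outcome set and $f$ maps reported preference profiles to outcomes. With money and quasilinear utilities, VCG-style mechanisms escape this, which is exactly why “money doesn’t seem applicable” bites.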
In either case, we might need to do some kind of outlier filtering: if e.g. literally every person on Earth is a user, then maybe some of them are utterly insane in ways that cause the Pareto frontier to collapse.
This seems near guaranteed to me: a non-zero number of people will be that crazy (in our terms), so filtering will be necessary.
Then I’m curious about how we draw the line on outlier filtering. What filtering rule do we use? I don’t yet see a good principled rule (e.g. if we want to throw out people who’d collapse agreement to the disagreement point, there’s more than one way to do that).
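To illustrate that there really is more than one way to draw the line, here is one possible (and decidedly unprincipled) rule, sketched in code: greedily drop the user whose removal most improves the best achievable minimum normalized gain, until that best value clears a threshold. The utilities, the threshold, and the greedy rule itself are all invented for illustration.

```python
def best_min_gain(outcomes, users):
    """Best achievable minimum normalized gain over `users` (disagreement = 0)."""
    ideal = {i: max(u[i] for u in outcomes) for i in users}
    return max(min(u[i] / ideal[i] for i in users) for u in outcomes)

def greedy_filter(outcomes, users, threshold=0.5):
    users = set(users)
    while len(users) > 1 and best_min_gain(outcomes, users) < threshold:
        # Drop whichever single user's removal helps the rest the most.
        users.remove(max(users, key=lambda i: best_min_gain(outcomes, users - {i})))
    return users

outcomes = [
    (8.0, 8.0, 8.0, 0.1),  # broadly liked
    (9.0, 7.0, 8.0, 0.1),  # also broadly liked
    (0.1, 0.1, 0.1, 9.0),  # only user 3, the outlier, wants this
]
print(greedy_filter(outcomes, users={0, 1, 2, 3}))  # -> {0, 1, 2}
```

A different, equally defensible rule (say, dropping the fewest users needed to make some fixed welfare target feasible) can keep or exclude different people, which is exactly the worry about where to draw the line.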
Maybe crazy behaviour correlates with less intelligence.
Depending on what we mean by ‘crazy’, I think that’s unlikely, particularly since what we care about here are highly unusual moral stances. I’d see intelligence as a multiplier, rather than something which points you in the ‘right’ direction. Outliers will be at both extremes of intelligence, and I think you’ll get a much wider moral variety at the high end.
For instance, I don’t think you’ll find many low-intelligence antinatalists—and here I mean the stronger, non-obvious claim: not simply that most people calling themselves antinatalists, or advocating for antinatalism will have fairly high intelligence, but rather that most people with such a moral stance (perhaps not articulated) will have fairly high intelligence.
Generally, I think there are many weird moral stances you might think your way into that you’d be highly unlikely to find ‘naturally’ (through e.g. absorption of cultural norms). I’d also expect creativity to positively correlate with outlier moralities. Minds that habitually throw together seven disparate concepts will find crazier notions than those which don’t get beyond three.
First, I think we want to be thinking in terms of [personal morality we’d reflectively endorse] rather than [all the base, weird, conflicting… drivers of behaviour that happen to be in our heads].
There are things most of us would wish to change about ourselves if we could. There’s no sense in baking them in for all eternity (or bargaining on their behalf), just because they happen to form part of what drives us now. [though one does have to be a bit careful here, since it’s easy to miss the upside of qualities we regard as flaws]
With this in mind, reflectively endorsed antinatalism really is a problem: yes, some people will endorse sacrificing everything just to get to a world where there’s no suffering (because there are no people).
Note that the kinds of bargaining approach Vanessa is advocating are aimed at guaranteeing a lower bound for everyone (who’s not pre-filtered out), so you only need to include one person with a particularly weird view to fail to reach a sensible bargain. [Though her most recent version should avoid this.]