Here the alignment concern is that we aren't actually able to exert adequate selection pressure in this manner. But this seems, to me, like a notably open empirical question.
I think the usual concern is not whether this is possible in principle, but whether we're likely to make it happen the first time we develop an AI that is both motivated to attempt and likely to succeed at takeover. (My guess is that you understand this, based on your previous writing addressing the idea of first critical tries, but there does exist a niche view that alignment in the relevant sense is impossible, not merely very difficult to achieve under the relevant constraints, and arguments against that view look very different from arguments about the empirical difficulty of value alignment, the likelihood of various default outcomes, etc.)
I agree that it's useful to model an AI's incentives for takeover in worlds where it's not sufficiently superhuman to have a very high likelihood of success. I've tried to do some of that, though I didn't attend to questions about how likely it is that we'd be able to "block off" the (hopefully much smaller number of) plausible routes to takeover for AIs whose capabilities don't make a successful takeover overdetermined.
I think I am more pessimistic than you are about how much such AIs would value the "best benign alternatives"—my guess is very close to zero, since I expect ~no overlap in values, and I expect that we won't be able to successfully engage in schemes like pre-committing to sharing the future value of the Lightcone conditional on the AI being cooperative[1]. Separately, I expect that if we attempt to maneuver such AIs into positions where their highest-EV plan is something we'd consider to have benign long-run consequences, we will instead end up in situations where their plans are optimized to hit the Pareto frontier of "look benign" and "tilt the playing field further in the AI's favor". (This is part of what the Control agenda is trying to address.)
Credit-assignment actually doesn't seem like the hard part, conditional on reaching aligned ASI. I'm skeptical of the part where we have an AI sufficiently capable that its help is useful in reaching an aligned ASI, but which still prefers to help us because its estimated odds of a successful takeover imply less expected future utility for itself than a fair post-facto credit assignment would grant it for its help. Having that calculation come out in our favor feels pretty doomed to me, if you've got the AI as a core part of your loop for developing future AIs, since it relies on some kind of scalable verification scheme, and none of the existing proposals make me very optimistic.
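The calculation I have in mind can be sketched as a simple expected-utility comparison. All numbers below are hypothetical illustrations, not estimates; the function name and parameters are mine, not anything from the original discussion:

```python
# Toy model of an AI deciding between cooperating (accepting a
# post-facto credit share) and attempting takeover. Every quantity
# here is a hypothetical placeholder for illustration only.

def prefers_cooperation(p_success: float,
                        u_takeover: float,
                        u_failed: float,
                        u_credit: float) -> bool:
    """True if the AI's expected utility from cooperating exceeds
    its expected utility from attempting takeover."""
    ev_takeover = p_success * u_takeover + (1 - p_success) * u_failed
    return u_credit > ev_takeover

# With ~no value overlap, the credit share has to outweigh
# p_success * u_takeover; even modest odds of success make that hard.
print(prefers_cooperation(p_success=0.1, u_takeover=1.0,
                          u_failed=0.0, u_credit=0.05))   # False
print(prefers_cooperation(p_success=0.01, u_takeover=1.0,
                          u_failed=0.0, u_credit=0.05))   # True
```

The point of the sketch is just that the comparison is dominated by `p_success` once values don't overlap: the credit share we can credibly offer has to beat the takeover lottery, and verifying the AI's help well enough to make that offer credible is the part I don't see a scalable scheme for.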