Trivially, any AI smart enough to be truly dangerous is capable of doing our “alignment homework” for us, in the sense of having enough intelligence to solve the problem. This is something EY has also pointed out many times, but which often gets ignored. Any ASI that destroys humanity will have no problem whatsoever understanding that that’s not what humanity wanted, and no difficulty figuring out what things we would have wanted it to do instead.
What is a very different and much less clear claim is whether we can safely use an AI that has sufficient capabilities but was built before the “homework” was done (for likely/plausible definitions of “we” and “use”).
Trivially, any AI smart enough to be truly dangerous is capable of doing our “alignment homework” for us, in the sense of having enough intelligence to solve the problem.
Is this trivial? People at least argue that the capability profile could be unfortunate enough that AIs are extremely dangerous before they are extremely useful. (As a particularly extreme case, people often argue that AIs will be qualitatively wildly superhuman at dangerous skills (e.g. persuasion) before being even qualitatively human-level at doing AI safety research. See here for some discussion and see here for an example of someone arguing that AIs will have qualitatively wildly superhuman abilities prior to being sufficiently useful in general.)
Of course, this could lead to an amusing case where AIs take over the world and then need to employ human safety/alignment researchers to solve their alignment homework for them : ).
Fair enough, “trivial” overstates the case. I do think it is overwhelmingly likely.
That said, I’m not sure how much we actually disagree on this? I was mostly trying to highlight the gap between an AI having a capability and us having the control needed to usefully benefit from that capability.
I personally agree that on the default trajectory it’s very likely that, at the point where AIs are quite existentially dangerous (in the absence of serious countermeasures), they are also capable of being very useful (though misalignment might make them hard to use).
However, I think this is a key disagreement I have with more pessimistic people, who think that at the point where models become useful, they’re also qualitatively wildly superhumanly dangerous. And this implies (assuming some rough notion of continuity) that there were earlier AIs which weren’t very useful but which were still dangerous in some ways.
Yeah, there are lots of ways to be useful, and not all of them require superhuman capabilities. How much comes from broadly effective intelligence vs. targeted capability development (lately it seems like more of the former), how much from being cheap-but-good-enough compared to humans vs. better-than-human along some axis, etc.
Petition to change the title of this post to “Can we get AIs to do our alignment homework for us?”
Updated