Please reword your last idea. Are you saying there is a possible aligned AI that is biased in its research and will ignore people who point out that bias?
I think that section will only make sense if you’re familiar with the concept of differential intellectual progress. The wiki page I linked to is a bit outdated, so try https://concepts.effectivealtruism.org/concepts/differential-progress/ and its references instead.
Reading the link and some reference abstracts, I think my last comment already had that in mind. The idea here is that a certain kind of AI would accelerate a certain kind of progress more than another, because of the approach we used to align it, and on reflection we would not want this. But surely if it is aligned, and therefore corrigible, this should be no problem?
Here’s a toy example that might make the idea clearer. Suppose we live in a world that hasn’t invented nuclear weapons yet, and someone creates an aligned AI that is really good at developing nuclear weapon technology and only a little bit better than humans at everything else. Even though everyone would prefer that nobody develops nuclear weapons, the invention of this aligned AI (if more than one nation has access to it, and “aligned” means aligned to the user) would accelerate the development of nuclear weapons relative to every other kind of intellectual progress and thereby reduce the expected value of the universe.
Does that make more sense now?
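As an aside, here is a minimal sketch of the expected-value reasoning in the toy example above, in Python. Every utility and probability below is made up purely for illustration; the only thing carried over from the example is the qualitative claim that a user-aligned AI with a skewed capability profile can lower the expected value of the future even while serving each user faithfully.

```python
# Minimal sketch of the toy example's expected-value argument.
# All utilities and probabilities are invented for illustration.

def expected_value(p_weapons_developed: float,
                   value_if_developed: float = 20.0,
                   value_if_not: float = 100.0) -> float:
    """Expected value of the future given the probability that nuclear
    weapons end up being developed (made-up utilities)."""
    return (p_weapons_developed * value_if_developed
            + (1 - p_weapons_developed) * value_if_not)

# Baseline world: some chance the technology gets developed anyway.
ev_without_ai = expected_value(p_weapons_developed=0.3)

# World with a user-aligned AI that is far better at weapons research
# than at everything else: each nation, faithfully served by its own AI,
# gets the technology much sooner, so the probability rises.
ev_with_ai = expected_value(p_weapons_developed=0.9)

print(f"expected value without the specialized AI: {ev_without_ai:.1f}")
print(f"expected value with the specialized AI:    {ev_with_ai:.1f}")
# The second number is lower: alignment to each user plus a skewed
# capability profile yields differential progress in a bad direction.
```

The only point of the sketch is that the second expected value comes out lower once the probability of the dangerous technology being developed rises, even though no individual AI disobeys its user.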
So you want to align the AI with us rather than its user by choosing the alignment approach it uses. If it’s corrigible towards its user, won’t it acquire the capabilities of the other approach in short order to better serve its user? Or is retrofitting the other approach also a blind spot of your proposed approach?
If it’s corrigible towards its user, won’t it acquire the capabilities of the other approach in short order to better serve its user?

Yes, that seems like an issue.
Or is retrofitting the other approach also a blind spot of your proposed approach?

That’s one possible solution. Another one might be to create an aligned AI that is especially good at coordinating with other AIs, so that these AIs can make an agreement with each other not to develop nuclear weapons before anyone invents the AI that is especially good at developing nuclear weapons. (But would corrigibility imply that the user can always override such agreements?) There may be other solutions that I’m not thinking of. If all else fails, it may be that the only way to avoid AI-caused differential intellectual progress in a bad direction is to stop the development of AI.
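To make the timing aspect of that coordination idea (and the parenthetical worry about corrigible users overriding agreements) concrete, here is a rough toy model. The arrival years, the `users_can_override` flag, and the decision rule are all invented assumptions for illustration, not anything stated in the thread.

```python
# Toy timing model of the "coordination-first" suggestion. It encodes
# two claims from the comment: (1) the coordination-capable AI has to
# exist before the weapons-capable one for an agreement to help, and
# (2) if corrigibility lets each user override agreements, the agreement
# may not bind even then. All concrete values are made up.

from dataclasses import dataclass

@dataclass
class Scenario:
    coordination_ai_year: int  # when the coordination-specialized AI arrives
    weapons_ai_year: int       # when the weapons-specialized AI arrives
    users_can_override: bool   # does corrigibility let users cancel agreements?

    def outcome(self) -> str:
        if self.coordination_ai_year >= self.weapons_ai_year:
            return "weapons capability arrives first; coordination comes too late"
        if self.users_can_override:
            return "agreement is made, but any user can override it, so it may not hold"
        return "agreement is locked in before the weapons-research AI exists"

for s in [
    Scenario(2030, 2035, users_can_override=False),
    Scenario(2030, 2035, users_can_override=True),
    Scenario(2036, 2035, users_can_override=False),
]:
    print(s, "->", s.outcome())
```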