Going back to the translation example, I can understand your motivation to restrict attention to some subset of all AI techniques. But I think it’s reasonable for people to expect that if you’re aiming to be competitive with a certain kind of AI, you’ll also aim to remain competitive with minor variations of your own design (in this case, forms of iterated amplification that don’t break down tasks into such small pieces). Otherwise, aren’t you “cheating” by letting aligned AIs use AI techniques that their competitors aren’t allowed to use?
To put it another way, people clearly get the impression from you that there’s hope that IDA can simultaneously be aligned and achieve state of the art performance at runtime. See this post where Ajeya Cotra says exactly this:
The hope is that if we use IDA to train each learned component of an AI then the overall AI will remain aligned with the user’s interests while achieving state of the art performance at runtime — provided that any non-learned components such as search or logic are also built to preserve alignment and maintain runtime performance.
But the actual situation seems to be that at best IDA can either be aligned (if you break down tasks enough) or achieve state of the art performance (if you don’t), but not both at the same time.
In general, if you have some useful but potentially malign data source (humans, in the translation example) then that’s a possible problem—whether you learn from the data source or merely consult it.
You have to solve each instance of that problem in a way that depends on the details of the data source. In the translation example, you need to actually reason about human psychology. In the case of SETI, we need to coordinate to not use malign alien messages (or else opt to let the aliens take over).
Otherwise, aren’t you “cheating” by letting aligned AIs use AI techniques that their competitors aren’t allowed to use?
I’m just trying to compete with a particular set of AI techniques. Then every time you would have used those (potentially dangerous) techniques, you can instead use the safe alternative we’ve developed.
If there are other ways to make your AI more powerful, you have to deal with those on your own. That may be learning from human abilities that are entangled with malign behavior in complex ways, or using an AI design that you found in an alien message, or using an unsafe physical process in order to generate large amounts of power, or whatever.
I grant that my definition of the alignment problem would count “learning from a malign data source” as an alignment problem, since you ultimately end up with a malign AI. But that problem occurs with or without AI, and I don’t think it is deceptive to factor it out (though I agree that I should be more careful about the statement, or switch to a more refined statement).
I also don’t think it’s a particularly important problem. And it’s not what people usually have in mind as a failure mode—I’ve discussed this problem with a few people, to try to explain some subtleties of the alignment problem, and most people hadn’t thought about it and were pretty skeptical. So in those respects I think it’s basically fine.
When Ajeya says:
provided that any non-learned components such as search or logic are also built to preserve alignment and maintain runtime performance.
This is meant to include things like “You don’t have a malign data source that you are learning from.” I agree that it’s slightly misleading if we think that humans are such a data source.