This post argues against alignment protocols based on outsourcing alignment research to AI. It makes some good points, but also feels insufficiently charitable to the proposals it’s criticizing.
John makes his case with an analogy to human experts. If you’re hiring an expert in domain X, but you understand little of domain X yourself, then you’re going to have three serious problems:
Illusion of transparency: the expert might say things that you misinterpret due to your own lack of understanding.
The expert might be dumb or malicious, and you will believe them anyway due to your own ignorance.
When the failure modes above occur, you won’t be aware of them and won’t act to fix them.
These points are relevant. However, they don’t fully engage with the main source of hope for outsourcing proponents: the principle that validation is easier than generation[1]. While it’s true that an arbitrary dilettante might not benefit from an arbitrary expert, the fact that it’s easier to comprehend an idea than to invent it yourself means that we can get some value from outsourcing, under some half-plausible conditions.
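To make the asymmetry concrete, here is a toy sketch (my own illustration, not anything from John’s post or the cited principle’s source; the function name and numbers are arbitrary): checking a proposed factorization of a number takes a few multiplications, even though producing the factors yourself can be much harder.

```python
# Toy illustration of "validation is easier than generation":
# verifying a claimed factorization is cheap, even when finding
# the factors yourself would be expensive.

def validate_factorization(n: int, factors: list[int]) -> bool:
    """Cheap check: multiply the proposed factors and compare to n."""
    product = 1
    for f in factors:
        if f <= 1:
            return False
        product *= f
    return product == n

# Factoring a large semiprime is hard; validating a claimed answer
# takes a few multiplications.
print(validate_factorization(15485863 * 32452843, [15485863, 32452843]))  # True
print(validate_factorization(15485863 * 32452843, [3, 5]))                # False
```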
The claim that the “AI expert” can be deceptive and/or malicious is straightforwardly true. I think that the best hope of addressing it would be something like Autocalibrated Quantilized Debate, but that does require some favorable assumptions about the feasibility of deception, and inner alignment remains a problem.
The “illusion of transparency” argument is more confusing IMO. The obvious counterargument is: imagine an AI that is trained not only to produce correct answers but also to explain them in a way that’s as useful as possible for the audience. However, there are two issues with this counterargument:
First, how do we know that the generalization from the training data to the real use case (alignment research) is reliable, given that we cannot reliably test the real use case, precisely because we are alignment dilettantes?
Second, we might be following a poor metastrategy. It is easy to imagine, in the world we currently inhabit, that an AI lab creates a catastrophically unaligned AI, even though they think they care about alignment, just because they are too reckless and overconfident. By the same token, we can imagine such an AI lab consulting their own AI about alignment, and then proceeding with the reckless and overconfident plans suggested by the AI.
In the context of a sufficiently cautious metastrategy, it is not implausible that we can get some mileage out of the outsourcing approach[2]: move one step at a time, spend a lot of time reflecting on the AI’s proposals, and maintain strong guardrails against the possibility of superhuman deception or inner alignment failures (which we currently don’t know how to build!). But without this context, we are indeed liable to become the clients in the satirical video John linked.
[1] I think that John might disagree with this principle. A world in which the principle is mostly false would be peculiar. It would be a world in which marketplaces of ideas don’t work at all, and even if someone fully solves AI alignment they will fail to convince most relevant people that their solution is correct (any more than someone with an incorrect solution would succeed in that). I don’t think that’s the world we live in.
[2] Although currently I consider PSI to be more promising.