This post points out a potential problem for <@Prosaic AI alignment@>, in which we try to align AI systems built using current techniques. Consider some prosaic alignment scheme, such as <@iterated amplification@>(@Learning Complex Goals with Iterated Amplification@) or <@debate@>(@AI safety via debate@). If we try to train an AI system directly using such a scheme, it will likely be uncompetitive, since the most powerful AI systems will probably require cutting-edge algorithms, architectures, objectives, and environments, at least some of which would be replaced by new versions from the safety scheme. Alternatively, we could first train a general AI system, and then use our alignment scheme to finetune it into an aligned AI system. However, this runs the risk that the initial training could create a misaligned mesa optimizer that then deliberately sabotages our finetuning efforts.
Planned opinion:
The comments reveal a third possibility: the alignment scheme could be trained jointly with the cutting-edge AI system. For example, we might hope to train a question answerer that can answer questions about anything “the model already knows”, with this question answering system trained at the same time as the model itself. I think this takes the “oomph” out of the dilemma as posed here: it seems reasonably likely that it only takes fractionally more resources to train a question answering system on top of the model, if it only has to use knowledge “already in” the model, which would let it be competitive, while still preventing mesa optimizers from arising (if the alignment scheme does its job). Of course, it may turn out that it takes a huge amount of resources to train the question answering system, making the system uncompetitive, but that seems hard to predict given our current knowledge.
it seems reasonably likely that it only takes fractionally more resources to train a question answering system on top of the model, if it only has to use knowledge “already in” the model, which would let it be competitive, while still preventing mesa optimizers from arising (if the alignment scheme does its job).
I agree, but it seems to me that coming up with an alignment scheme (for amplification/debate) that “does its job” while preserving competitiveness is an “alignment-hard” problem. I like the OP because I see it as an attempt to reason about how alignment schemes of amplification/debate might work.
Comment on your planned opinion: I mostly agree; I think what this means is that prosaic AI safety depends somewhat on an empirical premise: That joint training doesn’t bring a major competitiveness penalty. I guess I only disagree insofar as I’m a bit more skeptical of that premise. What does the current evidence on joint training say on the matter? I have no idea, but I am under the impression that you can’t just take an existing training process—such as the one that made AlphaStar—and mix in some training tasks from a completely different domain and expect it to work. This seems like evidence against the premise to me. As someone (Paul?) pointed out in the comments when I said this, this point applies to fine-tuning as well. But if so that just means that the second and third ways of the dilemma are both uncompetitive, which means prosaic AI safety is uncompetitive in general.
prosaic AI safety depends somewhat on an empirical premise: That joint training doesn’t bring a major competitiveness penalty.
Yeah, this is why I said:
Of course, it may turn out that it takes a huge amount of resources to train the question answering system, making the system uncompetitive, but that seems hard to predict given our current knowledge.
you can’t just take an existing training process—such as the one that made AlphaStar—and mix in some training tasks from a completely different domain and expect it to work.
From a completely different domain, yeah, that probably won’t work well (though I’d still guess less than an order of magnitude slowdown). But as I understand it, the goal is to train a question answering system that answers questions related to the domain, e.g. for StarCraft you might ask the model questions about the best way to counter a particular strategy, or why it deploys a particular kind of unit in a certain situation. Answering these depends on similar underlying features / concepts as playing StarCraft well, and adding training tasks of this form can often improve performance (see, e.g., One Model To Learn Them All).
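To make "joint training on shared features" concrete, here is a deliberately tiny sketch. It is not the setup from the post or from any real system: the scalar "trunk" weight `w`, the two heads `a` and `b`, the targets, and the `qa_weight` knob are all hypothetical stand-ins. The only point is that a main-task head and a question-answering head can read the same learned feature and be updated together from one weighted sum of losses.

```python
# Toy sketch: one shared "trunk" weight feeds two linear heads.
# Every name and number below is a hypothetical illustration.

XS = [-1.0, -0.5, 0.5, 1.0]            # toy inputs
TASK_TARGETS = [2.0 * x for x in XS]   # stand-in for the main task
QA_TARGETS = [-3.0 * x for x in XS]    # stand-in for domain-related QA


def losses(w, a, b):
    """Mean squared error of each head; both read the shared feature w * x."""
    n = len(XS)
    task = sum((a * w * x - t) ** 2 for x, t in zip(XS, TASK_TARGETS)) / n
    qa = sum((b * w * x - q) ** 2 for x, q in zip(XS, QA_TARGETS)) / n
    return task, qa


def train_jointly(steps=500, lr=0.05, qa_weight=1.0):
    """Gradient descent on task_loss + qa_weight * qa_loss, all weights at once."""
    w, a, b = 1.0, 0.1, -0.1
    n = len(XS)
    for _ in range(steps):
        grad_w = grad_a = grad_b = 0.0
        for x, t, q in zip(XS, TASK_TARGETS, QA_TARGETS):
            h = w * x                      # shared feature
            task_err = a * h - t
            qa_err = b * h - q
            grad_a += 2 * task_err * h / n
            grad_b += qa_weight * 2 * qa_err * h / n
            # The trunk receives gradient from both heads.
            grad_w += (2 * task_err * a * x + qa_weight * 2 * qa_err * b * x) / n
        w -= lr * grad_w
        a -= lr * grad_a
        b -= lr * grad_b
    return w, a, b


before = losses(1.0, 0.1, -0.1)
after = losses(*train_jointly())
print("task loss, QA loss before:", before)
print("task loss, QA loss after: ", after)
```

In this toy both targets depend on the same underlying feature, which is the favorable case described above. It says nothing about how large the overhead would be for a real system, which remains the open empirical question.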
Planned summary for the Alignment newsletter:
Thanks! I endorse that summary.