It seems reasonably likely that it would only take fractionally more resources to train a question-answering system on top of the model, provided it only has to use knowledge "already in" the model. That would let it remain competitive while still preventing mesa-optimizers from arising (assuming the alignment scheme does its job).
I agree, but it seems to me that coming up with an alignment scheme (for amplification/debate) that “does its job” while preserving competitiveness is an “alignment-hard” problem. I like the OP because I see it as an attempt to reason about how alignment schemes of amplification/debate might work.