I think your comment on (1) is too pessimistic about the possibility of stopping the deployment of a misaligned model. That would be a pretty massive result! It would have fairly diverse cultural and policy benefits, so I don’t agree that this is always a losing strategy: after revealing the misalignment to relevant actors and preventing deployment, the space of options grows a lot.
I’m not sure anything I wrote in (2) requires coming close to understanding all the important safety properties of an AI. For example, the grokking work doesn’t explain all the inductive biases of transformers/Adam, but it has still enabled better reasoning about transformers/Adam. Is there something I’m missing?
On (3), I think that rewarding the correct mechanisms in models is basically an extension of process-based feedback. This may be infeasible, or may only be possible while applying lots of optimization pressure to a model’s cognition (which would be worrying for the various List of Lethalities reasons). Are these the reasons you’re pessimistic about this, or is it something else?
I like your writing, but AFAIK you haven’t written up your thoughts on the interp prerequisites for useful alignment. Do you have any writing on this?