(1) might work, but seems like a bad reason to exert a lot of effort. If we’re in a game state where people are building dangerous AIs, stopping one such person so that they can retry building a non-dangerous AI (while hoping nobody else builds a dangerous AI in the meantime) is not really a strategy that works, unless by sheer luck we’re right at the middle of the logistic success curve and have a near-50/50 shot of getting a good AI on the next try.
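(A minimal sketch of the 50/50 point, with purely illustrative symbols: on a logistic curve $P(\text{success}) = \frac{1}{1 + e^{-k(x - x_0)}}$, the probability is only near $\tfrac{1}{2}$ when $x$ is close to the midpoint $x_0$; anywhere else on the curve, one more attempt is either nearly sure to succeed or nearly sure to fail.)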
(2) veers dangerously close to what I call “understanding-based safety,” which is the idea that it’s practical for humans to understand all the important safety properties of an AI, and once they do they’ll be able to modify that AI to make it safe. I think this intuition is wrong. Understanding all the relevant properties is very unlikely even with lots more resources poured into interpretability, and there’s too much handwaving about turning this understanding into safety.
This is also the sort of interpretability that’s most useful for capabilities (in ways independent of how useful it is for alignment), though different people will weigh that cost-benefit tradeoff differently here.
(3) is definitely interesting, but it’s not a way that interpretability actually helps with alignment.
I actually do think interpretability can be a prerequisite for useful alignment technologies! I just think this post represents one part of the “standard view” on interpretability that I on-balance disagree with.
I think your comment on (1) is too pessimistic about the value of stopping deployment of a misaligned model. That would be a pretty massive result! It would have a diverse range of cultural and policy benefits, so I don’t think I agree that this is always a losing strategy: after revealing misalignment to relevant actors and preventing deployment, the space of options grows a lot.
I’m not sure anything I wrote in (2) is close to understanding all the important safety properties of an AI. For example, the grokking work doesn’t explain all the inductive biases of transformers/Adam, but it has still helped people reason better about transformers/Adam. Is there something I’m missing?
On (3), I think that rewarding the correct mechanisms in models is basically an extension of process-based feedback. This may be infeasible, or only possible while applying lots of optimization pressure to a model’s cognition (which would be worrying for the various List of Lethalities reasons). Are these the reasons you’re pessimistic about this, or something else?
I like your writing, but AFAIK you haven’t written up your thoughts on interp prerequisites for useful alignment. Do you have anything on this?