(Some very rough thoughts I sent in DM, putting them up publicly on request for posterity, almost definitely not up to my epistemic standards for posting on LW).
So I think some of the confusion might come from the connotations of word choices. I interpret the importance of adversarial robustness in terms of alignment targets, not properties (the two aren't entirely different, but I don't think they're exactly the same, and they evoke different images in my mind). The naive example here is just Goodharting on outer objectives that aren't aligned at the limit: the optimization pressure comes from an AGI powerful enough to push the objective into the very late stages of a logarithmic curve, and that's where you run into edge cases if the objective isn't adversarially robust. So for outer alignment, you need a true name target. It's worth noting that I think John considers the bulk of the alignment problem to be contained in outer alignment (or whatever the analogue is in your ontology of the problem), hence the focus on adversarial robustness; it's not a term I hear very commonly apart from narrower contexts where its implication is obvious.
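To make the Goodhart point a bit more concrete, here's a minimal toy sketch (my own illustration, not anything from the original conversation): a proxy objective that tracks the true objective over typical inputs but diverges at the extremes, with "optimization pressure" modeled as best-of-n selection on the proxy. The true_value and proxy_value functions are made-up stand-ins, just to show the shape of the failure.

```python
import numpy as np

# Toy Goodhart sketch (illustrative only; both objectives below are made up).
# "Optimization pressure" is modeled as best-of-n selection on the proxy.

rng = np.random.default_rng(0)

def true_value(x):
    # What we actually care about: peaks at x = 1, then falls off.
    return x - 0.5 * x**2

def proxy_value(x):
    # The measured objective: tracks true_value for small x,
    # but keeps rewarding larger x forever.
    return x

for n in [10, 1_000, 100_000, 1_000_000]:
    candidates = rng.normal(0.0, 1.0, size=n)
    best = candidates[np.argmax(proxy_value(candidates))]
    print(f"n={n:>9,}  proxy={proxy_value(best):6.2f}  true={true_value(best):7.2f}")
```

With small n, the proxy maximizer also scores decently on the true objective; crank n up and the selected point drifts into the regime where the two come apart. That's the "edge cases at the limit" worry in miniature.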
With inner alignment, I think it's more confused, because adversarial robustness isn't a term I would really use in that context. I have heard it used by others, though. For example, someone I know is primarily working on designing training procedures that make adversarial robustness less of a problem (think debate). In that context I'm less certain about the inference to draw, because it's pretty far removed from my ontology of the problem, but my read is that it's about removing failure modes where your training process isn't robust at holding to its training goal as things change: with scale, with inner optimizers, and so on. I think he also mentioned it in the context of gradient hacking (although I'm far less sure of my memory here; it was a long conversation). So if I'm remembering correctly, the term gets used pretty broadly on the inner alignment side.
TL;DR: I've mainly heard it used in the context of outer alignment, i.e., the targets you want, which some people think is the bulk of the problem.
I can think of a bunch of different things that could feasibly fall under the term adversarial robustness in inner alignment as well (training processes robust to proxy-misaligned mesa-optimizers, processes robust to gradient hackers, etc.), but using the term that way doesn't really feel intuitive to me; it's like the questions aren't being framed the best way.