There is a disheartening irony to calling this series “Practical AI Safety” and having its longest post be about capabilities advancements that largely ignore safety.
The first part of this post consists of observing that ML applications proceed from metrics, and then arguing that theoretical approaches have been unsuccessful on learning problems. This is true but irrelevant for safety, unless your proposal is to apply ML to safety problems, which reduces AI Safety to ‘just find good metrics for safe behaviour’. That seems as far from a pragmatic understanding of what is needed in AI Safety as one can get.
In the process of dismissing theoretical approaches, you ask “Why do residual connections work? Why does fractal data augmentation help?” These are exactly the kinds of questions we need to be building theory for: not to improve performance, but so that humans understand what is happening well enough to identify potential risks orthogonal to the benchmarks such techniques are improving against, or to trust that such risks are not present.
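To underline how little of that gap is about implementation: a residual connection is a one-line change to a forward pass, yet we still lack a satisfying account of why that one line helps so much. A minimal PyTorch-style sketch of my own (layer choices picked arbitrarily for illustration, not taken from your post):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: the output is x + F(x) rather than F(x) alone."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection is the single `x +` below; empirically it makes
        # deep networks far easier to train, but a theory of *why* is exactly
        # the kind of understanding that safety work needs and benchmarks don't supply.
        return torch.relu(x + self.conv2(torch.relu(self.conv1(x))))
```

The engineering here is trivial; the understanding is not, and it is the understanding that safety depends on.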
You say, “If we want to have any hope of influencing the ML community broadly, we need to understand how it works (and sometimes doesn’t work) at a high level,” and provide similar prefaces as motivation in other sections. I find these claims credible, assuming the “we” refers to AI Safety researchers, but considering the alleged pragmatism of this sequence, it’s surprising to me that none of the claims are followed up with suggested action points. Given the information you have provided, how can we influence this community? By publishing ML papers at NeurIPS? And to what end are you hoping to influence them? AI Safety can attract attention, but attention alone doesn’t translate into progress (or even into more person-hours).
Your disdain for theoretical approaches is transparent here (if it wasn’t already apparent from the name of this sequence). But your reasoning cuts both ways. You say, “Even if the current paradigm is flawed and a new paradigm is needed, this does not mean that [a researcher’s] favorite paradigm will become that new paradigm. They cannot ignore or bargain with the paradigm that will actually work; they must align with it.” I expect that ‘metrics suffice’ (admittedly a strawperson of your favoured paradigm) will not be the paradigm that actually works, and it’s disappointing that your sequence carries the message (on my reading) that technical ML researchers can make significant progress in alignment and safety without really changing what they’re doing.
This is what I was trying to question with my comment above: Why do you think this? How am I to use this information? It’s surely true that this is a community that needs to be convinced of the importance of work on safety, as you point out in the next post in the sequence, but how does information about, say, the turnover of ML PhD students help me do that?
There is a conflation here which undermines your argument: theoretical approaches dominated how machine learning systems were shaped for decades, as you say at the start of this post. It turned out that automated learning produced better results in terms of capabilities, and it is that success which makes it the continued default. But the former fact surely says a lot more about whether theory can “shape machine learning systems” than the latter. Following your argument through, I would instead conclude that implementing theoretical approaches to safety might require us to compromise on capabilities, and that is exactly what I expect: learning systems would have access to much more delicious data if they ignored privacy regulations and other similar ethical boundaries, but safety demands that capability not be the sole consideration shaping AI systems.
This is simply not true. Failure modes identified by purely theoretical arguments have been realised in ML systems. Attacks and pathological behaviours (for image classifiers, say) are regularly constructed in theory before they ever meet real systems. It’s also worth noting that architecture choices, or efforts to, say, make backpropagation more algorithmically efficient, are driven by theory.
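To give one concrete instance: the fast gradient sign method perturbation is derived from a first-order, pen-and-paper approximation of the loss surface, and only then pointed at a real classifier. A minimal PyTorch-style sketch (the `model`, `x`, `label`, and `epsilon` here are placeholders of my own, not anything from your post):

```python
import torch
import torch.nn.functional as F

def fgsm_example(model: torch.nn.Module, x: torch.Tensor, label: torch.Tensor,
                 epsilon: float = 0.03) -> torch.Tensor:
    """Fast gradient sign method: perturb the input in the direction that a
    first-order (i.e. theory-derived) analysis says will most increase the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    # The attack itself is one line of theory-derived arithmetic.
    x_adv = x + epsilon * x.grad.sign()
    return torch.clamp(x_adv, 0.0, 1.0).detach()
```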
In the end, my attitude is not that “iterative engineering practices will never ensure safety”, but rather that there are plenty of people already doing iterative engineering, and that while it’s great to convince as many of those as possible to be safety-conscious, there would be further benefits to safety if some of their experience could be applied to the theoretical approaches that you’re actively dismissing.