Sorry, I am just now seeing this since I’m on here irregularly.
So any robustness work that actually improves the robustness of practical ML systems is going to have “capabilities externalities” in the sense of making ML products more valuable.
Yes, though I do not equate general capabilities with making something more valuable. As written elsewhere,
It’s worth noting that safety is commercially valuable: systems viewed as safe are more likely to be deployed. As a result, even improving safety without improving capabilities could hasten the onset of x-risks. However, this is a very small effect compared with the effect of directly working on capabilities. In addition, hypersensitivity to anything that could hasten the onset of x-risk proves too much. One could claim that any discussion of x-risk at all draws more attention to AI, which could hasten AI investment and the onset of x-risks. While this may be true, it is not a good reason to give up on safety or keep it known to only a select few. We should be precautious but not self-defeating.
I’m discussing “general capabilities externalities” rather than “any bad externality,” especially since the former is measurable and a dominant factor in AI development. (Identifying any sort of externality can lead people to say we should defund various useful safety efforts because they can create a “false sense of security,” which safety engineering reminds us is not the right policy in any industry.)
I disagree even more strongly with “honesty efforts don’t have externalities:” AI systems confidently saying false statements is a major roadblock to lots of applications (e.g. any kind of deployment by Google), so this seems huge from a commercial perspective.
I distinguish between honesty and truthfulness; I think truthfulness has way too many externalities since it is too broad. For example, I think Collin et al.’s recent paper, an honesty paper, does not have general capabilities externalities. As written elsewhere,
Encouraging models to be truthful, when defined as not asserting a lie, may be desired to ensure that models do not willfully mislead their users. However, this may increase capabilities, since it encourages models to have better understanding of the world. In fact, maximally truth-seeking models would be more than fact-checking bots; they would be general research bots, which would likely be used for capabilities research. Truthfulness roughly combines three different goals: accuracy (having correct beliefs about the world), calibration (reporting beliefs with appropriate confidence levels), and honesty (reporting beliefs as they are internally represented). Calibration and honesty are safety goals, while accuracy is clearly a capability goal. This example demonstrates that in some cases, less pure safety goals such as truth can be decomposed into goals that are more safety-relevant and those that are more capabilities-relevant.
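To make that decomposition concrete, here is a toy sketch of my own (not from the quoted passage, with illustrative function names) showing that accuracy and calibration are computed as separate quantities from the same predictions.

```python
# Toy illustration (not from the quoted passage): accuracy and calibration are
# distinct metrics computed from the same set of predictions.
import numpy as np

def accuracy(preds, labels):
    """Capability-flavored: fraction of beliefs about the world that are correct."""
    return float(np.mean(preds == labels))

def expected_calibration_error(confidences, preds, labels, n_bins=10):
    """Safety-flavored: do stated confidence levels match empirical accuracy?"""
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            bin_acc = np.mean(preds[in_bin] == labels[in_bin])
            bin_conf = np.mean(confidences[in_bin])
            ece += in_bin.mean() * abs(bin_acc - bin_conf)
    return float(ece)

# A model can improve on one without the other: making confidences match
# empirical accuracy does not require knowing more about the world, while
# accuracy gains do. Honesty (reporting beliefs as internally represented)
# requires access to the model's internals, so it has no analogous one-liner.
```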
I agree that interpretability doesn’t always have big capabilities externalities, but it’s often far from zero.
To clarify, I cannot name a time a state-of-the-art model drew its accuracy-improving advancement from interpretability research. I think it hasn’t had a measurable performance impact, and anecdotally empirical researchers aren’t gaining insights from that body of work that translate to accuracy improvements. It looks like a reliably beneficial research area.
It also feels like people are using “capabilities” to just mean “anything that makes AI more valuable in the short term,”
I’m taking “general capabilities” to be something like
general prediction, classification, state estimation, efficiency, scalability, generation, data compression, executing clear instructions, helpfulness, informativeness, reasoning, planning, researching, optimization, (self-)supervised learning, sequential decision making, recursive self-improvement, open-ended goals, models accessing the Internet, …
These are extremely general, instrumentally useful capabilities that improve intelligence. (Distinguish these from models that are more honest, power averse, transparent, etc.) For example, ImageNet accuracy is the main general capabilities notion in vision, because it’s extremely correlated with downstream performance on so many things. Meanwhile, an improvement in adversarial robustness harms ImageNet accuracy and just improves adversarial robustness measures. If it so happened that adversarial robustness research became the best way to drive up ImageNet accuracy, then the capabilities community would flood in and work on it, and safety people should then instead work on other things.
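As a concrete illustration of the two metrics being separate, here is a minimal sketch assuming PyTorch and inputs scaled to [0, 1]; the single-step FGSM attack and the 4/255 budget are stand-in choices of mine, not a claim about how the field benchmarks robustness.

```python
# Minimal sketch (assuming PyTorch, inputs in [0, 1]): clean accuracy and an
# adversarial-robustness measure are tracked as two separate numbers.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=4 / 255):
    """One-step FGSM perturbation within an L-infinity ball of radius eps."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

@torch.no_grad()
def correct_count(model, x, y):
    return (model(x).argmax(dim=1) == y).float().sum().item()

def evaluate(model, loader):
    model.eval()
    clean, robust, n = 0.0, 0.0, 0
    for x, y in loader:
        x_adv = fgsm_attack(model, x, y)  # needs gradients, so not under no_grad
        clean += correct_count(model, x, y)
        robust += correct_count(model, x_adv, y)
        n += y.numel()
    # Interventions like adversarial training typically move these two numbers
    # in opposite directions, which is the point being made above.
    return {"clean_acc": clean / n, "robust_acc": robust / n}
```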
Consequently, what counts as safety should be informed by how the empirical results are looking, especially since empirical phenomena can be so unintuitive or hard to predict in deep learning.