I agree that some forms of robustness research don’t have capabilities externalities, but the unreliability of ML systems is a major blocker to many applications. So any robustness work that actually improves the robustness of practical ML systems is going to have “capabilities externalities” in the sense of making ML products more valuable.
I disagree even more strongly with “honesty efforts don’t have externalities:” AI systems confidently saying false statements is a major roadblock to lots of applications (e.g. any kind of deployment by Google), so this seems huge from a commercial perspective.
I agree that interpretability doesn’t always have big capabilities externalities, but it’s often far from zero. Work that sheds meaningful light on what models are actually doing internally seems particularly likely to have such externalities.
In general I think “solve problems that actually exist” is a big part of how the ML community is likely to make progress, and many kinds of safety progress will be addressing problems that people care about today and hence have this kind of capabilities externality.
The safety-capabilities ratio is the criterion of rightness
I think 1 unit of safety and 0 units of capabilities is worse than 10 units of safety and 1 unit of capabilities (where a unit is something taking similar labor to uncover), so I don’t think the ratio is the right criterion; I think it’s more like (safety progress) - X * (timelines acceleration) for some tradeoff rate X.
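Spelling out that arithmetic as a minimal sketch (the only inputs are the formula and the numbers above):

```latex
% Linear criterion: value = (safety progress) - X * (timelines acceleration), for tradeoff rate X.
% Bundle A: 1 unit of safety, 0 units of acceleration.  Bundle B: 10 units of safety, 1 unit of acceleration.
\[
V_A = 1 - X \cdot 0 = 1, \qquad V_B = 10 - X \cdot 1 = 10 - X.
\]
% B beats A whenever X < 9, even though A's safety-to-capabilities ratio is infinite and B's is only 10:1.
```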
Ultimately this seems like it should be a quantitative discussion, and so far I’m just not seeing reasonable-looking back-of-the-envelope calculations (BOTECs) from the safety community supporting an emphasis on capabilities externalities. I’m not a very ends-justify-the-means kind of person, but this seems like an application of deontology in a case where most of the good arguments for deontology don’t apply.
(It also feels like people are using “capabilities” to just mean “anything that makes AI more valuable in the short term,” which I think is a really fuzzy definition for which this argument is particularly inappropriate.)
rather than, say, “pursue particular general capabilities if you expect that it will help reduce risk down the line,”
I’m significantly more sympathetic to the argument that you shouldn’t scale up faster to bigger models (or improve capabilities in other ways) in order to be able to study safety issues sooner. Instead you should focus on the research that works well at the current scale, try to design experiments to detect problems faster, and prepare to do work on more capable models as they become available.
I think this is complicated and isn’t close to a slam dunk, but it’s my best guess and e.g. I find the countervailing arguments from OpenAI and Anthropic unpersuasive.
This is pretty similar to your line about avoiding capability externalities, but I’m more sympathetic to this version because:
The size of the “capability externality” there is significantly higher, since you are deliberately focusing on improving capabilities.
There’s a reasonable argument that you could just wait and do work that requires higher-capability models later. Most of the apparent value of doing that work today comes from picking low-hanging fruit that could just as well be picked tomorrow (it’s still good to do earlier, but there is a systematic incentive to overvalue being the first person to pick the low-hanging fruit). This contrasts with the apparent proposal of indefinitely avoiding work with capabilities externalities, which would mean you never pick the low-hanging fruit in many areas.
For example, an alignment team’s InstructGPT efforts were instrumental in making ChatGPT arrive far earlier than it would have otherwise, which is causing Google to become substantially more competitive in AI and causing many billions to suddenly flow into different AGI efforts.
I think ChatGPT generated a lot of hype because of the way it was released (made available to the general public at OpenAI’s expense). I think Anthropic’s approach is a reasonable model for good behavior here: they trained an extremely similar conversational agent a long time ago, and continued to use it as a vehicle for doing research without generating ~any buzz at all as far as I can tell. That was a deliberate decision, despite the fact that they believed demonstrations would be great for their ability to raise money.
I think you should push back on the step where the lab deliberately generates hype to advantage themselves, not on the step where safety research helps make products more viable. The latter just doesn’t slow down progress very much, and comes with big costs. (It’s not obvious OpenAI deliberately generated hype here, but I think it is a sufficiently probable outcome that it’s clear they weren’t stressing about it much.)
In practice I think if 10% of researchers are focused on safety, and none of them worry at all about capabilities externalities, you should expect them to accelerate overall progress by <1%. Even an extra 10% of people doing capabilities work probably only speeds it up by say 3% (given crowding effects and the importance of compute), and safety work will have an even smaller effect since it’s not trying to speed things up. Obviously that’s just a rough a priori argument and you should look at the facts on the ground in any given case, but I feel like people discussing this in the safety community often don’t know the facts on the ground and have a tendency to overestimate the importance of research (and other actions) from this community.
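To spell out the back-of-the-envelope version of this (the 10%, ~3%, and <1% figures come from the paragraph above; the one-third spillover fraction is purely an illustrative assumption):

```latex
% +10% capabilities researchers => ~3% faster progress (crowding effects; compute matters more than labor).
% Safety researchers are not optimizing for speed, so assume (illustratively) at most ~1/3 of the
% capabilities spillover per person compared to a dedicated capabilities researcher.
\[
\text{acceleration from safety work} \;<\; 3\% \times \frac{1}{3} \;=\; 1\%.
\]
```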
Sorry, I am just now seeing this since I’m on here irregularly.
So any robustness work that actually improves the robustness of practical ML systems is going to have “capabilities externalities” in the sense of making ML products more valuable.
Yes, though I do not equate general capabilities with making something more valuable. As written elsewhere,
It’s worth noting that safety is commercially valuable: systems viewed as safe are more likely to be deployed. As a result, even improving safety without improving capabilities could hasten the onset of x-risks. However, this is a very small effect compared with the effect of directly working on capabilities. In addition, hypersensitivity to anything that could hasten the onset of x-risks proves too much. One could claim that any discussion of x-risk at all draws more attention to AI, which could hasten AI investment and the onset of x-risks. While this may be true, it is not a good reason to give up on safety or keep it known to only a select few. We should be precautious but not self-defeating.
I’m discussing “general capabilities externalities” rather than “any bad externality,” especially since the former is measurable and a dominant factor in AI development. (Identifying any sort of externality can lead people to say we should defund various useful safety efforts because they can lead to a “false sense of security,” which, as safety engineering reminds us, is not the right policy in any industry.)
I disagree even more strongly with “honesty efforts don’t have externalities:” AI systems confidently saying false statements is a major roadblock to lots of applications (e.g. any kind of deployment by Google), so this seems huge from a commercial perspective.
I distinguish between honesty and truthfulness; I think truthfulness has way too many externalities since it is too broad. For example, I think Collin et al.’s recent paper, an honesty paper, does not have general capabilities externalities. As written elsewhere,
Encouraging models to be truthful, when defined as not asserting a lie, may be desired to ensure that models do not willfully mislead their users. However, this may increase capabilities, since it encourages models to have better understanding of the world. In fact, maximally truth-seeking models would be more than fact-checking bots; they would be general research bots, which would likely be used for capabilities research. Truthfulness roughly combines three different goals: accuracy (having correct beliefs about the world), calibration (reporting beliefs with appropriate confidence levels), and honesty (reporting beliefs as they are internally represented). Calibration and honesty are safety goals, while accuracy is clearly a capability goal. This example demonstrates that in some cases, less pure safety goals such as truth can be decomposed into goals that are more safety-relevant and those that are more capabilities-relevant.
I agree that interpretability doesn’t always have big capabilities externalities, but it’s often far from zero.
To clarify, I cannot name a time a state-of-the-art model drew an accuracy-improving advance from interpretability research. I think it hasn’t had a measurable performance impact, and anecdotally, empirical researchers aren’t gaining insights from that body of work that translate into accuracy improvements. It looks like a reliably beneficial research area.
It also feels like people are using “capabilities” to just mean “anything that makes AI more valuable in the short term,”
I’m taking “general capabilities” to be something like:
general prediction, classification, state estimation, efficiency, scalability, generation, data compression, executing clear instructions, helpfulness, informativeness, reasoning, planning, researching, optimization, (self-)supervised learning, sequential decision making, recursive self-improvement, open-ended goals, models accessing the Internet, …
These are extremely general, instrumentally useful capabilities that improve intelligence. (Distinguish these from models that are more honest, power-averse, transparent, etc.) For example, ImageNet accuracy is the main general capabilities notion in vision, because it’s extremely correlated with downstream performance on so many things. Meanwhile, an improvement in adversarial robustness harms ImageNet accuracy and only improves adversarial robustness measures. If it so happened that adversarial robustness research became the best way to drive up ImageNet accuracy, then the capabilities community would flood in and work on it, and safety people should then work on other things instead.
Consequently, what counts as safety should be informed by how the empirical results are looking, especially since empirical phenomena can be so unintuitive or hard to predict in deep learning.
In practice I think if 10% of researchers are focused on safety, and none of them worry at all about capabilities externalities, you should expect them to accelerate overall progress by <1%.
I agree with your other points, but on this one: it looks to me that some of the highest-value ideas come from safety folk. On my model there are some key things that are unusually concentrated among people concerned with AI safety, like any ability to actually visualize AGI, and to seek system designs more interesting than “stack more layers”.
Your early work on human feedback, extrapolated forward by others, seems like a prime example here, at least of a design idea that took off and is looking quite relevant to capabilities progress? And it continues to mostly be pushed forward by safety folk afaict.
I anticipate that the mechanistic interpretability folk may become another example of this, by inspiring and enabling other researchers to invent better architectures (e.g. https://arxiv.org/abs/2212.14052).
Maybe the RL with world models stuff (https://worldmodels.github.io/) is a counterexample, in which non-”safety” folk are trying successfully to push the envelope in a non-standard way. I think they might be in our orbit though.
I agree that safety people have lots of ideas more interesting than stack more layers, but they mostly seem irrelevant to progress. People working in AI capabilities also have plenty of such ideas, and one of the most surprising and persistent inefficiencies of the field is how consistently it overweights clever ideas relative to just spending the money to stack more layers. (I think this is largely down to sociological and institutional factors.)
Indeed, to the extent that AI safety people have plausibly accelerated AI capabilities I think it’s almost entirely by correcting that inefficiency faster than might have happened otherwise, especially via OpenAI’s training of GPT-3. But this isn’t a case of safety people incidentally benefiting capabilities as a byproduct of their work, it was a case of some people who care about safety deliberately doing something they thought would be a big capabilities advance. I think those are much more plausible as a source of acceleration!
(I would describe RLHF as pretty prototypical: “Don’t be clever, just stack layers and optimize the thing you care about.” I feel like people on LW are being overly mystical about it.)
tbc, I don’t feel very concerned by safety-focused folk who are off working on their own ideas. I think the more damaging things are (1) trying to garner prestige with leading labs and the AI field by trying to make transformative ideas work (which I think is a large factor in ongoing RLHF efforts?); and (2) trying to “wake up” the AI field into a state of doing much more varied stuff than “stack layers”.