Oh, I definitely do. For instance, the boat race example turned out to be a minor warning shot about the dangers of getting the reward function wrong (though I don’t really understand why it was so influential; it seems so clear that an incorrect reward function can lead to bad behavior).
Interesting, perhaps this is driving our disagreement—I might just have higher standards than you for what counts as a warning shot. I was thinking that someone would have to die or millions of dollars would have to be lost. Because I was thinking warning shots were about “waking up” people who are insensitive to the evidence, rather than about providing evidence that there is a danger—I am pretty confident that evidence of danger will abound. Like, the boat race example is already evidence that AIs will be misaligned by default and that terrible things will happen if we deploy powerful unaligned AIs. But it’s not enough to wake most people up. I think it’ll help to have more and more examples like the boat race, with more and more capable and human-like AIs, but something that actually causes lots of harm would be substantially more effective. Anyhow, that’s what I think of when I think about warning shots—so maybe we don’t disagree that much after all.
Idk, I’m imagining “what would it take to get the people in power to care”, and it seems like the answer is:
For politicians, a consensus amongst experts + easy-to-understand high-level explanations of what can go wrong
For experts, a consensus amongst other experts (+ common knowledge of this consensus), or sufficiently compelling evidence, where what counts as “compelling” varies by expert
I agree that things that actually cause lots of harm would be substantially more effective as compelling evidence, but I don’t think they’re necessary. When I evaluate whether something is a warning shot, I’m mostly thinking about “could this create consensus amongst experts”; I think things that are caught during training could certainly do that.
Like, the boat race example is already evidence that AIs will be misaligned by default and that terrible things will happen if we deploy powerful unaligned AIs.
It’s evidence, yes, but it’s hardly strong evidence. Many experts’ objection is “we won’t get to AGI in this paradigm”; I don’t think the boat race example is ~any evidence that we couldn’t have AIs with “common sense” in a different paradigm. In my experience, people who do think we’ll get to AGI in the current paradigm usually agree that misalignment would be really bad, such that they “agree with safety concerns” according to the definition here.
I also don’t think that it was particularly surprising to people who do work with RL. For example, from Alex Irpan’s post Deep RL Doesn’t Work Yet:
To be honest, I was a bit annoyed when [the boat racing example] first came out. This wasn’t because I thought it was making a bad point! It was because I thought the point it made was blindingly obvious. Of course reinforcement learning does weird things when the reward is misspecified! It felt like the post was making an unnecessarily large deal out of the given example.
Then I started writing this blog post, and realized the most compelling video of misspecified reward was the boat racing video. And since then, that video’s been used in several presentations bringing awareness to the problem. So, okay, I’ll begrudgingly admit this was a good blog post.
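For concreteness, here’s a minimal toy sketch of the kind of reward misspecification in question. The track, the bonus cell, and the reward numbers are all invented for illustration; this isn’t the actual boat race environment, just the same qualitative failure in miniature.

```python
# Toy sketch of reward misspecification (all details here are made up for
# illustration, not taken from the boat race setup). The *intended* goal is to
# reach the end of a short track; the *specified* reward pays out every time a
# "bonus" cell is entered, plus a small reward for finishing. A reward-maximizing
# policy therefore loops on the bonus cell forever instead of finishing.

TRACK_LEN = 6      # cells 0..5; cell 5 is the finish line
BONUS_CELL = 2     # entering this cell pays reward every single time
MAX_STEPS = 30


def step(pos, action):
    """Move left (-1) or right (+1) along the track, clipped to its ends."""
    new_pos = max(0, min(TRACK_LEN - 1, pos + action))
    if new_pos == BONUS_CELL:
        reward = 10.0          # misspecified proxy reward
    elif new_pos == TRACK_LEN - 1:
        reward = 1.0           # small reward for actually finishing
    else:
        reward = 0.0
    return new_pos, reward


def run_episode(policy):
    pos, total, finished = 0, 0.0, False
    for _ in range(MAX_STEPS):
        pos, r = step(pos, policy(pos))
        total += r
        if pos == TRACK_LEN - 1:
            finished = True
            break
    return total, finished


# Intended behaviour: just drive to the finish line.
go_to_finish = lambda pos: +1

# Reward-hacking behaviour: oscillate around the bonus cell forever.
loop_on_bonus = lambda pos: +1 if pos < BONUS_CELL else -1

for name, policy in [("go to finish", go_to_finish), ("loop on bonus", loop_on_bonus)]:
    total, finished = run_episode(policy)
    print(f"{name:14s} reward = {total:6.1f}  finished = {finished}")
# The looping policy earns far more of the specified reward while never
# completing the track, which is the same qualitative failure as in the video.
```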
I feel like “warning shot” is a bad term for the thing you’re pointing at, since to me a warning shot evokes a sense of actual harm/danger. Maybe a canary or a wake-up call or something?
Hmm, that might be better. Or perhaps I should not give it a name and just call it “evidence”, since that’s the broader category and I usually only care about the broad category and not specific subcategories.
Thanks for this explanation—I’m updating in your direction re what the appropriate definition of warning shots is (and thus the probability of warning shots), mostly because I’m deferring to your judgment as someone who talks more regularly to more AI experts than I do.