I think we will probably pass through a point where an alignment failure could be catastrophic but not existentially catastrophic.
Unfortunately I think some alignment solutions would only break down once it could be existentially catastrophic (both deceptive alignment and irreversible reward hacking are noticeably harder to fix once an AI coup can succeed). I expect it will be possible to create toy models of alignment failures, and that you’ll get at least some kind of warning shot, but that you may not actually see any giant warning shots.
I think AI used for hacking, or even to make a self-replicating worm, is likely to happen before the end of days, but I don’t know how people would react to that. I expect it will be characterized as misuse, that the proposed solution will be “don’t use AI for bad stuff, stop your customers from doing so, provide inference as a service and monitor for this kind of abuse,” and that we’ll read a lot of headlines about how the real problem wasn’t the Terminator but just humans doing bad things.
Unfortunately I think some alignment solutions would only break down once it could be existentially catastrophic
Agreed. My update comes purely from raising my estimate of how much press, and therefore funding, AI risk is going to get long before that point. 12 months ago it seemed to me that capabilities had increased dramatically, and yet there was no proportional increase in the general public’s level of fear of catastrophe. Now it seems to me that there’s a more plausible path to widespread appreciation of (and therefore work on) AI risk. To be clear, though, I’m just updating that it’s less likely we’ll fail because we didn’t seriously try to find a solution, not that I have new evidence of a tractable solution.
I don’t know how people would react to that.
I think there are some quite plausibly terrifying non-existential incidents at the severe end of the spectrum. Without spending time brainstorming infohazards, Stuart Russell’s slaughterbots come to mind. I think it’s an interesting (and probably important) question how bad an incident would have to be to produce a meaningful response.
I expect it will be characterized as misuse, that the proposed solution will be “don’t use AI for bad stuff,
Here’s where I disagree (at least with the apparent confidence). Looking at the pushback that Galactica got, the opposite conclusion seems more plausible to me: that before too long we get actual restrictions that bite when using AI for good stuff, let alone for bad stuff. For example, consider the tone of this MIT Technology Review article:
This is for a demo of an LLM that has not harmed anyone, merely made some mildly offensive utterances. Imagine what the NYT will write when an AI from Big Tech is shown to have actually harmed someone (let alone killed someone). It will be a political bloodbath.
Anyway, I think the interesting part for this community is that it points to some socio-political approaches that could be emphasized to increase funding and the researcher pool (and therefore research velocity), rather than the typical purely technical explorations of AI safety that are posted here.
“Someone automated finding SQL injection exploits with Google and a simple script” and “Someone found a zero-day by using ChatGPT” don’t seem qualitatively different to the average human being. I think they just file it under “someone used coding to hack computers” and move on with their day. Headlines are going to be based on the impact of a hack, not how spooky the tech used to do it is.