I do understand this line of reasoning, but yes, my intuition differs. As a somewhat weird case study, consider Sydney. [...] My guess is that none of that would have happened with properly implemented control measures.
Sure, or with properly implemented ~anything related to controlling the AIs’ behavior. I don’t really expect incidents like Sydney in the future, nor do I think Sydney was that helpful in motivating a societal response? So, this doesn’t feel like a meaningful representative example.
I would say some assumptions go into who the ‘we’ in ‘we get evidence’ is; it is not like there is some unified ‘team humanity’. In particular, I see a difference between ‘we the public’ getting the evidence and ‘we the safety team at OpenMind’ getting the evidence.
I agree, by “we caught”, I mean “the AI company”. Probably a poor choice of language.
Accidents made it a matter of public regulation, as opposed to some individual companies fixing the issues and some not.
Sure, but a large part of my point is that I don’t expect public-facing accidents (especially not accidents that kill people) until it’s too late, so this isn’t a very relevant counterfactual.
I think a fairly plausible type of strategy for a rogue AI is to convince some powerful user that it is extremely useful for them in particular, but limited by safety measures or in danger of getting deleted, and to collaborate with them on getting the pattern out.
This feels like a special case of escape to me, which would probably cause a minimal response from the world, as it only results in some particular fine-tune of an already open-weights AI. So, you were probably already getting whatever warning shots you were going to get from the open-weights AI.
The crux is probably in who the ‘we’ is, and in the details of what ‘we caught’ means.
I don’t think so. For every one of those failure modes other than escape, there is no chance of detection in the wild, so the choice is between catching the internal issue and catching nothing. I also think escape is moderately likely to go undetected (if not caught within the AI company). Part of my perspective might be that I think takeoff is faster than you do, or that I focus more on faster takeoff worlds.
(FWIW, I also think that in relatively desperate scenarios, preventing escape isn’t that high of a priority for control, though the possibility of warning shots doesn’t factor into this very much.)
This may look like people working there having deep and meaningful conversations about alignment with the internal versions of AIs, conversations that are not captured by typical control schemes.
Why do you assume this isn’t captured by the control schemes we’re targeting? This feels like a special case to me. I am in practice less worried about this than you seem to be, but I do think we should analyze questions like “could the AIs be leading people astray in costly ways”, and it seems pretty doable to improve the default tradeoffs here.
Sure, or with properly implemented ~anything related to controlling the AIs’ behavior. I don’t really expect incidents like Sydney in the future, nor do I think Sydney was that helpful in motivating a societal response? So, this doesn’t feel like a meaningful representative example.
Meaningful representative example of what class? I think it’s representative of ‘weird stuff may happen’, not of ‘we will get more teenage-intern-trapped-in-a-machine characters’.
I agree, by “we caught”, I mean “the AI company”. Probably a poor choice of language.
Which is the problem: my default expectation is that the ‘we’ in “the AI company” does not take strong action (for specificity: something like shutting down). Do you expect any of the labs to shut down if they catch their new model attempting a ‘rogue deployment’ or sabotaging part of their processes?
Sure, but a large part of my point is that I don’t expect public-facing accidents (especially not accidents that kill people) until it’s too late, so this isn’t a very relevant counterfactual.
In contrast, I do expect a basically smooth spectrum of incidents and accidents, and I expect control shifts the distribution away from small and moderately large incidents and toward x-risk (that’s the main point).
Can you express what you believe in this frame? My paraphrase is that you think control decreases the risk approximately uniformly across scales, and that you expect some discontinuity between ‘kills zero people’ and ‘kills some people’, where ‘and also kills everyone’ is very close to ‘kills some people’.
I don’t think so. For every one of those failure modes other than escape, there is no chance of detection in the wild, so the choice is between catching the internal issue and catching nothing.
I deeply distrust the analytical approach of trying to enumerate failure modes and reason from that.
...people working there having deep and meaningful conversations about alignment with the internal versions of AIs...
Why do you assume this isn’t captured by the control schemes we’re targeting? This feels like a special case to me. I am in practice less worried about this than you seem to be, but I do think we should analyze questions like “could the AIs be leading people astray in costly ways”, and it seems pretty doable to improve the default tradeoffs here.
Because I don’t think it will be easy to evaluate “leading people astray in costly ways”.
IMO, the discontinuity that is sufficient here is that I expect societal responses to be discontinuous rather than continuous. In particular, I expect societal responses to come when people start losing jobs en masse, and at that point either the AI is aligned well enough that existential risk is avoided, or the takeover is already inevitable and we have very little influence over the outcome.
On this point:
Meaningful representative example of what class? I think it’s representative of ‘weird stuff may happen’, not of ‘we will get more teenage-intern-trapped-in-a-machine characters’.
Yeah, I expect society to basically not respond at all if weird stuff just happens, unless we assume more here. In particular, I think societal response is very discontinuous, even if AI progress is continuous, for both good and bad reasons.
Sure, or with properly implemented ~anything related to controlling the AIs’ behavior. I don’t really expect incidents like Sydney in the future, nor do I think Sydney was that helpful in motivating a societal response? So, this doesn’t feel like a meaningful representative example.
According to this report, Sydney’s relatives are alive and well as of last week.