I tried playing the game Nate suggested with myself. I think it updated me a bit more towards Holden’s view, though I’m very confident that if I played it with someone more expert than I am, both the attacker and the defender would be more competent, and possibly the attacker would win.
Attacker: okay, let’s start with a classic: the alignment strategy of “kill all the capabilities researchers so networks are easier to interpret.”
Defender: Arguments that this is a bad idea will obviously be in any training set that a human-level AI researcher would be trained on, e.g. posts from this site.
Attacker: sure. But those arguments don’t address what an AI would be considering after many cycles of reflection. For example: it might observe that Alice endorses things like war where people are killed “for the greater good”, and a consistent extrapolation from these beliefs is that murder is acceptable.
Defender: still pretty sure that the training corpus would include stuff about the state having a monopoly on violence, and any faithful attempt to replicate Alice would just clearly not have her murdering people? Like a next token prediction that has her writing a serial killer manifesto would get a ton of loss.
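(Aside, to make the “ton of loss” claim concrete: here’s a minimal toy sketch, with made-up probabilities and a hypothetical sequence_loss helper, of how the cross-entropy a faithful Alice-predictor pays blows up for a wildly out-of-character continuation.)

```python
import math

# Toy sketch (all probabilities are made up for illustration): the total
# negative log-likelihood a next-token predictor pays for a continuation.
def sequence_loss(token_probs):
    """Sum of -log p over the tokens of a continuation."""
    return -sum(math.log(p) for p in token_probs)

# In-character continuation (Alice writing about interpretability):
# the Alice-predictor assigns each token reasonably high probability.
in_character = [0.4, 0.3, 0.5, 0.35]

# Out-of-character continuation (a serial-killer manifesto):
# each token gets vanishingly small probability.
out_of_character = [1e-6, 1e-7, 1e-6, 1e-5]

print(f"in-character loss:     {sequence_loss(in_character):.1f} nats")     # ~3.9
print(f"out-of-character loss: {sequence_loss(out_of_character):.1f} nats")  # ~55.3
```

(The attacker’s next move is essentially an attempt to sidestep this penalty: it never requires an overtly out-of-character token sequence from Alice.)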
Attacker: it’s true that you probably wouldn’t predict that Alice would actively murder people, but you would predict that she would be okay with allowing people to die through her own inaction (standard “child drowning in a pond” thought experiment stuff). And just like genocides are bureaucratized such that no single individual feels responsible, the AI might come up with some system which doesn’t actually leave Alice feeling responsible for the capabilities researchers dying.
(Meta point: when does something stop being POUDA? Like what if Alice’s CEV actually is to do something wild (in the opinion of current-Alice)? I think for the sake of this exercise we should assume that Alice would not actually want to do something wild if she knew/reflected more, but this might be ignoring an important threat vector?)
Defender: I’m not sure exactly what this would look like, but I’m imagining something like “build a biological science company with an opaque bureaucracy such that each person pursuing the good somehow results in the collective creating a bioweapon that kills capabilities researchers”, and this just seems really outside what you would expect Alice to do? I concede that there might not be anything in the training set which specifically prohibits this, but it just seems like a wild departure from Alice’s usual work of interpreting neurons.
(Meta: is this Nate’s point? Iterating reflection will inevitably take us wildly outside the training distribution, so our models of what an AI attempting to replicate Alice would do are wildly off? Or is this a point for Holden: the only way we can get POUDA is by doing something that seems really implausible?)