It seems plausible to me that, until ambitious value alignment is solved, ASL-4+ systems ought not to have any mental influences on people other than those which factor through the system’s pre-agreed goals being achieved in the world. That is, ambitious value alignment seems like a necessary prerequisite for the safety of ASL-4+ general-purpose chatbots. However, world-changing GDP growth does not require such general-purpose capabilities to be directly available (rather than available via a sociotechnical system that involves agreeing on specifications and safety guardrails for particular narrow deployments).
It is worth noting here that a potential failure mode is that a truly malicious general-purpose system in the box could decide to encode harmful messages in irrelevant details of the engineering designs (which it then proves satisfy the safety specifications). But, I think sufficient fine-tuning with a GFlowNet objective will naturally penalise description complexity, and also penalise heavily biased sampling of equally complex solutions (e.g. toward ones that encode messages of any significance), and I expect this to reduce this risk to an acceptable level. I would like to fund a sleeper-agents-style experiment on this by the end of 2025.
[again, the below is all in the spirit of “I think this direction is plausibly useful, and I’d like to see more work on it”]
“…not to have any mental influences on people other than those which factor through the system’s pre-agreed goals being achieved in the world.”
Sure, but this seems to say “Don’t worry, the malicious superintelligence can only manipulate your mind indirectly”. This is not the level of assurance I want from something calling itself “Guaranteed safe”.
“It is worth noting here that a potential failure mode is that a truly malicious general-purpose system in the box could decide to encode harmful messages in irrelevant details…”
This is one mechanism by which such a system could cause great downstream harm. Suppose that we have a process to avoid this. What assurance do we have that there aren’t other mechanisms to cause harm?
I don’t yet buy the description complexity penalty argument (as I currently understand it—but quite possibly I’m missing something). It’s possible to manipulate by strategically omitting information, which adds no complexity. Perhaps the “penalise heavily biased sampling” part is intended to avoid this (??). If so, I’m not sure how this gets us more than a hand-waving argument. I imagine it’s very hard to do indirect manipulation without adding much complexity. But I also imagine that ASL-4+ systems are capable of many very hard things.
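For concreteness, here is the mechanism as I currently read it (a sketch under my own assumptions about the setup, not necessarily what’s intended). A GFlowNet trained to completion samples outputs in proportion to its reward, so the claim would presumably be something like:

P_trained(x) ∝ R(x),   with   R(x) = 1[x satisfies the spec] · exp(−β · ℓ(x)),

where ℓ(x) is some description-length measure and β > 0 a penalty weight (both my notation, not the author’s). Under that reading, two spec-satisfying designs of equal length are sampled with equal probability, and every extra bit spent encoding a message costs a factor of exp(−β). That bounds deliberately biased selection among equally long designs, but it doesn’t obviously bound harm that rides on which information is included or omitted at a given length, which is the part I’m stuck on.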
Similar reasoning leads me to initial skepticism of all [safety guarantee by penalizing some-simple-x] claims. This amounts to a claim that reducing x necessarily makes things safer—which I expect is untrue for any simple x. I can buy that there are simple properties whose reduction guarantees safety if it’s done to an extreme degree—but then I’m back to expecting the system to do nothing useful.
As an aside, I’d note that such processes (e.g. complexity penalties) seem likely to select out helpful behaviours too. That’s not a criticism of the overall approach—I just want to highlight that I don’t think we get to have both [system provides helpful-in-ways-we-hadn’t-considered output] and [system can’t produce harmful output]. Allowing the former seems to preclude guaranteeing the latter.
“I would like to fund a sleeper-agents-style experiment on this by the end of 2025.”
That’s probably a good idea, but this kind of approach doesn’t seem in keeping with a “Guaranteed safe” label. More of a “We haven’t yet found a way in which this is unsafe”.