This seems interesting, but I’ve seen no plausible case that there’s a version of (1) that’s both sufficient and achievable. I’ve seen Davidad mention, e.g., approaches using a boundaries formalization. This seems achievable, but clearly not sufficient. (Boundaries don’t help with e.g. [allow the mental influences that are desirable, but not those that are undesirable].)
The [act sufficiently conservatively for safety, relative to some distribution of safety specifications] constraint seems likely to lead to paralysis (either of the form [AI system does nothing], or [AI system keeps the world locked into some least-harmful path], depending on the setup—and here of course “least harmful” isn’t a utopia, since it’s a distribution of safety specifications, not desirability specifications). Am I mistaken about this?
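(The reason I expect this: my own rough formalization of the constraint, not anything from the agenda, would be something like

$$\Pr\big[\,\pi \text{ violates } \phi\,\big] \le \epsilon \quad \text{for every } \phi \text{ in the support of the spec distribution } D,$$

or at least in expectation over $\phi \sim D$. As the support of $D$ widens to cover our uncertainty about what “safe” means, the feasible set of policies shrinks toward those that change the world as little as possible, which is exactly the do-nothing / lock-in pair above.)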
I’m very pleased that people are thinking about this, but I fail to understand the optimism—hopefully I’m confused somewhere! Is anyone working on toy examples as proof of concept?
I worry that there’s so much deeply technical work here that not enough time is being spent to check that the concept is workable (is anyone focusing on this?). I’d suggest focusing on mental influences: what kind of specification would allow me to radically change my ideas, but not to be driven insane? What’s the basis to think we can find such a specification?
It seems to me that finding a fit-for-purpose safety/acceptability specification won’t be significantly easier than finding a specification for ambitious value alignment.
It seems plausible to me that, until ambitious value alignment is solved, ASL-4+ systems ought not to have any mental influences on people other than those which factor through the system’s pre-agreed goals being achieved in the world. That is, ambitious value alignment seems like a necessary prerequisite for the safety of ASL-4+ general-purpose chatbots. However, world-changing GDP growth does not require such general-purpose capabilities to be directly available (rather than available via a sociotechnical system that involves agreeing on specifications and safety guardrails for particular narrow deployments).
It is worth noting here that a potential failure mode is that a truly malicious general-purpose system in the box could decide to encode harmful messages in irrelevant details of the engineering designs (which it then proves satisfy the safety specifications). But I think sufficient fine-tuning with a GFlowNet objective will naturally penalise description complexity, and also penalise heavily biased sampling of equally complex solutions (e.g. toward ones that encode messages of any significance), and I expect this to reduce the risk to an acceptable level. I would like to fund a sleeper-agents-style experiment on this by the end of 2025.
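To illustrate the property I am relying on (a toy sketch of my own; the candidate designs, penalty strength, and spec check below are all made up for illustration): a trained GFlowNet samples outputs with probability proportional to reward, so a reward that includes a description-length penalty exponentially downweights message-carrying elaborations of an otherwise-adequate design, and designs of equal reward are sampled without systematic bias toward any one of them.

```python
# Toy sketch (mine, not from the GSA agenda or any GFlowNet library): a GFlowNet-style
# sampler targets p(x) proportional to reward R(x). If R(x) includes an
# exp(-beta * description_length) factor, longer designs are exponentially downweighted,
# and designs of equal length and reward get equal target probability, so a sampler that
# is heavily biased toward one particular equal-complexity design (e.g. one that encodes
# a hidden message) is, by construction, off-target.

import math

BETA = 1.0  # assumed strength of the description-length penalty


def reward(design: str, satisfies_spec: bool) -> float:
    """Spec indicator times a complexity penalty (a stand-in for a proof-checked spec)."""
    if not satisfies_spec:
        return 0.0
    return math.exp(-BETA * len(design))


# Hypothetical candidates: two spec-satisfying designs of equal length, plus an
# elaborated version of the first whose extra characters could encode a message.
candidates = {
    "designA": True,
    "designB": True,
    "designA+hidden-msg": True,
}

rewards = {d: reward(d, ok) for d, ok in candidates.items()}
total = sum(rewards.values())
target = {d: r / total for d, r in rewards.items()}

for design, prob in target.items():
    print(f"{design:>20}: {prob:.6f}")
# designA and designB get equal mass; the message-carrying elaboration gets
# exponentially less, regardless of how clever its contents are.
```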
[again, the below is all in the spirit of “I think this direction is plausibly useful, and I’d like to see more work on it”]
“not to have any mental influences on people other than those which factor through the system’s pre-agreed goals being achieved in the world.”
Sure, but this seems to say “Don’t worry, the malicious superintelligence can only manipulate your mind indirectly”. This is not the level of assurance I want from something calling itself “Guaranteed safe”.
“It is worth noting here that a potential failure mode is that a truly malicious general-purpose system in the box could decide to encode harmful messages in irrelevant details”
This is one mechanism by which such a system could cause great downstream harm. Suppose that we have a process to avoid this. What assurance do we have that there aren’t other mechanisms to cause harm?
I don’t yet buy the description complexity penalty argument (as I currently understand it—but quite possibly I’m missing something). It’s possible to manipulate by strategically omitting information. Perhaps the “penalise heavily biased sampling” is intended to avoid this (??). If so, I’m not sure how this gets us more than a hand-waving argument. I imagine it’s very hard to do indirect manipulation without adding much complexity. I imagine that ASL-4+ systems are capable of many very hard things.
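To put a rough number on why the complexity penalty alone doesn’t reassure me (toy figures of my own, purely illustrative): if the spec admits many equally complex, equally rewarded designs, then which one gets emitted is a covert channel that a complexity penalty can’t see at all, and everything then rests on the unbiased-sampling claim.

```python
# Rough capacity estimate for the "choice among equally good solutions" channel.
# All numbers below are assumptions for illustration, not estimates from the agenda.

import math

n_equivalent_designs = 2 ** 40     # assumed: many interchangeable low-level design choices
bits_per_output = math.log2(n_equivalent_designs)
outputs_per_year = 1000            # assumed deployment scale

print(f"bits per output: {bits_per_output:.0f}")
print(f"bits per year:   {bits_per_output * outputs_per_year:,.0f}")
# ~40 bits per output, ~40,000 bits per year, carried with zero increase in description
# complexity; ruling this out has to come entirely from the unbiased-sampling argument,
# which is the part I don't yet see a guarantee for.
```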
Similar reasoning leads me to initial skepticism of all [safety guarantee by penalizing some-simple-x] claims. This amounts to a claim that reducing x necessarily makes things safer—which I expect is untrue for any simple x. I can buy that there are simple properties whose reduction guarantees safety if it’s done to an extreme degree—but then I’m back to expecting the system to do nothing useful.
As an aside, I’d note that such processes (e.g. complexity penalties) seem likely to select out helpful behaviours too. That’s not a criticism of the overall approach—I just want to highlight that I don’t think we get to have both [system provides helpful-in-ways-we-hadn’t-considered output] and [system can’t produce harmful output]. Allowing the former seems to allow the latter.
“I would like to fund a sleeper-agents-style experiment on this by the end of 2025”
That’s probably a good idea, but this kind of approach doesn’t seem in keeping with a “Guaranteed safe” label. More of a “We haven’t yet found a way in which this is unsafe”.
Paralysis of the form “AI system does nothing” is the most likely failure mode. This is a “de-pessimizing” agenda at the meta-level as well as at the object-level. Note, however, that there are some very valuable and ambitious tasks (e.g. build robots that install solar panels without damaging animals or irreversibly affecting existing structures, and that only talk to people via a highly structured script) that can likely be specified without causing paralysis, even if they fall short of ending the acute risk period.
“Locked into some least-harmful path” is a potential failure mode if the semantics or implementation of causality or decision theory in the specification framework are done in a different way than I hope. Locking into a particular path massively reduces the entropy of the outcome distribution beyond what is necessary to ensure that a reasonable risk threshold (e.g. 1 catastrophic event per millennium) is cleared. A FEEF objective (namely, minimize the divergence of the outcomes conditional on intervention from the outcomes conditional on filtering for the goal being met) would greatly penalize the additional facts which are enforced by the lock-in behaviours. As a fail-safe, I propose to mitigate the downsides of lock-in by using time-bounded utility functions.
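Spelled out roughly (the exact divergence measure and variable choices would depend on the specification framework), the FEEF objective above is something like

$$\min_{\pi}\; D_{\mathrm{KL}}\!\Big[\, p\big(\text{outcomes} \mid \mathrm{do}(\pi)\big) \;\Big\|\; p\big(\text{outcomes} \mid \text{goal met}\big) \Big].$$

A lock-in policy concentrates the interventional distribution on one narrow path far more than the goal-filtered distribution is concentrated, and that excess concentration shows up directly as divergence.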
(understood that you’d want to avoid the below by construction through the specification)
I think the worries about a “least harmful path” failure mode would also apply to a “below 1 catastrophic event per millennium” threshold. It’s not obvious to me that the vast majority of ways to [avoid significant risk of catastrophe-according-to-our-specification] wouldn’t be highly undesirable outcomes.
It seems to me that “greatly penalize the additional facts which are enforced” is a two-edged sword: we want various additional facts to be highly likely, since our acceptability specification doesn’t capture everything that we care about.
I haven’t thought about it in any detail, but doesn’t using time-bounded utility functions also throw out any acceptability guarantee for outcomes beyond the time-bound?
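To make the question concrete (my own formalization, not anything from the agenda): if the guarantee has the form

$$\Pr\big[\,\text{spec violated at some } t \le T \;\big|\; \mathrm{do}(\pi)\,\big] \le \epsilon,$$

then nothing in the objective or the guarantee constrains the state at any $t > T$; whatever assurance we want about post-horizon outcomes has to come from somewhere else (e.g. the specification separately ruling out irreversible effects within the horizon).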