I think the DSA framing is in keeping with the spirit of “first critical try” discourse.
(With that in mind, the below is more “this too seems very important”, rather than “omitting this is an error”.)
However, I think it’s important to consider scenarios where humans lose meaningful control without any AI or group of AIs necessarily gaining a DSA. I think “loss of control” is the threat to think about, not “AI(s) take(s) control”. Admittedly this gets into Moloch-related grey areas—but this may indicate that [humans do/don’t have control] is too coarse-grained a framing.
I’d say that the key properties of “first critical try” are:
We have the option to trigger some novel process.
We’re unlikely to stop the process once it starts, even if it’s not going well.
This includes both [we can’t stop it] and [we won’t stop it].
If the process goes badly, the odds of doom greatly increase.
There’s a significant chance the process goes badly.
My guess is that the most likely near-term failure mode doesn’t start out as [some set of AIs gets a DSA], but rather as [AI capability increase selects against meaningful human control]; the DSA stuff is downstream of that.
This is a possibility with the [individually controllable powerful AI assistants] approach—whether or not this immediately takes things to transformational AI territory. Suppose we get the hoped-for >10x research speedup. Do we have a principled strategy for controlling the collective system this produces? I haven’t heard one. I wouldn’t say we’re doing a good job of controlling the current collective system.
I’ve heard cases for [this will speed things up] and [here are some good things this would make easier], but not for [overall, such a process should be expected to take things in a less doomy direction].
For such cases, “you can’t learn enough from analogous but lower-stakes contexts” ought not to apply. However, I’d certainly expect “we won’t learn enough from analogous but lower-stakes contexts” to hold (without huge efforts to avoid this).
Does any specific human or group of humans currently have “control” in the sense of “that which is lost in a loss-of-control scenario”? If not, that indicates to me that it may be useful to frame the risk as “failure to gain control”.
It may be better to think about it that way, yes—in some cases, at least.
Probably it makes sense to throw in some more variables.
Something like:
To stand x chance of property p applying to system s, we’d need to apply resources r.
In these terms, [loss of control] is something like [ensuring important properties hold becomes much more expensive (or impossible)].
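To make the shape of that slightly more explicit (my own notation, just a sketch, not anything canonical): treat the required resources as a function of the system, the property, and the target probability,

$$r_{\min}(s, p, x) \;=\; \min\{\, r : \Pr[\,p \text{ holds of } s \mid \text{resources } r \text{ are applied}\,] \ge x \,\}.$$

Then [loss of control] with respect to p reads as: as s changes, $r_{\min}(s, p, x)$ grows far beyond the resources actually available, or no finite $r$ achieves the target $x$ at all.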
I think the most important part of your “To stand x chance of property p applying to system s, we’d need to apply resources r” model is the word “we”.
Currently, there exists no “we” in the world that can ensure that nobody does some form of research, or at least no “we” that can do so non-cataclysmically. The International Atomic Energy Agency comes the closest of any group I’m aware of, but its scope is limited, and it operates mainly by controlling access to specific physical resources rather than by trying to prevent people from doing something with resources they already possess.
If “gain a DSA (or cause some trusted other group to gain a DSA) over everyone who could plausibly gain a DSA in the future” is a required part of your threat-mitigation strategy, I am not optimistic about the chances of success, but I’m even less optimistic about it working if you don’t realize that’s the game you’re trying to play.
I don’t think [gain a DSA] is the central path here.
It’s much closer to [persuade some broad group that already has a lot of power collectively].
I.e. the likely mechanism is not [add the property [has DSA] to [group that will do the right thing]], but closer to [add the property [will do the right thing] to [group that has DSA]].