Vaniver comments on My Objections to “We’re All Gonna Die with Eliezer Yudkowsky”

Vaniver 23 Mar 2023 20:22 UTC
LW: 5 AF: 1
3
> Then the model can safely scale.
If the experiences that change the model don’t lead to less of the initial good values, then yeah, for an approximate definition of safety. You’re resting everything on the continued strength of this model as capabilities increase, and so if it fails before you top out the scaling I think you probably lose.
FWIW I don’t really see your description as, like, a specific alignment strategy so much as the strategy of “have an alignment strategy at all”. The meat is all in 1) how you identify the core of human values and 2) how you identify which experiences will change the system to have less of the initial good values, but, like, figuring out the two of those would actually solve the problem!
- anithite 24 Mar 2023 22:10 UTC
  2 points
  1
  > if it fails before you top out the scaling I think you probably lose
  While I agree that arbitrary scaling is dangerous, stopping early is an option. Near human AGI need not transition to ASI until the relevant notKillEveryone problems have been solved.
  > The meat is all in 1) how you identify the core of human values and 2) how you identify which experiences will change the system to have less of the initial good values, but, like, figuring out the two of those would actually solve the problem!
  The alignment strategy seems to be “what we’re doing right now” which is:
  - feed the base model human generated training data
  - apply RL-type stuff (RLHF, RLAIF, etc.) to reinforce the good type of internet-learned behavior patterns
  This could definitely fail eventually if RLAIF-style self-improvement is allowed to go on long enough, but crucially, especially with RLAIF and other strategies that set the AI to training itself, there’s a scalable, mostly aligned intelligence right there that can help. We’re not trying to safely align a demon so much as avoid getting to “demon” from the somewhat aligned thing we have now.
  - Eli Tyre 11 Apr 2024 7:32 UTC
    2 points
    0
    > Near human AGI need not transition to ASI until the relevant notKillEveryone problems have been solved.
    How much is this central to your story of how things go well?
    
    I agree that humanity could do this (or at least it could if it had its shit together), and I think it’s a good target to aim for that buys us sizable success probability. But I don’t think it’s what’s going to happen by default.
    - anithite 12 Apr 2024 23:06 UTC
      1 point
      0
      Slower is better, obviously, but as to the inevitability of ASI, I think reaching 99th-percentile human capabilities in a handful of domains is enough to stop the current race. Getting there is probably not too dangerous.
      - Eli Tyre 13 Apr 2024 7:05 UTC
        4 points
        0
        Stop it how?
        - anithite 17 Apr 2024 4:54 UTC
          1 point
          0
          Vulnerable world hypothesis (but takeover risk rather than destruction risk). That plus first-mover advantage could stop things pretty decisively without requiring ASI alignment.
          
          As an example, taking over most networked computing devices seems feasible in principle with thousands of +2SD AI programmers/security researchers. That requires an AlphaGo-level breakthrough for RL as applied to LLM programmer-agents.
          
          One especially low-risk, low-complexity option is a stealthy takeover of other AI labs’ compute, then faking another AI winter. This might get you most of the compute and impact you care about without actively pissing off everyone.
          
          If one is more confident in jailbreak prevention and software hardening, secrecy is less important.
          
          First-mover advantage depends on the ability to fix vulnerabilities and harden infrastructure to prevent a second group from taking over. To the extent AI is required for management, jailbreak prevention/mitigation will also be needed.