I disagree that it is actually “first-principles”. It is based on generalizing from humans, and on the types of entities (idealized utility-maximizing agents) that humans could be modeled as approximating in specific contexts in which they steer the world towards their goals most powerfully.
Yeah, but if you generalize from humans another way (“they tend not to destroy the world and tend to care about other humans”), you’ll come to a wildly different conclusion. The conclusion should not be sensitive to poorly motivated reference classes and frames, unless it’s really clear why we’re using one frame. This is a huge peril of reasoning by analogy.
Whenever attempting to draw conclusions by analogy, it’s important that there be shared causal mechanisms which produce the outcome of interest. For example, I can simulate a spring using an analog computer because both systems are roughly governed by similar differential equations. In shard theory, I posited that there’s a shared mechanism of “local updating via self-supervised and TD learning on ~randomly initialized neural networks” which leads to things like “contextually activated heuristics” (or “shards”).
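(As an aside, here is a minimal sketch of the spring point, in Python: the analog computer can stand in for the spring only because both systems obey the same differential equation, here m·x″ + c·x′ + k·x = 0, so anything that integrates that equation reproduces the spring's behavior. All parameter values below are arbitrary and purely illustrative.)

```python
# Minimal sketch: the spring and the analog computer are interchangeable only
# because both obey the same ODE, m*x'' + c*x' + k*x = 0. Anything that
# integrates that equation reproduces the spring's trajectory.
# All parameter values are arbitrary, for illustration only.

def simulate_spring(m=1.0, c=0.2, k=4.0, x0=1.0, v0=0.0, dt=0.001, steps=10_000):
    """Integrate a damped harmonic oscillator with semi-implicit Euler."""
    x, v = x0, v0
    trajectory = []
    for _ in range(steps):
        a = -(c * v + k * x) / m  # acceleration dictated by the shared ODE
        v += a * dt               # update velocity first (semi-implicit Euler)
        x += v * dt               # then position
        trajectory.append(x)
    return trajectory

print(simulate_spring()[::2000])  # samples of a slowly damping oscillation
```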
Here, it isn’t clear what the shared mechanism is supposed to be, such that both (future) AI and humans have it. Suppose I grant that if a system is “smart” and has “goals”, then bad things can happen. Let’s call that the “bad agency” hypothesis.
But how do we know that future AI will have the relevant cognitive structures for “bad agency” to be satisfied? How do we know that the AI will have internal goal representations which chain into each other across contexts, so that the AI reliably pursues one or more goals over time? How do we know that the mechanisms are similar enough for the human->AI analogy to provide meaningful evidence on this particular question?
I expect there to be “bad agency” systems eventually, but it really matters what kind we’re talking about. If you’re thinking of “secret deceptive alignment that never externalizes in the chain-of-thought” and I’m thinking about “scaffolded models prompted to be agentic and bad”, then our interventions will be wildly different.
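(To make that distinction concrete, here is a deliberately toy sketch of the "scaffolded agent" threat model: the goal-directedness lives in an outer loop of prompts and accumulated history wrapped around a stateless model call, not inside any single forward pass. `call_model` is a hypothetical stub, not any real API.)

```python
# Toy sketch of "scaffolded agency": the loop, not the model's forward pass,
# supplies memory, persistence, and goal-pursuit. `call_model` is a
# hypothetical stub standing in for a single stateless model call.

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for one stateless model call (no real API)."""
    # Canned behavior so the sketch runs end-to-end.
    return "look up flight prices" if "History: []" in prompt else "DONE"

def run_scaffolded_agent(goal: str, max_steps: int = 10) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        prompt = f"Goal: {goal}\nHistory: {history}\nNext action?"
        action = call_model(prompt)   # the model itself is stateless...
        history.append(action)        # ...state and goal-persistence live out here
        if action == "DONE":
            break
    return history

print(run_scaffolded_agent("book a flight"))
# -> ['look up flight prices', 'DONE']
```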
Yeah, but if you generalize from humans another way (“they tend not to destroy the world and tend to care about other humans”), you’ll come to a wildly different conclusion
Sure. I mean, that seems like a meaningfully weaker generalization, but sure. That’s not the main issue.
Here’s how the whole situation looks from my perspective:
We don’t know how generally-intelligent entities like humans work, or what the general-intelligence capability is entangled with.
Our only reference point is humans. Humans exhibit a lot of dangerous properties, like deceptiveness and consequentialist-like reasoning that seems to be able to disregard contextually-learned values.
There are some gears-level models that suggest intelligence is necessarily entangled with deception-ability (e.g., mine), and some gears-level models that suggest it’s not (e.g., yours). Overall, we have no definitive evidence either way. We have not reverse-engineered any generally-intelligent entities.
We have some insight into how SOTA AIs work. But SOTA AIs are not generally intelligent. Whatever safety assurances our insights into SOTA AIs give us do not necessarily generalize to AGI.
SOTA AIs are, nevertheless, superhuman at some of the tasks we’ve managed to get them working on so far. By volume, GPT-4 can outperform teams of coders, and Midjourney is putting artists out of business. The hallucinations are a problem, but if they were gone, these systems would plausibly wipe out whole industries.
An AI that outperforms humans at deception and strategy by the same margin by which GPT-4/Midjourney outperform them at writing/coding/drawing would plausibly be an extinction-level threat.
The AI industry leaders are purposefully trying to build a generally-intelligent AI.
The AI industry leaders are not rigorously checking every architectural tweak or cute AutoGPT setup to ensure that it’s not going to give their model room to develop deceptive alignment and other human-like issues.
Summing up: There’s reasonable doubt regarding whether AGIs would necessarily be deception-capable. Highly deception-capable AGIs would plausibly be an extinction risk. The AI industry is currently trying to blindly-but-purposefully wander in the direction of AGI.
Even shorter: There’s a plausible case that, on its current course, the AI industry is going to generate an extinction-capable AI model.
There are no ironclad arguments against that, unless you buy into your inside-view model of generally-intelligent cognition as hard as I buy into mine.
And what you effectively seem to be saying is “until you can rigorously prove that AGIs are going to develop dangerous extinction-level capabilities, it is totally fine to continue blindly scaling and tinkering with architectures”.
What I’m saying is “until you can rigorously prove that a given scale-up plus architectural tweak isn’t going to result in a superhuman extinction-enthusiastic AGI, you should not be allowed to test that empirically”.
Yes, “prove that this technological advance isn’t going to kill us all or you’re not allowed to do it” is a ridiculous standard to apply in the general case. But in this one case, there’s a plausible-enough argument that it might, and that argument has not actually been soundly refuted by our getting some insight into how LLMs work and coming up with a theory of their cognition.
And what you effectively seem to be saying is “until you can rigorously prove that AGIs are going to develop dangerous extinction-level capabilities, it is totally fine to continue blindly scaling and tinkering with architectures”.
No, I am in fact quite worried about the situation and think there is a 5-15% chance of huge catastrophe on the current course! But I think these AGIs won’t be within-forward-pass deceptively aligned, and instead their agency will e.g. come from scaffolding-like structures. I think that’s important. I think it’s important that we not e.g. anchor on old speculation about AIXI or within-forward-pass deceptive alignment or whatever, and instead consider more realistic threat models and where we can intervene. That doesn’t mean it’s fine and dandy to keep scaling with no concern at all.
The reason my percentage is “only 5 to 15” is that I expect society and firms to deal with these problems as they come up, and for that to generalize pretty well to the next step of experimentation and capabilities advancements; for systems to remain tools until invoked into agents; etc.
(Hopefully this comment of mine clarifies; it feels kinda vague to me.)
What I’m saying is “until you can rigorously prove that a given scale-up plus architectural tweak isn’t going to result in a superhuman extinction-enthusiastic AGI, you should not be allowed to test that empirically”.
But I do think this is way too high of a bar.
No, I am in fact quite worried about the situation
Fair, sorry. I appear to have been arguing with my model of someone holding your general position, rather than with my model of you.
I think these AGIs won’t be within-forward-pass deceptively aligned, and instead their agency will e.g. come from scaffolding-like structures
Would you outline your full argument for this and the reasoning/evidence backing that argument?
To restate: My claim is that, no matter how much empirical evidence we have regarding LLMs’ internals, until we have either an AGI we’ve empirically studied or a formal theory of AGI cognition, we cannot say whether shard-theory-like or classical-agent-like views on it will turn out to have been correct. Arguably, both sides of the debate have about the same amount of evidence: generalizations from maybe-valid, maybe-not reference classes (humans vs. LLMs) and ambitious but non-rigorous mechanistic theories of cognition (the shard theory vs. coherence theorems and their ilk stitched into something like my model).
Would you disagree? If yes, how so?