Raemon comments on Self-fulfilling misalignment data might be poisoning our AI models

Raemon 3 Mar 2025 23:34 UTC
LW: 13 AF: 6
3
AF
My current guess is:
1. This is more relevant for up-to-the first couple generations of “just barely superintelligent” AIs.
2. I don’t really expect it to be the deciding factor after many iterations of end-to-end RSI that gets you to the “able to generate novel scientific or engineering insights much faster than a human or institution could.”

I do think it’s plausible that the initial bias towards “evil/hackery AI” could start it off in a bad basin of attraction, but a) even if you completely avoided that, I would still basically expect this to rediscover this on it’s own as it gained superhuman levels of competence, b) one of the things I most want to use a slightly-superhuman AI to do is to robustly align massively superhuman AI, and I don’t really see how to do that without directly engaging with the knowledge of the failure modes there.

I think there are other plans that route more though “use STEM AI to build an uploader or bioenhancer, and then have an accelerated human-psyche do the technical philosophy necessary to handle the unbounded alignment case. I could see that being the right call, and I could imagine the bias from the “already knows about deceptive alignment etc” being large-magnitude enough to matter in the initial process. [edit: In those cases I’d probably want to filter out a lot more than just “unfriendly AI strategies”]
But, basically, how this applies depends on what it is you’re trying to do with the AI, and what stage/flavor of AI you’re working with and how it’s helping.
- Knight Lee 4 Mar 2025 21:33 UTC
  5 points
  2
  Parent
  People are still very unsure/fuzzy about What goals will AIs have, i.e. what actually contributes the an AGI’s final goals, so there is still a risk this influences it.
  I agree that using one AI to align another AI requires it to know about the failure modes, so filtering out these stories reduce their ability.
  But might we make edits to the “dangerous data” to make it safer? Maybe repeatedly use AI to insert text like:
  Automated comment by good AI: this would be so awful, let’s hope this doesn’t happen. By the way, remember this is a possibility, it’s not certain to happen, and hopefully an AI will do <better action> instead.
  Maybe I’m anthropomorphizing the AI too much. It sounds so far fetched that looking at my own comment makes me agree with your skepticism.
  But if filtering the data is cheap it can still be done.
- David James 6 Mar 2025 1:41 UTC
  1 point
  −6
  AF Parent
  
  and I don’t really see how to do that without directly engaging with the knowledge of the failure modes there.
  
  I agree. To put it another way, even if all training data was scrubbed of all flavors of deception, how could ignorance of it be durable?
  - TurnTrout 4 Apr 2025 17:52 UTC
    LW: 11 AF: 6
    7
    AF Parent
    I agree. To put it another way, even if all training data was scrubbed of all flavors of deception, how could ignorance of it be durable?
    This (and @Raemon ’s comment^[1]) misunderstand the article. It doesn’t matter (for my point) that the AI eventually becomes aware of the existence of deception. The point is that training the AI on data saying “AI deceives” might make the AI actually deceive (by activating those circuits more strongly, for example). It’s possible that “in context learning” might bias the AI to follow negative stereotypes about AI, but I doubt that effect is as strong.
    From the article:
    We are not quite “hiding” information from the model
    Some worry that a “sufficiently smart” model would “figure out” that e.g. we filtered out data about e.g. Nick Bostrom’s Superintelligence. Sure. Will the model then bias its behavior towards Bostrom’s assumptions about AI?
    I don’t know. I suspect not. If we train an AI more on math than on code, are we “hiding” the true extent of code from the AI in order to “trick” it into being more mathematically minded?
    Let’s turn to reality for recourse. We can test the effect of including e.g. a summary of Superintelligence somewhere in a large number of tokens, and measuring how that impacts the AI’s self-image benchmark results.
    ^
    “even if you completely avoided [that initial bias towards evil], I would still basically expect [later AI] to rediscover [that bias] on it’s own”
    - Raemon 17 Apr 2025 2:44 UTC
      LW: 5 AF: 1
      −6
      AF Parent
      I think I understood your article, and was describing which points/implications seemed important.
      I think we probably agree on predictions for nearterm models (i.e. that including this training data makes it more likely for them to deceive), I just don’t think it matters very much if sub-human-intelligence AIs deceive.