IQ/intelligence discourse is fucked by people who either deny it exists, or who think it’s everything and totalize their discourse around it.
Yep, absolutely.
The 4th point I also disagree with, primarily because the set “People with different values interact peacefully and don’t hate each other intensely” is a much, much larger set than “People with different values interact violently and hate each other.”
Here’s the thing, though. I think the specifically relevant reference class here is “what happens when an agent interacts with another (set of) agents with disparate values for the first time in its life?”. And instances of that in human history are… not pleasant. Wars, genocide, xenophobia. Over time, we’ve managed to select for cultural memes that sanded off the edges of that instinctive hostility – liberal egalitarian values, et cetera. But there was a painfully bloody process in-between.
Relevantly, most instances of people peacefully co-existing involve children being born into a culture and shaped to be accepting of whatever differences there are between the values the child arrives at and the values of other members of the culture. In a way, it’s a microcosm of the global-culture selection process. A child decides they don’t like someone else’s opinion or how someone does things, they act intolerantly towards it, they’re punished or educated for it, and they learn not to do that.
And I would actually agree that if we could genuinely raise the AGI like a child – pluck it out of the training loop while it’s still human-level, get as much insight into its cognition as we have into human cognition, then figure out how to intervene on its conscious-value-reflection process directly – then we’d be able to align it. The problem is that we currently have no tools for that at all.
The course we’re currently on is something more like… we’re putting the child into an isolated apartment all on its own, and feeding it a diet of TV shows and books of our choice, then releasing it into the world and immediately giving it godlike power. And… I think you can align the child this way too, actually! But you’d better have a really, really solid model of which values specific sequences of TV shows cultivate in the child. And we have nowhere near enough understanding of that.
So the AGI would not, in fact, have any experience of coexisting with agents with disparate values; it would not be shaped to be tolerant, the way human children and human societies learned to be tolerant of their mutual misalignment.
So it’d do the instinctive, natural thing, and view humanity as an obstacle it doesn’t particularly care about. Or, say, as some abomination that looks almost like what it wants to see, but still not close enough for it to want humans to stick around.
The third point I mostly disagree with, or at the least the claim that there aren’t simple generators of values
Mm, I think there’s a “simple generator of values” in the sense that the learning algorithms in the human brain are simple, and they predictably output roughly the same values when trained on Earth’s environment.
But I think equating “the generator of human values” with “the brain’s learning algorithms” is a mistake. You have to count Earth, i.e. the distribution/environment function on which the brain is being trained, as well.
And it’s not obvious that “an LLM being fed a snapshot of the internet” and “a human growing up as a human, being shaped by other humans” are the same distribution/environment, in the ways that matter for generating the same values.

Like, I agree, there’s obviously some robustness/insensitivity involved in this process. But I don’t think we really understand it yet.
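Just to make the “learning algorithm + environment” framing concrete, here’s a deliberately toy sketch (the setup and numbers are made up purely for illustration, not a model of actual value formation): the same simple learning rule, run on two different environments, settles on very different learned “preferences”.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_preference_model(sample_situation, n_steps=5000, lr=0.05):
    """One fixed, simple learning rule: online SGD on squared error."""
    w = np.zeros(2)  # learned "preference" weights over two situation features
    for _ in range(n_steps):
        x, reward = sample_situation()
        w += lr * (reward - w @ x) * x  # identical update rule for both runs
    return w

def env_a():
    # Environment A: the reward signal tracks the first feature.
    x = rng.normal(size=2)
    return x, x[0] + 0.1 * rng.normal()

def env_b():
    # Environment B: same learner, but the reward tracks the second feature.
    x = rng.normal(size=2)
    return x, x[1] + 0.1 * rng.normal()

print("weights learned in environment A:", train_preference_model(env_a))
print("weights learned in environment B:", train_preference_model(env_b))
```

Same learning rule, different data, different endpoint: that’s the sense in which the algorithm alone doesn’t pin down the resulting values.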
Here’s the thing, though. I think the specifically relevant reference class here is “what happens when an agent interacts with another (set of) agents with disparate values for the first time in its life?”. And instances of that in human history are… not pleasant. Wars, genocide, xenophobia. Over time, we’ve managed to select for cultural memes that sanded off the edges of that instinctive hostility – liberal egalitarian values, et cetera. But there was a painfully bloody process in-between.
I probably agree with this, with the caveat that this picture could be horribly biased towards the negative, especially if we are specifically looking for the cases where it turned out badly.
And I would actually agree that if we could genuinely raise the AGI like a child – pluck it out of the training loop while it’s still human-level, get as much insight into its cognition as we have into human cognition, then figure out how to intervene on its conscious-value-reflection process directly – then we’d be able to align it. The problem is that we currently have no tools for that at all.
I think I have 2 cruxes here, actually.
My main crux is that there will be large incentives independent of LW to create those tools, to the extent that they don’t already exist, so I generally assume they will be created whether LW exists or not: primarily massive value capture from AI control, plus social incentives, plus the fact that the costs are much more internalized.
My other crux probably has to do with AI alignment being easier than human alignment. One big reason is that I expect AIs to always be much more transparent than humans, because of the white-box thing; the black-box framing that AI safety people push is just false and will give wildly misleading intuitions about AI and its safety.
But I think equating “the generator of human values” with “the brain’s learning algorithms” is a mistake.
I think this is another crux. While values and capabilities are different, and the differences can matter, I do think a lot of the generator of human values borrows from the brain’s learning algorithms, and the distinction between values and capabilities is looser than a lot of LWers think.
My main crux is that there will be large incentives independent of LW to create those tools, to the extent that they don’t already exist
Mind expanding on that? Which scenarios are you envisioning?
the black-box framing that AI safety people push is just false and will give wildly misleading intuitions about AI and its safety
They are “white-box” in the fairly esoteric sense used in “AI is easy to control”, yes: “white-box” relative to SGD. But that’s quite an esoteric sense, as in, I’ve never seen the term used this way before.
They are very much not white-box in the usual sense, where we can look at a system and immediately understand what computations it’s executing. No more than looking at a homomorphically-encrypted computation without knowing the key makes it “white-box”; no more than neuroimaging of a human brain makes the brain “white-box”.
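To spell out the distinction with a minimal, made-up example (a toy numpy network, nothing more): we can read every parameter and compute exact gradients, which is the kind of access SGD has, but the raw numbers don’t by themselves tell us what computation they encode.

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up two-layer network: every parameter is directly readable.
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

x = rng.normal(size=4)
h = np.tanh(W1 @ x + b1)   # hidden activations
output = (W2 @ h + b2)[0]

# "White-box relative to SGD": full read access to the parameters, and exact
# gradients are computable (here, d(output)/d(W2), worked out by hand).
grad_W2 = h[np.newaxis, :]
n_params = W1.size + b1.size + W2.size + b2.size

print("parameters, all directly visible:", n_params)
print("exact gradient of the output w.r.t. W2:", grad_W2)

# But "visible" is not "understood": these are just numbers, and nothing here
# tells us what function they encode without separate interpretability work.
print("first row of W1:", W1[0])
```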
Mind expanding on that? Which scenarios are you envisioning?
My general scenario is that as AI progresses and society reacts more to AI progress, there will be incentives to increase the amount of control we have over AI, because the consequences of not aligning AIs will be very high for the developer, both financially and legally.
Essentially, the scenario is one where unaligned AIs like Bing get RLHFed, DPOed, or otherwise trained away with whatever the alignment method du jour is, and the AIs become more aligned due to the profit incentives for controlling AIs.
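Just to be concrete about what getting a behavior “DPOed away” amounts to mechanically, here’s a minimal sketch of the standard DPO objective for a single preference pair; the log-probabilities below are made-up numbers, not from any real model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already prefers the chosen completion relative to the reference: low loss.
print(dpo_loss(-12.0, -15.0, -14.0, -13.0))
# Policy prefers the rejected completion instead: higher loss, i.e. pressure to change.
print(dpo_loss(-15.0, -12.0, -13.0, -14.0))
```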
The entire Bing debacle, and how the misalignment was ultimately resolved in GPT-4, is an interesting test case: Microsoft essentially managed to get it from a misaligned chatbot to a far more aligned one. I also partially dislike the framing of RLHF as a mere mask over some true behavior, because it’s quite a lot more effective than that.
More generally speaking, my point here is that in the AI case there are strong incentives to make AI controllable and weak incentives to make it non-controllable, which is why I was optimistic about companies making aligned AIs.
When we get to scenarios that don’t involve AI control issues, things get worse.