Mm, there are two somewhat different definitions of what counts as “a natural abstraction”:
I would agree that human values are likely a natural abstraction in the sense that if you point an abstraction-learning algorithm at the dataset of modern humans doing things, “human values” and perhaps even “eudaimonia” would fall out as a natural principal component of that dataset’s decomposition.
What I wouldn’t agree with is that human values are a natural abstraction in the sense that a mind pointed at the dataset of this universe doing things, or at the dataset of animals doing things, or even at the dataset of prehistoric or medieval humans doing things, would learn modern human values.
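To illustrate the first sense, here's a minimal toy sketch, with entirely made-up feature names and numbers, of how a single value-like factor can "fall out" of an unsupervised decomposition; plain PCA stands in for whatever real abstraction-learning algorithm you have in mind:

```python
import numpy as np

# Toy stand-in for "a dataset of modern humans doing things":
# each row is an observed episode, each column a crude behavioral feature.
rng = np.random.default_rng(0)
n_episodes = 1000

# A single latent "what the person cares about" variable drives
# several observable features at once...
latent_value = rng.normal(size=n_episodes)
helping      = 1.0 * latent_value + 0.1 * rng.normal(size=n_episodes)
art_making   = 0.8 * latent_value + 0.1 * rng.normal(size=n_episodes)
leisure      = 0.6 * latent_value + 0.1 * rng.normal(size=n_episodes)
# ...while other features are unrelated noise.
commute_time = rng.normal(size=n_episodes)

X = np.column_stack([helping, art_making, leisure, commute_time])
X -= X.mean(axis=0)

# PCA via SVD: the top right-singular vector is the direction of
# greatest shared variance, i.e. the "natural principal component".
_, _, vt = np.linalg.svd(X, full_matrices=False)
top_component = vt[0]
print(top_component)  # loads heavily on the value-driven features, barely on noise
```

The only point of the toy is that when one latent factor organizes many observed behaviors, a generic decomposition recovers it; whether real human values behave like this is exactly what's in dispute.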
Let’s step back a bit.
Suppose we have a system Alpha and a system Beta, with Beta embedded in Alpha. Alpha starts out with a set of natural abstractions/subsystems. Beta, if it’s an embedded agent, learns these abstractions, and then starts executing actions within Alpha that alter its embedding environment. Over the course of that, Beta creates new subsystems, corresponding to new abstractions.
As concrete examples, you can imagine:
The lifeless universe as Alpha (with abstractions like “stars”, “gasses”, “seas”), and the biosphere as Beta (creating abstractions like “organisms” and “ecosystems” and “predator” and “prey”).
The biosphere as Alpha (with abstractions like “food” and “species”) and the human civilization as Beta (with abstractions like “luxury” and “love” and “culture”).
Notice one important fact: the abstractions Beta creates are not, in general, easy to predict from the abstractions already in Alpha. "A multicellular organism" or "an immune-system virus" do not naturally fall out of descriptions of geological formations and atmospheric conditions. They're highly contingent abstractions, ones that are very sensitive to the exact conditions in which they formed. (Biochemistry, the broad biosphere the system is embedded in...)
Similarly, things like “culture” or “eudaimonia” or “personal identity”, the way humans understand them, don’t easily fall out of even the abstractions present in the biosphere. They’re highly contingent on the particulars of how human minds and bodies are structured, how they exchange information, et cetera.
In particular: humans, despite being dropped into an abstraction-rich environment, did not learn values that just mirror some abstraction present in the environment. We’re not wrapper-minds single-mindedly pursuing procreation, or the eradication of predators, or the maximization of the number of stars. Similarly, animals don’t learn values like “compress gasses”.
What Beta creates are altogether new abstractions, defined in terms of complicated mixes of Alpha's abstractions. And if Beta is the sort of system that learns values, it learns values that wildly mix the abstractions present in Beta. These new abstractions are indeed, once formed, natural abstractions in their own right. But they're not necessarily "simple" in terms of Alpha's abstractions.
And now we come to the question of what values an AGI would learn. I would posit that, on the current ML paradigm, the setup is the basic Alpha-and-Beta setup, with the human civilization being Alpha and the AGI being Beta.
Yes, there are some natural abstractions in Alpha, like “eudaimonia”. But to think that the AGI would just naturally latch onto that single natural abstraction, and define its entire value system over it, is analogous to thinking that animals would explicitly optimize for gas-compression, or humans for predator-elimination or procreation.
I instead strongly expect that the story would just repeat. The training process (or whatever process spits out the AGI) would end up creating some extremely specific conditions in which the AGI is learning the values. Its values would then necessarily be some complicated functions over weird mixes of the abstractions-natural-to-the-dataset-it’s-trained-on, with their specifics being highly contingent on some invisible-to-us details of that process.
It would not be just “eudaimonia”, it’d be some weird nonlinear function of eudaimonia and a random grab-bag of other things, including the “Beta-specific” abstractions that formed within the AGI over the course of training. And the output would not necessarily have anything to do with “eudaimonia” in any recognizable way, the way “avoid predators” is unrecognizable in terms of “rocks” and “aerodynamics”, and “human values” are unrecognizable in terms of “avoid predators” or “maximize children”.
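To gesture at what I mean by "some weird nonlinear function", here's a purely illustrative toy with made-up feature names and weights, pretending for a moment that the learned features were legible to us:

```python
import math

def learned_value(features: dict[str, float]) -> float:
    """Toy value head: a nonlinear mix of an Alpha-level abstraction
    ("eudaimonia") with Beta-specific abstractions that only formed
    during training. The output isn't recognizable as optimizing any
    one of its inputs."""
    e = features["eudaimonia"]            # natural abstraction in the training data
    q = features["quirk_of_training_17"]  # Beta-specific, invisible to us
    r = features["reward_proxy_3"]        # another training-process artifact
    # Interactions and saturations make the output track none of the
    # inputs individually.
    return math.tanh(0.3 * e * q - 1.7 * r ** 2) + 0.05 * e

print(learned_value({"eudaimonia": 1.0,
                     "quirk_of_training_17": -2.0,
                     "reward_proxy_3": 0.4}))
```

Everything above is hypothetical, of course; the actual learned features would be illegible, which is the problem.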
I feel like the difference between your Alpha-and-Beta examples and my examples is mediated by data control: in your examples, Alpha has basically no control over Beta's data at all, while in mine we have far more control over what data is learned by the AI.
I think the key crux is whether we have much more control over AI data sources than evolution.
If I agreed with you that we would have essentially no control over what data the AI gets, I'd be a lot more worried. But I don't think this is true: I expect future AIs, including AGIs, to be a lot more built than grown, and a lot of their data to be very carefully controlled via synthetic data, for simple capabilities reasons; that same control can also be used for alignment strategies.
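As a minimal sketch of what I mean by "carefully controlled via synthetic data" (the generator and the filter criterion here are hypothetical placeholders, not a claim about any real pipeline):

```python
from dataclasses import dataclass

@dataclass
class Example:
    text: str
    passes_spec: bool  # e.g. judged by a separate reviewer model or rubric

def generate_candidates(n: int) -> list[Example]:
    """Hypothetical generator: in a real pipeline this would be a teacher
    model producing synthetic training text."""
    return [Example(text=f"synthetic example {i}", passes_spec=(i % 3 != 0))
            for i in range(n)]

def curate(candidates: list[Example]) -> list[Example]:
    """The point of the argument: the trainer chooses what Beta sees.
    Only examples meeting the (alignment-relevant) spec enter training."""
    return [ex for ex in candidates if ex.passes_spec]

training_set = curate(generate_candidates(9))
print(len(training_set), "of 9 candidates admitted to the training corpus")
```

The specifics don't matter; what matters is that, unlike evolution, the trainer sits between the world and what Beta actually learns from.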
I think another disagreement is that I basically don't buy the evolution analogy for DL, and I think there are some deep disanalogies (the big one, again, being how much more control we have over data sources than evolution did, and this is only set to increase with synthetic data).
So I basically don’t expect this to happen:
I instead strongly expect that the story would just repeat. The training process (or whatever process spits out the AGI) would end up creating some extremely specific conditions in which the AGI is learning the values. Its values would then necessarily be some complicated functions over weird mixes of the abstractions-natural-to-the-dataset-it’s-trained-on, with their specifics being highly contingent on some invisible-to-us details of that process.
Pretty much all of your examples rely on the Alpha being unable to control the data learnt by Beta, and if this isn’t the case, your examples break down.
I don’t think the way you split things up into Alpha and Beta quite carves things at the joints. If you take an individual human as Beta, then stuff like “eudaimonia” is in Alpha—it’s a concept in the cultural environment that we get exposed to and sometimes come to value. The vast majority of an individual human’s values are not new abstractions that we develop over the course of our training process (for most people at least).
Basically people tend to value stuff they perceive in the biophysical environment and stuff they learn about through the social environment.
So that reduces the complexity of the problem—it’s not a matter of designing a learning algorithm that both derives and comes to value human abstractions from observations of gas particles or whatever. That’s not what humans do either.
Okay then, why aren't we star-maximizers or number-of-nation-states maximizers? Obviously it's not just a matter of learning about the concept. The details of how we get values hooked up to an AGI's motivations will depend on the particular AGI design, but will probably involve reward, prompting, scaffolding, or the like.