The basic point is that the stuff we try to gesture towards as “human values,” or even “human actions” is not going to automatically be modeled by the AI.
Some examples of ways to model the world without using the-abstraction-I’d-call-human-values:
Humans are homeostatic mechanisms that maintain internal balance of oxygen, water, and a myriad of vitamins and minerals.
Humans are piloted by a collection of shards—it’s the shards that want things, not the human.
Human society is a growing organism modeled by some approximate differential equations.
The human body is a collection of atoms that want to obey the laws of physics.
Humans are agents that navigate the world and have values—and those values exactly correspond to economic revealed preferences.
Humans-plus-clothes-and-cell-phones are agents that navigate the world and have values...
And so on—there’s just a lot of ways to think about the world, including the parts of the world containing humans.
The obvious problem this creates is for getting our “detailed values” by just querying a pre-trained world model with human data or human-related prompts: If the pre-trained world model defaults to one of these many other ways of thinking about humans, it’s going to answer us using the wrong abstraction. Fixing this requires the world model to be responsive to how humans want to be modeled. It can’t be trained without caring about humans and then have the caring bolted on later.
But this creates a less-obvious problem for empowerment, too. What we call “empowerment” depends on what part of the world we are calling the “agent,” what its modeled action-space is, etc. The AI that says “The human body is a collection of atoms that want to obey the laws of physics” is going to think of “empowerment” very differently than the AI that says “Humans-plus-clothes-and-cell-phones are agents.” Even leaving aside concerns like Steve’s over whether empowerment is what we want, most of our intuitive thinking about it relies on the AI sharing our notion of what it’s supposed to be empowering, which doesn’t happen by default.
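(To make the dependence concrete: in the standard information-theoretic formalization of empowerment (Klyubin, Polani, and Nehaniv’s channel capacity from an agent’s n-step actions to its later state), both the action variable and the state variable are defined by how we carve the “agent” out of the world. This may not be exactly the notion of empowerment at stake here, but it illustrates the point:

$$\mathfrak{E}_n(s_t) \;=\; \max_{p(a_t^n)}\; I\!\left(A_t^n \,;\, S_{t+n} \,\middle|\, s_t\right)$$

Choose “collection of atoms” and the action variable $A_t^n$ barely exists; choose “human-plus-phone” and it includes everything the phone can do. The number you compute, and the distribution over actions that maximizes it, changes accordingly.)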
The basic point is that the stuff we try to gesture towards as “human values,” or even “human actions” is not going to automatically be modeled by the AI.
I disagree, and have already spent some words arguing for why (section 4.1, and earlier precursors) - so I’m curious what specifically you disagree with there? But I’m also getting the impression you are talking about a fundamentally different type of AGI.
I’m discussing future DL-based AGI which is—to first approximation—just a virtual brain. As argued in sections 2/3, current DL models already are increasingly like brain modules. So your various examples are simply not how human brains are likely to model other human brains and their values. All the concepts you mention—homeostatic mechanisms, ‘shards’, differential equations, atoms, economic revealed preferences, cell phones, etc.—are high-level linguistic abstractions that are not much related to how the brain’s neural nets actually model/simulate other humans/agents. This must obviously be true because empathy/altruism existed long before the human concepts you mention.
The obvious problem this creates is for getting our “detailed values” by just querying a pre-trained world model with human data or human-related prompts:
You seem to be thinking of the AGI as some sort of language model which we query? But that’s just a piece of the brain, and not even the most relevant part for alignment. AGI will be a full brain equivalent, including the modules dedicated to long term planning, empathic simulation/modeling, etc.
Even leaving aside concerns like Steve’s over whether empowerment is what we want, most of our intuitive thinking about it relies on the AI sharing our notion of what it’s supposed to be empowering, which doesn’t happen by default.
Again, for successful brain-like AGI this just isn’t an issue (assuming human brains model the empowerment of others as a sort of efficient approximate bound).
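(A minimal sketch of what such a bound might look like, assuming roughly deterministic dynamics: the channel capacity above collapses to counting distinct reachable states,

$$\mathfrak{E}_n(s_t) \;\approx\; \log\big|\{\, s : s \text{ reachable from } s_t \text{ in } n \text{ steps} \,\}\big|,$$

so modeling someone’s empowerment amounts to roughly tracking how many futures they can still reach, rather than computing an exact mutual information. Whether brains do anything like this is of course speculative.)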
Upon more thought, I definitely agree with you more, but still sort of disagree.
You’re absolutely right that I wasn’t actually thinking about the kind of AI you were talking about. And evolution does reliably teach animals to have theory of mind. And if the training environment is at least sorta like our ancestral environment, it does seem natural that an AI would learn to draw the boundary around humans more or less the same way we do.
But our evolved theory of mind capabilities are still fairly anthropocentric, suited to the needs, interests, and capabilities of our ancestors, even when we can extend them a bit using abstract reasoning. Evolving an AI in a non-Earth environment in a non-human ecological niche, or optimizing an AI using an algorithm that diverges from evolution (e.g. by allowing more memorization of neuron weights) would give you different sorts of theories of mind.
Aside: I disagree that the examples I gave in the previous comment require verbal reasoning. They can be used nonverbally just fine. But using a model doesn’t feel like using a model, it feels like perceiving the world. E.g. I might say “birds fly south to avoid winter,” which sounds like a mere statement but actually inserts my own convenient model of the world (where “winter” is a natural concept) into a statement about birds’ goals.
An AI that’s missing some way of understanding the world that humans find natural might construct a model of our values that’s missing entire dimensions. Or an AI that understands the world in ways we don’t might naturally construct a model of our values that has a bunch of distinctions and details where we would make none.
What it means to “empower” some agent does seem more convergent than that. Maybe not perfectly convergent (e.g. evaluating social empowerment seems pretty mixed-up with subtle human instincts), but enough that I have changed my mind, and am no longer most concerned about the AI simply failing to locate the abstraction we’re trying to point to.
So it sounds like we are now actually mostly in agreement.
I agree there may be difficulties learning and grounding accurate mental models of human motivations/values into the AGI, but that is more reason to take the brain-like path with anthropomorphic AGI. Still, I hedge between directly emulating human empathy/altruism vs using external empowerment. External empowerment may be simpler/easier to specify and thus more robust against failures to match human value learning more directly, but it also has its own potential specific failure modes (the AGI would want to keep you alive and wealthy, but it may not care about your suffering/pain as much as we’d like). But I also suspect it could turn out that human value learning follows a path of increasing generalization and robustness, starting with oldbrain social instincts as a proxy to ground newbrain empathy learning, which eventually generalizes widely to something more like external empowerment. At least some humans generally optimize for the well-being (non-suffering) and possibly empowerment of animals, and that expanded/generalized circle of empathy will likely include AI, even if it doesn’t obviously mimic human emotions.