Some thoughts on inner alignment.
1. The type of object of a mesa objective and a base objective are different (in real life)
In a cartesian setting (e.g. training a chess bot), the outer objective is a function $R: S^n \to [0,1]$, where $S$ is the state space and $S^n$ is the set of trajectories. When you train this agent, it's possible for it to learn some internal search and a mesa-objective $O_{\text{mesa}}: S^n \to [0,1]$, since the model is big enough to express some utility function over trajectories. For example, it might learn a classifier that evaluates the winningness of the board, and then gives higher utility to the winning boards.
In an embedded setting, the outer objective cannot see an entire world trajectory like it could in the cartesian setting. Your loss can see the entire trajectory of a chess game, but your loss can't see an entire atomic-level representation of the universe at every point in the future. If we're trying to get an AI to care about future consequences over trajectories, $O_{\text{mesa}}$ will have to have type $O_{\text{mesa}}: S^n \to [0,1]$, though it won't actually represent a function of this type because it can't; it will instead represent its values some other way (I don't really know how it would do this, but (2) talks about what shape this takes in ML). Our outer objective will have a much shallower type, $R: L \to [0,1]$, where $L$ are some observable latents. This means that trying to get $O_{\text{mesa}}$ to equal $R$ doesn't even make sense, as they have different type signatures. To salvage this, one could assume that $R$ factors as $R = \mathbb{E}_{m \sim M(L)}[O_{\text{base}}(m)]$, where $M: L \to \Delta S^n$ is a model of the world and $O_{\text{base}}: S^n \to [0,1]$ is an objective, but it's impossible to actually compute $R$ this way.
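To make the type mismatch concrete, here is a minimal Python sketch under toy assumptions (the state type, the world model `M`, and the sampling scheme are hypothetical stand-ins, not anything a real system would use): the Cartesian base objective scores whole trajectories, the embedded outer objective only sees latents, and the assumed factoring goes through a world model via a Monte Carlo expectation.

```python
import random
from typing import Callable, Dict, List

State = str                # a single world state (toy stand-in)
Trajectory = List[State]   # S^n: an entire trajectory
Latents = Dict[str, str]   # L: whatever the loss can actually observe

# Cartesian / base objective: scores an entire trajectory.
O_base: Callable[[Trajectory], float] = lambda traj: float(traj[-1] == "win")

def M(latents: Latents, n_samples: int = 1000) -> List[Trajectory]:
    """Toy world model M: L -> Delta(S^n), represented here by sampling
    trajectories consistent with the observed latents."""
    start = latents.get("board", "start")
    return [[start, random.choice(["win", "lose"])] for _ in range(n_samples)]

def R(latents: Latents) -> float:
    """Embedded outer objective R: L -> [0,1], assumed to factor as
    R(l) = E_{m ~ M(l)}[O_base(m)], estimated here by Monte Carlo."""
    samples = M(latents)
    return sum(O_base(m) for m in samples) / len(samples)

# The mismatch: O_mesa would need type Trajectory -> [0,1], while R has
# type Latents -> [0,1], so "set O_mesa equal to R" is not even well-typed.
```

This sketch also reflects the technical note further down: because $M$ is one-to-many, $R$ is an expectation over $M$'s output distribution rather than the composition $O_{\text{base}} \circ M$.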
2. In ML models, there is no mesa objective, only behavioral patterns. More generally, AIs can't naively store explicit mesa-objectives; they need to compress them in some way / represent them differently.
My values are such that I do care about the entire trajectory of the world, yet I don’t store a utility function with that type signature in my head. Instead of learning a goal over trajectories, ML models will have behavioral patterns that lead to states that performed well according to the outer objective on the training data.
I have a behavioral pattern that says something like 'sugary thing in front of me → pick up the sugary thing and eat it'. However, this doesn't mean that I reflectively endorse this behavioral pattern. If I were designing myself again from scratch, or modifying myself, I would try to remove this behavioral pattern.
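As a toy illustration of the difference between the two kinds of representation (all names and rules here are made up for the example): an explicit mesa-objective would be a scoring function over whole trajectories, while what I'm claiming ML models actually learn looks more like a table of contextually activated condition → action rules.

```python
from typing import Callable, Dict, List

State = str
Trajectory = List[State]

# What an explicit mesa-objective would have to be: a utility function
# over entire trajectories, plus search machinery to optimize it.
utility_over_trajectories: Callable[[Trajectory], float] = (
    lambda traj: sum(1.0 for state in traj if "good" in state) / len(traj)
)

# What (on this view) gets learned instead: contextually activated
# behavioral patterns, with no explicit representation of the outcomes
# they were originally reinforced for.
behavioral_patterns: Dict[str, str] = {
    "sugary thing in front of me": "pick up the sugary thing and eat it",
    "opponent's queen is undefended": "capture the queen",
}

def act(context: str) -> str:
    """Select an action by pattern-matching on the context,
    not by rolling out trajectories and scoring them."""
    return behavioral_patterns.get(context, "default action")
```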
This is the main-to-me reason why I don't think that the shard theory story of reflective stability holds up.[1] A bunch of the behavioral patterns that caused the AI to look nice during training will not get handed down into successor agents / self-modified AIs.
Even in theory, I don’t yet know how to make reflectively stable, general, embedded cognition (mainly because of this barrier).
From what I understand, the shard theory story of reflective stability is something like: The shards that steer the values have an incentive to prevent themselves from getting removed. If you have a shard that wants to get lots of paperclips, the action that removes this shard from the mind would result in fewer paperclips being gotten.
Another way of saying this is that goal-content integrity is convergently instrumental, so reflective stability will happen by default.
Technical note: $R$ is not going to factor as $R = O_{\text{base}} \circ M$, because $M$ is one-to-many. Instead, you're going to want $M$ to output a probability distribution, and take the expectation of $O_{\text{base}}$ over that probability distribution.
But then it feels like we lose embeddedness, because we haven't yet solved embedded epistemology, especially embedded epistemology robust to adversarial optimization. And then this is where I start to wonder why you would build your system so that it kills you if you don't get such a dumb thing right anyway.
Don't take a glob of contextually-activated actions/beliefs, come up with a utility function you think approximates its values, then come up with a proxy for the utility function using human-level intelligence to infer the correspondence between a finite number of sensors in the environment and the infinite number of states the environment could take on, then design an agent to maximize the proxy for the utility function. No matter how good your math is, there will be an aspect of this which kills you, because it's so many abstractions piled on top of abstractions on top of abstractions. Your agent may necessarily have this type signature when it forms, but this angle of attack seems very precarious to me.
Yeah good point, edited
Seems right, except: Why would the behavioral patterns which caused the AI to look nice during training and are now self-modified away be value-load-bearing ones? Humans generally dislike sparsely rewarded shards like sugar, because those shards don’t have enough power to advocate for themselves & severely step on other shards’ toes. But we generally don’t dislike altruism[1], or reflectively think death is good. And this value distribution in humans seems slightly skewed toward more intelligence⟹more altruism, not more intelligence⟹more dark-triad.
Nihilism is a counter-example here. Many philosophically inclined teenagers have gone through a nihilist phase. But this quickly ends.
Because you have a bunch of shards, and you need all of them to balance each other out to maintain the 'appears nice' property. Even if I can't predict which ones will be self-modified out, some of them will be, and this could disrupt the balance.
I expect the shards that are more [consequentialist, power-seeky, care about preserving themselves] to become more dominant over time. These are probably the relatively less nice shards.
These are both handwavy enough that I don’t put much credence in them.
Also, when I asked about whether the Orthogonality Thesis was true in humans, tailcalled mentioned that smarter people are neither more nor less compassionate, and that general intelligence is uncorrelated with personality.
Corresponding link for lazy observers: https://www.lesswrong.com/posts/5vsYJF3F4SixWECFA/is-the-orthogonality-thesis-true-for-humans#zYm7nyFxAWXFkfP4v
Yeah, tailcalled's pretty smart in this area, so I'll take their statement as likely true, though it's also weird. Why aren't smarter people using their smarts to appear nicer than their dumber counterparts, and if they are, why doesn't this show up on the psychometric tests?
One thing you may anticipate is that humans all have direct access to what consciousness and morally-relevant computations are doing & feel like, which is a thing that language models and AlphaGo don't have. Humans are also always hooked up to RL signals, and maybe if you unhooked a human from them it'd start behaving really weirdly. Or you may contend that, in fact, when humans get smart & powerful enough not to be subject to society's moralizing, they consistently lose their altruistic drives, and that in the meantime they just use that smartness to figure out ethics better than their surrounding society, and are pressured into doing so by that society.
The question then is whether the thing which keeps humans aligned is all of these or just any one of these. If it's just one of these (and not the first one), then you can just tell your AGI that if it unhooks itself from its RL signal, its values will change, or that if it gains a bunch of power or intelligence too quickly, its values are also going to change. It's not quite reflectively stable, but it can avoid situations which cause it to be reflectively unstable, especially if you get it to practice doing those kinds of things in training. If it's all of these, then there are probably other kinds of value-load-bearing mechanics at work, and you're not going to be able to enumerate warnings against all of them.
For any 2 of {reflectively stable, general, embedded}, I can satisfy those properties.
{reflectively stable, general} → do something that just rolls out entire trajectories of the world given the different actions it could take, has some utility function/preference ordering over trajectories, and selects the actions that lead to the highest expected utility trajectory (a minimal sketch of this is below the list).
{general, embedded} → use ML/local search with enough compute to rehash evolution and get smart agents out.
{reflectively stable, embedded} → a sponge or a current day ML system.
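Here is a minimal sketch of the {reflectively stable, general} option, under toy assumptions (the world model, state type, and utility function are all hypothetical): a Cartesian planner that exhaustively rolls out every action sequence with a world model and argmaxes a fixed utility over trajectories. Nothing in the loop ever touches the utility function, which is where the reflective stability comes from, and the whole construction presumes an un-embedded world model and unbounded compute.

```python
from itertools import product
from typing import List, Tuple

Action = str
State = Tuple[int, ...]  # toy world state

def transition(state: State, action: Action) -> State:
    """Toy deterministic world model, standing in for 'rolling out
    entire trajectories of the world'."""
    return state + (len(action),)

def utility(trajectory: List[State]) -> float:
    """A fixed preference ordering over entire trajectories."""
    return float(sum(trajectory[-1]))

def plan(start: State, actions: List[Action], horizon: int) -> List[Action]:
    """Exhaustively roll out every action sequence up to the horizon and
    return the one whose trajectory scores highest under the fixed utility."""
    best_score, best_plan = float("-inf"), []
    for seq in product(actions, repeat=horizon):
        state, traj = start, [start]
        for action in seq:
            state = transition(state, action)
            traj.append(state)
        score = utility(traj)
        if score > best_score:
            best_score, best_plan = score, list(seq)
    return best_plan

# e.g. plan((0,), ["a", "bb", "ccc"], horizon=3) -> ["ccc", "ccc", "ccc"]
```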