I appreciate you writing your quick thoughts on this. I have a few primary reactions, and then I’ll detail specific reactions.
I agree there are difficulties in this plan template (and said so in the original post; I know you didn’t say I didn’t say so, but I’m adding this here for clarity).
I don’t know why you think this isn’t progress, because this plan’s problems seem to just… go away, if they’re solved?
Like, if you figure out how to form an appropriate diamond abstraction, then that’s that. Congrats, that part of the story is checked off.
By contrast, if your plan relies on a more robust reward model, then there’s always another way to hack it.
I surmise that you don’t get this feeling from the post, or think I’m sweeping problems under some rug, but I don’t know where the rug is supposed to be. (Maybe the “unnaturality” disagreement?)
Of your three points, I think
(1: Won’t get AGI) seems wrong for this particular plan but also not cruxy to me. (Was this meant to apply to shard theory more broadly? If so, why?)
EDIT 11/25: Also, it doesn’t seem like a big problem to me if it were true that “When you compromise and start putting it in environments where it needs to be able to think to succeed, then your new reward-signals end up promoting all sorts of internal goals that aren’t particularly about diamond, but are instead about understanding the world and/or making efficient use of internal memory and/or suchlike.” Seems fine if some of the agent’s values are around understanding the world and suchlike.
(2: Proxy goal formation) seems like one of my main worries, and also quite surmountable.
(3: The values blow up when it gets smart) I agree that the reflection process seems sensitive in some ways. I also give the straightforward reason why the diamond-values shouldn’t blow up: Because that leads to fewer diamonds. I think this a priori case is pretty strong, but agree that there should at least be a lot more serious thinking here, eg a mathematical theory of value coalitions.
I think some of your critiques were covered in my original post. It was a long post, so no worries if you just missed them.
The first “problem” with this plan is that you don’t get an AGI this way. You get an unintelligent robot that steers towards diamonds. If you keep trying to have the training be about diamonds, it never particularly learns to think. When you compromise and start putting it in environments where it needs to be able to think to succeed, then your new reward-signals end up promoting all sorts of internal goals that aren’t particularly about diamond, but are instead about understanding the world and/or making efficient use of internal memory and/or suchlike.
Hm, doesn’t it need to think in the curriculum I described in the OP?
Produce a curriculum of tasks, from walking to a diamond, to winning simulated chess games [self-play], to solving increasingly difficult real-world mazes, and so on. After each task completion, the agent gets to be near some diamond and receives reward.
For further detail, take an arbitrary task with a high skill ceiling and a legible end condition, give it some reward shaping and use self-play if appropriate, and put a diamond at the end and give the agent reward. I agree that even in successful stories, the agent also develops non-diamond shards.
Here’s a consideration for why training might produce an AGI, which I realized after writing the story. Given relevant features, it’s often trivial for even linear models to outperform experts (see Statistical Prediction Rules Out-Perform Expert Human Judgments). What I remember to be a common hypothesis: Human experts are often good at finding features to pay attention to (e.g. patient weight) but bad at setting regression coefficients to come to a decision.
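To make the regression-coefficient point concrete, here is a minimal sketch on synthetic data. It is not taken from the cited article: the ground-truth weights, the hypothetical expert's hand-set weights, the noise level, and the function names are all assumptions chosen for illustration. The "expert" uses the right two features but sets one coefficient badly, while least squares fits the coefficients from data.

```python
import random

TRUE_WEIGHTS = (2.0, -1.0)    # assumed ground truth, for illustration only
EXPERT_WEIGHTS = (1.0, -1.0)  # hypothetical expert: right features, one miscalibrated coefficient


def make_data(n, rng):
    """Synthetic rows (x1, x2, y) with y linear in the features plus Gaussian noise."""
    rows = []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        y = TRUE_WEIGHTS[0] * x1 + TRUE_WEIGHTS[1] * x2 + rng.gauss(0, 0.5)
        rows.append((x1, x2, y))
    return rows


def fit_least_squares(rows):
    """Solve the 2x2 normal equations for a no-intercept linear model."""
    s11 = sum(x1 * x1 for x1, _, _ in rows)
    s12 = sum(x1 * x2 for x1, x2, _ in rows)
    s22 = sum(x2 * x2 for _, x2, _ in rows)
    b1 = sum(x1 * y for x1, _, y in rows)
    b2 = sum(x2 * y for _, x2, y in rows)
    det = s11 * s22 - s12 * s12
    return ((b1 * s22 - b2 * s12) / det, (b2 * s11 - b1 * s12) / det)


def mse(weights, rows):
    """Mean squared error of a linear prediction with the given weights."""
    w1, w2 = weights
    return sum((w1 * x1 + w2 * x2 - y) ** 2 for x1, x2, y in rows) / len(rows)


rng = random.Random(0)
train, held_out = make_data(200, rng), make_data(200, rng)
fitted = fit_least_squares(train)
expert_error = mse(EXPERT_WEIGHTS, held_out)
fitted_error = mse(fitted, held_out)
```

Both predictors see the same features; only the coefficients differ, which is the part the hypothesis says experts are bad at.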
Analogously, consider an SSL+IL initialization in which the AI has imitatively learned sophisticated subroutines for perception, prediction, and action, such that the AI can imitate human-level performance on supervised training distribution (eg navigating mazes). Then PG-style RL finetuning might rearrange and reweight what subroutines to use when, efficiently finding a better subroutine arrangement for decision-making in a range of situations. And thereby doing better than human expert demonstrators.
(Yes, this is sample inefficient, and I didn’t particularly optimize the story for sample efficiency. I focused on telling any story at all which has the desired alignment outcome.)
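As a toy illustration of "rearrange and reweight what subroutines to use when", the sketch below treats the imitation-learned policy as a softmax over three fixed subroutines per context and runs gradient ascent on expected reward. Everything here is an assumption for illustration: the contexts, the reward numbers, the initial logits, and the use of the exact policy gradient (rather than sampled PG updates) so that the result is deterministic. It is not the training setup from the story.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, rewards):
    """Expected reward of the softmax policy over fixed subroutines."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def finetune(logits, rewards, lr=1.0, steps=2000):
    """Exact policy-gradient ascent: d J / d theta_a = pi(a) * (r(a) - J)."""
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        value = sum(p * r for p, r in zip(probs, rewards))
        logits = [t + lr * p * (r - value) for t, p, r in zip(logits, probs, rewards)]
    return logits


# Hypothetical expected reward of each of three fixed subroutines, per context.
REWARDS = {"maze": [0.2, 0.9, 0.5], "chess": [0.8, 0.3, 0.1]}
# Hypothetical imitation-learned preferences: the demonstrators favored a
# decent but suboptimal subroutine in each context.
IMITATED = {"maze": [0.0, 0.0, 1.0], "chess": [0.0, 1.0, 0.0]}

TUNED = {ctx: finetune(IMITATED[ctx], REWARDS[ctx]) for ctx in REWARDS}
```

The subroutines themselves never change; only the context-dependent weights on them do, and that alone moves the policy past what it imitated.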
insofar as you were able to get some sort of internalized diamond-ish goal, if you’re not really careful then you end up getting lots of subgoals such as ones about glittering things, and stones cut in stylized ways, and proximity to diamond rather than presence of diamond, and so on and so forth.
Why “rather than” instead of “in addition to”? Are you just stating your belief here, or did you mean to argue for it? Maybe you’re saying “It’s hard to get the diamond shard to form properly”, which I agree with and it’s a primary way I expect the story to go wrong. I think that relatively simple interventions will plausibly solve this problem, though, and so consider this more of a research question than a fatal flaw in the training story template.
As far as I can tell, the “reflection” section of TurnTrout’s essay says ~nothing that addresses this, and amounts to “the agent will become able to tell that it has shards”. OK, sure, it has shards, but only some of them are diamond-related, and many others are cognition-related or suchlike. I don’t see any argument that reflection will result in the AI settling at “maximize diamond” in-particular.
If I read you properly, that’s not the relevant section. The relevant sections are the next two: The agent prevents value drift and The values handshake. EG I said:
If the agent still is primarily diamond-motivated, it now wants to stay that way by instrumental convergence. That is, if the AI considers a plan which it knows causes value drift away from diamonds, then the AI reflectively predicts the plan leads to fewer diamonds, and so the AI doesn’t choose that plan! The agent knows the consequences of value drift and it takes a more careful approach to future updating.
I think there’s a very straightforward case here. In the relevant context, suppose the agent is primarily making decisions on the basis of whether they lead to more or fewer diamonds. The agent considers adopting a reflectively stable utility function which doesn’t produce diamonds. The agent doesn’t choose this plan because it doesn’t lead to diamonds.
I agree that there are ways this can go wrong, some of which you highlight. But the a priori argument makes me expect that, all else equal and conditional on a strong diamond shard at time of values handshake, the agent will probably equilibrate to making lots of diamonds.
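The a priori argument can be sketched as a toy bidding model. This is only an illustration under loud assumptions: the `Shard` class, the additive strength-times-approval bidding rule, the plan names, and all numbers are hypothetical, and nothing here is claimed to be how a real agent's shards combine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Shard:
    name: str
    strength: float                               # how strongly this shard influences decisions
    evaluate: Callable[[Dict[str, float]], float]  # approval of a predicted outcome

    def bid(self, outcome: Dict[str, float]) -> float:
        return self.strength * self.evaluate(outcome)


def choose_plan(shards: List[Shard], plans: Dict[str, Dict[str, float]]) -> str:
    """Pick the plan with the highest total bid (an assumed aggregation rule)."""
    return max(plans, key=lambda name: sum(s.bid(plans[name]) for s in shards))


# The agent's own predictions about where each plan leads (made-up numbers).
PLANS = {
    "keep current values": {"diamonds": 100.0, "knowledge": 10.0},
    "adopt drifted utility function": {"diamonds": 5.0, "knowledge": 40.0},
}


def make_shards(diamond_strength: float) -> List[Shard]:
    return [
        Shard("diamond", diamond_strength, lambda o: o["diamonds"]),
        Shard("curiosity", 0.5, lambda o: o["knowledge"]),
    ]


STRONG_CHOICE = choose_plan(make_shards(1.0), PLANS)  # strong diamond shard
WEAK_CHOICE = choose_plan(make_shards(0.1), PLANS)    # weak diamond shard
```

With a strong diamond shard, the drifting plan is bid down because it is predicted to yield fewer diamonds. With a weak one it wins, which is why the claim is conditional on a strong diamond shard at the time of the values handshake.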
I’ll note that the diamond maximization problem is not in fact the problem “build an AI that makes a little diamond”
I did not claim to be solving the diamond maximization problem, but maybe you wanted to add your own take here? As I wrote in the original post, I think “maximize diamonds” is a seriously mistaken subproblem choice:
I think that pure diamond maximizers are anti-natural, and at least not the first kind of successful story we should try to tell...
I think that “get an agent which reflectively equilibrates to optimizing a single commonly considered quantity like ‘diamonds’” is probably extremely hard and anti-natural. I think MIRI should not have chosen this as a subproblem.
I also think that relaxing the problem by assuming hypercomputation encourages thinking about argmax search, which I think is a subtle but serious trap. For specific generalizable reasons which I’ll soon post about, this design pattern seems basically impossible to align compared to shard agents.
because the optimum of the shattered correlates of the training objectives that it gets are likely to involve tiling the universe with something that isn’t actually diamond, even if you’re lucky-enough that it got a diamond-shard at all, which is dubious)
Really? That seems wrong. Suppose that, at the time of the values handshake, the agent has a strong diamond-shard. I understand you to predict that the agent adopts a reflective utility function which, when optimized, won’t lead to actual diamond. Why? Why wouldn’t the diamond-shard just bid this plan down, because it doesn’t lead to actual diamond?
even if it works a little, it doesn’t seem to me to be teaching us any of the insights that would be possessed by someone who knew how to robustly aim an idealized unbounded (or even hypercomputing) cognitive system in theory.
In addition to my “unbounded/hypercomputing is a red herring” response:
Someone who says “You can reliably solve computer vision tasks by doing deep learning” isn’t telling you how to write superhumanly good features into the vision model, surpassing previous hand-designed expert attempts. They don’t know how the SOTA deep vision models will work internally. And yet it’s still good advice. It’s still telling you something about how to train good vision models.
Similarly, if you’re in a state of ignorance (lethality 19) about how to reliably point any cognitive system to any latent parts of reality, and someone proposes a plan which does plausibly (for specific reasons, not as a vague “it could work” hope) produce an AI which makes lots of real-world diamonds, then that seems like progress to me. (I’m fine agreeing to disagree here, I don’t think it’s productive to dispute how much credit I should get.)
Smaller points:
In a similar fashion, the other “shards” that the shard theory folk want to learn are unnatural too.
I think it would make more sense to claim that niceness / other shards are “contingent” instead of “unnatural.” If shard theory is correct, shards are literally natural in that they are found in nature as the predictable outcome of human value formation. Same for niceness.
little correlates-of-training-objectives that it latched onto in order to have a gradient up to general intelligence, blow the whole plan sky-high once it starts to reflect.
You call shards “little correlates” and, previously, “ad-hoc internalized correlates.” I don’t know what you intend to communicate by this. The shards are, mechanistically speaking, contextually activated influences on the agent’s decision-making. What information does “ad-hoc” or “little correlate” add to that picture? I’m currently guessing that it expresses your skepticism that shards can cohere into reflectively stable caring?
Or consider the conflict “I really enjoy dunking on the outgroup (but have some niggling sense of unease about this)” — we can’t conclude from the fact that the enjoyment of dunking is loud, whereas the niggling doubt is quiet, that the dunking-on-the-outgroup value will be the one left standing after reflection.
This is an interesting example. To me, the more relevant questions seem to be: How much evidence is “loudness” (e.g. if I really enjoy something which I do frequently, I sure am more likely to reflectively endorse it compared to if I didn’t enjoy it, even though there are highly available counterexamples to this tendency), and how relevant is this for the diamond story?
EDIT: As I think I wrote in the OP, it’s not enough for a shard to be strongly influencing decision-making in a given context. Especially for an anti-outgroup shard which is unendorsed (eg bids for outcomes which other reflectively aware shards bid against), this shard also seemingly has to be reflectively and broadly activated in order to be retained. So, yeah, if there’s an anti-outgroup shard which gets “maneuvered around and removed” by other shards, sure, that can happen. My takeaway isn’t “anything can get removed for hard-to-understand reasons”, but rather “one particular way shards can get removed is that they directly conflict with other powerful shards.”
I think a diamond-manufacturing subshard would resource-conflict (instrumental conflict, not terminal conflict) with eg a power-seeking subshard (manufacturing diamonds uses energy). Or even against a staple-manufacturing subshard (staples require materials and energy). But I expect the reflective utility function to reflect gains from intershard trade and specialization of different parts of the future resources towards the different decision-making influences (eg maybe one kind of comet is better specialized for making staples, and another kind for diamonds).
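The comet example is ordinary comparative advantage, and a few lines of arithmetic show the gain. The comet types, yields, and unit counts below are made-up numbers for illustration, not anything from the original story.

```python
# Hypothetical yields per unit of comet material (made-up numbers).
YIELDS = {
    "icy comet": {"staples": 3, "diamonds": 1},
    "carbon comet": {"staples": 1, "diamonds": 3},
}
UNITS = {"icy comet": 10, "carbon comet": 10}


def output(allocation):
    """allocation[comet][good] = units of that comet devoted to that good."""
    totals = {"staples": 0, "diamonds": 0}
    for comet, split in allocation.items():
        for good, units in split.items():
            totals[good] += units * YIELDS[comet][good]
    return totals


# No trade: each shard gets half of every comet type.
even_split = {c: {"staples": UNITS[c] / 2, "diamonds": UNITS[c] / 2} for c in UNITS}
# Trade: each comet type goes entirely to the shard it is best suited for.
specialized = {
    "icy comet": {"staples": 10, "diamonds": 0},
    "carbon comet": {"staples": 0, "diamonds": 10},
}

EVEN = output(even_split)
SPECIALIZED = output(specialized)
```

Under these numbers the even split yields 20 staples and 20 diamonds, while specialization yields 30 of each, so both shards do better by trading than by splitting every resource down the middle.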
Or maybe not. Maybe it goes some other way. But this kind of conflict seems different from anticorrelated terminal value (eg anti-outgroup can impinge on nice-shards, altruism-shards, empathy...) across a shard power imbalance (nonreflective anti-outgroup vs reflective niceness shard).
And my point here isn’t “I have now defused the general class of objection, checkmate!”… It’s still a live and legit worry to me, but I don’t view this phenomenon as incomprehensible, and I don’t feel epistemically helpless here (not meaning to make claims about how you feel tbc).
(My take on the reflective stability part of this)
The reflective equilibrium of a shard theoretic agent isn’t a utility function weighted according to each of the shards; it’s a utility function that mostly cares about some extrapolation of the (one or very few) shard(s) that were most tied to the reflective cognition.
It feels like a ‘let’s do science’ or ‘powerseek’ shard would be a lot more privileged, because these shards will be tied to the internal planning structure that ends up doing reflection for the first time.
There’s a huge difference between “Whenever I see ice cream, I have the urge to eat it”, and “Eating ice cream is a fundamentally morally valuable atomic action”. The former roughly describes one of the shards that I have, and the latter is something that I don’t expect to see in my CEV. Similarly, I imagine that a bunch of the safety properties will look more like these urges because the shards will be relatively weak things that are bolted on to the main part of the cognition, not things that bid on the intelligent planning part. The non-reflectively endorsed shards will be seen as arbitrary code that is attached to the mind that the reflectively endorsed shards have to plan around (similar to how I see my “Whenever I see ice cream, I have the urge to eat it” shard).
In other words: there is convergent pressure for CEV-content integrity, but that does not mean that the current way of making decisions (e.g. shards) is close to the CEV optimum, and the shards will choose to self-modify to become closer to their CEV.
I don’t feel epistemically helpless here either, and would love a theory of which shards get preserved under reflection.
Or consider the conflict “I really enjoy dunking on the outgroup (but have some niggling sense of unease about this)” — we can’t conclude from the fact that the enjoyment of dunking is loud, whereas the niggling doubt is quiet, that the dunking-on-the-outgroup value will be the one left standing after reflection.
Assuming shard theory is basically correct, this aspect of Nate’s story can be resolved by viewing self-reflection as a context like any other. If you put the system in a training setup which causes it to self-reflect, and reward it when it comes to the ‘more diamonds’ conclusion, then this should cause it to reflectively want more diamonds.
The only question is, how much does training it to max diamonds in maze finding cause the ‘max diamonds’ shard to be activated while in the self-reflecting context?
Also, notably, it will definitely be doing a modicum of self-reflection during the normal course of training, as the shards which do self-reflection will steer the future towards locations which reinforce their weight.
TurnTrout’s proposal seems to me to be basically “train it around diamonds, do some reward-shaping, and hope that at least some care-about-diamonds makes it across the gap”.
I read a connotation here like “TurnTrout isn’t proposing anything sufficiently new and impressive.” To be clear, I don’t think I’m proposing an awesome new alignment technique. I’m instead proposing that we don’t need one.