Great post! I think it’s very good for alignment researchers to be this level of concrete about their plans, it helps enormously in a bunch of ways e.g. for evaluating the plan.
Comments as I go along:
Why wouldn’t the agent want to just find an adversarial input to its diamond abstraction, which makes it activate unusually strongly? (I think that agents might accidentally do this a bit for optimizer’s curse reasons, but not that strongly. More in an upcoming post.)
Consider why you wouldn’t do this for “hanging out with friends.” Consider the expected consequences of the plan “find an adversarial input to my own evaluation procedure such that I find a plan which future-me maximally evaluates as letting me ‘hang out with my friends’.” I currently predict that such a plan would lead future-me to daydream and not actually hang out with my friends, as present-me evaluates the abstract expected consequences of that plan. My friend-shard doesn’t like that plan, because I’m not hanging out with my friends. So I don’t search for an adversarial input. I infer that I don’t want to find those inputs because I don’t expect those inputs to lead me to actually hang out with my friends a lot, as I presently evaluate the abstract-plan consequences.
I don’t think an agent can consider searching for adversarial inputs to its shards without also being reflective, at which point the agent realizes the plan is dumb as evaluated by the current shards assessing the predicted plan-consequences provided by the reflective world-model.
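The reflective-rejection argument above might be caricatured in code. Everything here (the toy world model, the shard’s scoring rule, the numbers) is a hypothetical sketch of the claimed dynamic, not anything from the post:

```python
# Made-up toy: a shard scores plans by their *predicted real consequences*
# under the agent's world model, not by how strongly a plan would make the
# shard's own evaluator fire.

def world_model(plan: str) -> dict:
    """Crude stand-in for the agent's predictions of what each plan leads to."""
    predictions = {
        "hang out with friends": {"time_with_friends": 5},
        # Adversarially optimizing your own evaluator is predicted to
        # yield a wireheaded daydream, not actual friend-time.
        "find adversarial input to my friend-evaluator": {"time_with_friends": 0},
    }
    return predictions[plan]

def friend_shard_score(predicted_outcome: dict) -> int:
    # The shard cares about modeled friend-time in the predicted outcome.
    return predicted_outcome["time_with_friends"]

plans = ["hang out with friends",
         "find adversarial input to my friend-evaluator"]
best = max(plans, key=lambda p: friend_shard_score(world_model(p)))
print(best)  # → "hang out with friends"
```

The point the sketch tries to capture: because the adversarial-search plan is scored against the world model’s predicted consequences, a plan that merely makes the evaluator fire hard, without producing modeled friend-time, loses.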
How is the bolded sentence different from the following:
“Consider the expected consequences of the plan “think a lot longer and harder, considering a lot more possibilities for what you should do, and then make your decision.” I currently predict that such a plan would lead future-me to waste his life doing philosophy or maybe get Pascal’s mugged by some longtermist AI bullshit instead of actually helping people with his donations. My helping-people shard doesn’t like this plan, because it predicts abstractly that thinking a lot more will not result in helping people more.”
(Basically I’m saying you should think more, and then write more, about the difference between these two cases because they seem plausibly on a spectrum to me, and this should make us nervous in a couple of ways. Are we actually being really stupid by being EAs and shutting up and calculating? Have we basically adversarial-exampled ourselves away from doing things that we actually thought were altruistic and effective back in the day? If not, what’s different about the kind of extended search process we did, from the logical extension of that which is to do an even more extended search process, a sufficiently extreme search process that outsiders would call the result an adversarial example?)
In particular, even though online self-supervised learning continues to develop the world model and create more advanced concepts, the reward events also keep pinging the invocation of the diamond-abstraction as responsible for reward (because insofar as the agent’s diamond-shard guides its decisions, then the diamond-shard’s diamond-abstraction is in fact responsible for the agent getting reward). The diamond-abstraction gradient starves the AI from exclusively acting on the basis of possible advanced “alien” abstractions which would otherwise have replaced the diamond abstraction. The diamond shard already gets reward effectively, integrating with the rest of the agent’s world model and recurrent state, and therefore provides “job security” for the diamond-abstraction. (And once the agent is smart enough, it will want to preserve its diamond abstraction, insofar as that is necessary for the agent to keep achieving its current goals which involve prototypical-diamonds.)
Are you sure that’s how it works? Seems plausible to me but I’m a bit nervous, I think it could totally turn out to not work like that. (That is, it could turn out that the agent wanting to preserve its diamond abstraction is the only thing that halts the march towards more and more alien-yet-effective abstractions)
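The quoted “job security” dynamic can be caricatured as a rich-get-richer loop. This is a made-up toy (hypothetical strengths and update rule), and it also makes the worry visible: the loop protects the diamond-abstraction only for as long as that abstraction keeps being the one that acts:

```python
# Made-up toy dynamics for the quoted "job security" claim: the
# abstraction that actually drives rewarded decisions keeps receiving
# credit, so gradient never flows to a would-be replacement.

strengths = {"diamond-abstraction": 1.0, "alien-abstraction": 0.2}
LEARNING_RATE = 0.1

def acting_abstraction() -> str:
    # The agent decides using whichever abstraction is currently strongest.
    return max(strengths, key=strengths.get)

for _ in range(100):
    used = acting_abstraction()
    reward = 1.0  # the diamond-shard's decisions do in fact secure reward
    # Credit assignment: only the abstraction that produced the decision
    # is strengthened; the unused one is gradient-starved.
    strengths[used] += LEARNING_RATE * reward

print(acting_abstraction())  # → "diamond-abstraction"
```

Note that the alien abstraction’s strength never changes here, which is the gradient-starvation claim in miniature; whether real training dynamics behave this way is exactly what is in question.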
Suppose the AI keeps training, but by instrumental convergence, seeking power remains a good idea, and such decisions continually get strengthened. This strengthens the power-seeking shard relative to other shards. Other shards want to prevent this from happening.
you go on to talk about shards eventually values-handshaking with each other. While I agree that shard theory is a big improvement over the models that came before it (which I call rational agent model and bag o’ heuristics model) I think shard theory currently has a big hole in the middle that mirrors the hole between bag o’ heuristics and rational agents. Namely, shard theory currently basically seems to be saying “At first, you get very simple shards, like the following examples: IF diamond-nearby THEN goto diamond. Then, eventually, you have a bunch of competing shards that are best modelled as rational agents; they have beliefs and desires of their own, and even negotiate with each other!” My response is “but what happens in the middle? Seems super important! Also haven’t you just reproduced the problem but inside the head?” (The problem being, when modelling AGI we always understood that it would start out being just a crappy bag of heuristics and end up a scary rational agent, but what happens in between was a big and important mystery. Shard theory boldly strides into that dark spot in our model… and then reproduces it in miniature! Progress, I guess.)
“Consider the expected consequences of the plan “think a lot longer and harder, considering a lot more possibilities for what you should do, and then make your decision.” I currently predict that such a plan would lead future-me to waste his life doing philosophy or maybe get Pascal’s mugged by some longtermist AI bullshit instead of actually helping people with his donations. My helping-people shard doesn’t like this plan, because it predicts abstractly that thinking a lot more will not result in helping people more.”
(Basically I’m saying you should think more, and then write more, about the difference between these two cases because they seem plausibly on a spectrum to me, and this should make us nervous in a couple of ways. Are we actually being really stupid by being EAs and shutting up and calculating? Have we basically adversarial-exampled ourselves away from doing things that we actually thought were altruistic and effective back in the day? If not, what’s different about the kind of extended search process we did, from the logical extension of that which is to do an even more extended search process, a sufficiently extreme search process that outsiders would call the result an adversarial example?)
I think there are several things happening. Here are some:
If an EA-to-be (“EAlice”, let’s say) in fact thought that EA would make her waste her life on bullshit, but went ahead anyways, she subjectively made a mistake.
Were her expectations correct? That’s another question. I personally think that AI ruin is real, it’s not low-probability Pascal’s Mugging BS, it’s default-outcome IMO.
I think many EAs are making distortionary values choices.
There is a socially easy way to quash parts of yourself which don’t have immediate sophisticated-sounding arguments backing them up.
But whatever values you do have (eg caring about your family), whatever caring you originally developed (via RL, according to shard theory), didn’t originally come via some grand consequentialist or game-theoretic argument about happiness or freedom.
So why should other values, like “avoiding spiders” or “taking time to relax”, have to justify themselves? They’re part of my utility function, so to speak! That’s not up for grabs!
I care more about my mom than other people’s moms. Sue me!
I agree with much of Self-Integrity and the Drowning Child.
I think a bunch of this has to do with meta-ethics, not with adversarial examples to values.
It might be that your original “helping people” values are not what your old value-distribution would have reflectively endorsed. Like, maybe you were just prioritizing your friends and neighbors, but if you’d ever really thought about it, your reflective strong broadly activated shard coalition would have ruled “hey let’s care about faraway people more.”
EG cooperation + happiness + empathy + fairness + local-helping shard → generalize by creating a global-helping shard
Or maybe EA did in fact trick EAlice and socially pressure and reshape her into a new being for whom this is reflectively endorsed.
EG social shard → global-helping shard
Although EA is in fact selecting for people against whom its arguments constitute (weak) adversarial inputs. I don’t think the selection is that strong? Confused here.
EDIT: One of the main threads is Don’t design agents which exploit adversarial inputs. The point isn’t that people can’t or don’t fall victim to plans which, by virtue of spurious appeal to a person’s value shards, cause the person to unwisely pursue the plan. The point here is that (I claim) intelligent people convergently want to avoid this happening to them.
A diamond-shard will not try to find adversarial inputs to itself. That was my original point, and I think it stands.
I think I agree with everything you said yet still feel confused. My question/objection/issue was not so much “How do you explain people sometimes falling victim to plans which spuriously appeal to their value shards!?!? Checkmate!” but rather “what does it mean for an appeal to be spurious? What is the difference between just thinking long and hard about what to do vs. adversarially selecting a plan that’ll appeal to you? Isn’t the former going to in effect basically equal the latter, thanks to extremal Goodhart? In the limit where you consider all possible plans (maximum optimization power), aren’t they the same?”
Yes, that’s a good question. This is what I’ve been aiming to answer with recent posts.
What is the difference between just thinking long and hard about what to do vs. adversarially selecting a plan that’ll appeal to you? Isn’t the former going to in effect basically equal the latter, thanks to extremal Goodhart? In the limit where you consider all possible plans (maximum optimization power), aren’t they the same?”
(I’m presently confident the answer is “no”, as might be clear from my comments and posts!)
OK, guess I’ll go read those posts then...
How is the bolded sentence different from the following:
“Consider the expected consequences of the plan “think a lot longer and harder, considering a lot more possibilities for what you should do, and then make your decision.” I currently predict that such a plan would lead future-me to waste his life doing philosophy or maybe get Pascal’s mugged by some longtermist AI bullshit instead of actually helping people with his donations. My helping-people shard doesn’t like this plan, because it predicts abstractly that thinking a lot more will not result in helping people more.”
(Basically I’m saying you should think more, and then write more, about the difference between these two cases because they seem plausibly on a spectrum to me, and this should make us nervous in a couple of ways. Are we actually being really stupid by being EAs and shutting up and calculating? Have we basically adversarial-exampled ourselves away from doing things that we actually thought were altruistic and effective back in the day? If not, what’s different about the kind of extended search process we did, from the logical extension of that which is to do an even more extended search process, a sufficiently extreme search process that outsiders would call the result an adversarial example?)
I think this is a great observation. I thought about it a bit and don’t really find myself worried, based off of some intuitions which I think would take me at least 20 minutes to type up right now, and I really should wrap my commenting up for now. Feel free to ping me if no one else has answered this in a while.
Consider yourself pinged! No rush to reply though.
Seems plausible to me but I’m a bit nervous, I think it could totally turn out to not work like that.
Agreed.
Shard theory seems more evidentially supported than bag-o-heuristics theory and rational agent theory, but that’s a pretty low bar! I expect a new theory to come along which is as much of an improvement over shard theory as shard theory is over those.
Re the 5 open questions: Yeah 4 and 5 seem like the hard ones to me.
Anyhow, in conclusion, nice work & I look forward to reading future developments. (Now I’ll go read the other comments)
shard theory currently basically seems to be saying “At first, you get very simple shards, like the following examples: IF diamond-nearby THEN goto diamond. Then, eventually, you have a bunch of competing shards that are best modelled as rational agents;[1] they have beliefs and desires of their own, and even negotiate with each other!” My response is “but what happens in the middle? Seems super important! Also haven’t you just reproduced the problem but inside the head?”
I think the hole is somewhat smaller than you make out, but still substantial. From The shard theory of human values:
when the baby has a proto-world model, the reinforcement learning process takes advantage of that new machinery by further developing the juice-tasting heuristics. Suppose the baby models the room as containing juice within reach but out of sight. Then, the baby happens to turn around, which activates the already-trained reflex heuristic of “grab and drink juice you see in front of you.” In this scenario, “turn around to see the juice” preceded execution of “grab and drink the juice which is in front of me”, and so the baby is reinforced for turning around to grab the juice in situations where the baby models the juice as behind herself.
By this process, repeated many times, the baby learns how to associate world model concepts (e.g. “the juice is behind me”) with the heuristics responsible for reward (e.g. “turn around” and “grab and drink the juice which is in front of me”). Both parts of that sequence are reinforced. In this way, the contextual-heuristics become intertwined with the budding world model.
[...]
While all of this is happening, many different shards of value are also growing, since the human reward system offers a range of feedback signals. Many subroutines are being learned, many heuristics are developing, and many proto-preferences are taking root. At this point, the brain learns a crude planning algorithm, because proto-planning subshards (e.g. IF motor-command-5214 predicted to bring a juice pouch into view, THEN execute) would be reinforced for their contributions to activating the various hardcoded reward circuits. This proto-planning is learnable because most of the machinery was already developed by the self-supervised predictive learning, when e.g. learning to predict the consequences of motor commands (see Appendix A.1).
The planner has to decide on a coherent plan of action. That is, micro-incoherences (turn towards juice, but then turn back towards a friendly adult, but then turn back towards the juice, ad nauseam) should generally be penalized away. Somehow, the plan has to be coherent, integrating several conflicting shards. We find it useful to view this integrative process as a kind of “bidding.” For example, when the juice-shard activates, the shard fires in a way which would have historically increased the probability of executing plans which led to juice pouches. We’ll say that the juice-shard is bidding for plans which involve juice consumption (according to the world model), and perhaps bidding against plans without juice consumption.
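The bidding process quoted above might be sketched like this, with made-up shard weights and plan names, and under the simplifying assumption (which the post doesn’t commit to) that bids sum linearly:

```python
# Made-up sketch of the quoted "bidding" view: each shard bids on
# candidate plans according to the world model's predicted consequences,
# and the planner commits to the plan with the highest net bid.

def predict(plan: str) -> dict:
    # Stand-in world model: predicted outcome features for each plan.
    outcomes = {
        "walk to juice pouch": {"juice": 1, "near_adult": 0},
        "walk to friendly adult": {"juice": 0, "near_adult": 1},
        # Micro-incoherent dithering achieves neither outcome.
        "dither between both": {"juice": 0, "near_adult": 0},
    }
    return outcomes[plan]

def juice_shard_bid(outcome: dict) -> float:
    # Bids for plans which (per the world model) lead to juice consumption.
    return 2.0 * outcome["juice"]

def social_shard_bid(outcome: dict) -> float:
    return 1.0 * outcome["near_adult"]

def choose(plans: list[str]) -> str:
    # Integration step: sum the shards' bids on each plan's predicted outcome.
    def net_bid(plan: str) -> float:
        outcome = predict(plan)
        return juice_shard_bid(outcome) + social_shard_bid(outcome)
    return max(plans, key=net_bid)

chosen = choose(["walk to juice pouch", "walk to friendly adult",
                 "dither between both"])
print(chosen)  # → "walk to juice pouch"
```

In this toy, incoherent dithering attracts no bids at all, which is one way to read the claim that micro-incoherences get penalized away.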
I have some more models beyond what I’ve shared publicly, and eg one of my MATS applicants proposed an interesting story for how the novelty-shard forms, and also proposed one tack of research for answering how value negotiation shakes out (which is admittedly at the end of the gap). But overall I agree that there’s a substantial gap here. I’ve been working on writing out pseudocode for what shard-based reflective planning might look like.
[1] I think they aren’t quite best modelled as rational agents, but I’m confused about what axes they are agentic along and what they aren’t.