One of the classic sketches of a utility-maximizing agent (“observation-utility maximizer”, cf Stable Pointers to Value) involves a “utility function box”. A proposal subsystem feeds a possible course-of-action and resulting expected future state of the world into the utility-function box, and the box says whether that plan is a good idea or bad idea. Repeat for many possible proposals, and then the agent executes the best proposal according to the utility-function box.
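(For concreteness, here's a minimal toy sketch of that loop in Python. The function names and the brute-force propose-and-score structure are just my own illustrative framing, not anything from the linked post:)

```python
# Toy sketch of the "utility function box" agent described above: a proposal
# subsystem generates candidate plans, an epistemic subsystem predicts the
# resulting world state, and the box scores each predicted state.

def choose_action(propose_plan, predict_outcome, utility_box, n_proposals=100):
    """Return the plan whose predicted outcome the utility box rates highest."""
    best_plan, best_score = None, float("-inf")
    for _ in range(n_proposals):
        plan = propose_plan()                    # candidate course of action
        predicted_state = predict_outcome(plan)  # "epistemic subsystem"
        score = utility_box(predicted_state)     # fixed evaluator: good idea or bad?
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan
```

The point to notice is that the utility-function box itself is fixed; everything else in the loop exists to serve it.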
For the above agent, you’d say red = the utility function box, and blue = an “epistemic subsystem” within the source code designed to make increasingly accurate predictions of the future state of the world (which, again, are part of what gets fed into the box). Black is present because power is instrumentally useful towards red; blue is also present because knowledge is instrumentally useful towards red (and therefore highly-capable agents do not generally self-modify to sabotage their “epistemic subsystems”, but rather, if anything, self-modify to make them work ever better); white is absent unless red happens to point at it; and green is absent altogether. I think this is related to your observation that “anti-realist rationality struggles to capture” green (and white).
I think the human brain is close to that picture, but with a couple of edits that are key to the algorithms actually working in practice, and those edits bring green into the picture.
The first edit is that the human “utility function box” starts blank/random and is edited over time by a different system (centrally involving the brainstem and “innate drives”). I think this is closely related to how a reward function continually sculpts a value function via TD learning in actor-critic RL. (The correspondence is “value function” ↔ “utility function box”, “reward function” ↔ “a different system”.) (For example, the trained AlphaZero value function “sees the beauty of such-and-such abstract patterns in Go pieces”, so to speak, even though that appreciation wasn’t in the AlphaZero source code; and by the same token the hippie “sees the beauty in the cycle of life”, especially after an acid trip, even if there’s nothing particularly related to that preference in the human genome.) (More details in §9.5 here.)
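(To make that correspondence concrete, here's a toy tabular TD(0) sketch. It's my own illustration with made-up names, not from the post: the learned value table plays the role of the “utility function box”, and the fixed reward function plays the role of the “different system” that keeps sculpting it.)

```python
# One TD(0) step: nudge value[state] toward reward + gamma * value[next_state].
# The value table starts empty and only acquires structure through experience,
# the way the "utility function box" above starts blank/random and gets edited over time.

def td_update(value, state, reward, next_state, alpha=0.1, gamma=0.99):
    td_target = reward + gamma * value.get(next_state, 0.0)
    td_error = td_target - value.get(state, 0.0)
    value[state] = value.get(state, 0.0) + alpha * td_error
    return value

value = {}  # the initially-blank "utility function box"

# Repeatedly experiencing (state, reward, next_state) transitions gradually
# writes preferences into the table that were never in the source code:
for state, reward, next_state in [("A", 0.0, "B"), ("B", 1.0, "terminal")] * 50:
    td_update(value, state, reward, next_state)
```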
The second edit is that the human equivalent of a “utility function box” can find anything appealing, not just a possible state of the world in the distant future. (Specifically, it can grow to “like” any latent variable in the current world-model, see The Pointers Problem or §9.2 here.) That’s why humans can find themselves motivated by intuitive rules, virtues, norms, etc., and not just “outcomes”—see my discussion at Consequentialism & Corrigibility.
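(A sketch of that second edit, again in my own illustrative framing: instead of only scoring a predicted future world-state, the learned evaluator assigns weight to arbitrary latent variables in the *current* world-model, which is how rules, virtues, norms, etc. can become directly motivating. The variable names below are made up for illustration.)

```python
# Score a world-model snapshot as a weighted sum over whichever latent variables
# the box has come to care about, whether or not they are outcome-shaped.

def learned_value(world_model_latents, learned_weights):
    return sum(learned_weights.get(name, 0.0) * activation
               for name, activation in world_model_latents.items())

# e.g. "I am following the norm of honesty" can carry weight directly,
# alongside (or instead of) outcome-flavored latents like "my children thrive":
latents = {"I_am_being_honest": 1.0, "my_children_thrive": 0.8}
weights = {"I_am_being_honest": 0.5, "my_children_thrive": 2.0}
score = learned_value(latents, weights)
```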
Putting these two together, we find that (1) humans inevitably have experiences that change our goals and values, (2) it’s possible for humans to come to intrinsically value the existence of such experiences, and thus e.g. write essays about the profound importance of attunement.
Of course, (2) is not the only possibility; another possibility is winding up with an agent that does have preferences about the future state of the world, and where those preferences are sufficiently strong that the agent self-modifies to stop any further (1) (cf. instrumental convergence).
Humans can go both ways about (1). Sometimes we see (1) as bad (I don’t want to get addicted to heroin, I don’t want to get brainwashed, I don’t want to stop caring about my children), and sometimes we see (1) as good (I generally like the idea that my personality and preferences will continue evolving with time and life experience etc.). This essay talks a lot about the latter but doesn’t mention the former (nothing wrong with that, it’s just interesting).
I’m not trying to prove any particular point here, just riffing/chatting.
One other interesting quirk of your model of green is that most of the central (and natural) examples of green for humans seem to involve the utility function box adapting to these stimulating experiences so that its output is positively correlated with the way latent variables change over the course of an experience. In other words, the utility function gets “attuned” to the result of that experience.
For instance, taking the Zadie Smith example from the essay, her experience of greenness involved starting to appreciate the effect that Mitchell’s music had on her, as opposed to starting to dislike it. Environmentalist greenness, in the same vein, might arise from humans’ utility boxes attuning to the current processes of life, leading to a wish for them to continue.
Notably, I can’t really think of any examples where green alone goes against one of these processes in humans; most examples of people being “attuned” against an experience seem to be caused by it simply conflicting with a separate, already-existing goal. While disgust does technically conflict with what changes during the process of becoming physically sick, I can’t think of any reason that might occur other than how it prevents a human from achieving goals (black, or perhaps red). Desires for immortality, while conflicting with the process of death, seem to mostly extend from a red desire to have things continue to live (which, by this model, would itself be a desire that stemmed from green). If I attempt to think of some experience that would be completely uncorrelated with a human’s prior preferences (e.g. an infant looking at a flowing river for the first time from a distance), it doesn’t seem natural to imagine the human suddenly disliking it in any particular circumstance (the infant wouldn’t start despising flowing rivers), but I could still see a small chance of it beginning to appreciate it (as long as I’m not missing some obvious counterexample).
This natural positive correlation (or “attunement”, “appreciation”, or for especially spicy takes, “alignment”), if I had to guess, could be explained by humans simply gaining reward from expanding their world models (maybe this just simplifies to “humans naturally like learning”, but that feels a little anticlimactically blue). Alternatively, attunement as opposed to indifference might only be created by some minor positive association generated from deeper levels in the brain, although that would imply that green could just be a consequence of red for a human’s utility function box.
For example, if you didn’t know that walking near a wasp nest is a bad idea, and then you do so, then I guess you could say “some part of the world comes forward … strangely new, and shining with meaning”, because from now on into the future, whenever you see a wasp nest, it will pop out with a new salient meaning “Gah those things suck”.
You wouldn’t use the word “attunement” for that obviously. “Attunement” is one of those words that can only refer to good things by definition, just as the word “contamination” can only refer to bad things by definition (detailed discussion here).
I gesture vaguely at Morality as Fixed Computation, moral realism, utility-function blackboxes, Learning What to Value, and the general idea of discerning an initially-uncertain utility function about which it may not be straightforward to get definitive information.