Very nice people don’t usually search for maximally-nice outcomes — they don’t consider plans like “killing my really mean neighbor so as to increase average niceness over time.” I think there are a range of reasons for this plan not being generated. Here’s one.
Consider a person with a niceness-shard. This might look like an aggregation of subshards/subroutines like “if person nearby and person.state==sad, sample plan generator for ways to make them happy” and “bid upwards on plans which lead to people being happier and more respectful, according to my world model.” In mental contexts where this shard is strongly activated, it exerts a large influence on the planning process.
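One toy way to picture this — every name and data structure here is a hypothetical illustration, not a real shard-theory implementation — is a shard as a bundle of small (condition, action) subroutines firing over a mental context:

```python
# Hypothetical sketch: a "shard" as a bundle of small (condition, action)
# subroutines over a mental context. The names and structure are
# illustrative only, not a real implementation.

def make_niceness_shard():
    """Return a list of (condition, action) subshards."""

    def sad_person_nearby(ctx):
        person = ctx.get("nearby_person")
        return person is not None and person["state"] == "sad"

    def sample_cheering_plans(ctx):
        # "...sample plan generator for ways to make them happy"
        ctx["plan_queries"].append("make " + ctx["nearby_person"]["name"] + " happy")

    def any_candidate_plans(ctx):
        return bool(ctx["candidate_plans"])

    def bid_up_nice_plans(ctx):
        # "bid upwards on plans which lead to people being happier..."
        for plan in ctx["candidate_plans"]:
            if plan["predicted_happiness_delta"] > 0:
                plan["bid"] += 1.0

    return [
        (sad_person_nearby, sample_cheering_plans),
        (any_candidate_plans, bid_up_nice_plans),
    ]

def run_shard(subshards, ctx):
    # Fire every subshard whose condition holds in this mental context.
    for condition, action in subshards:
        if condition(ctx):
            action(ctx)
```

On this picture, the shard’s influence in a given mental context is just how many of its subshards happen to fire there — which is one way to cash out “very influential in some contexts.”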
However, people are not just made up of a grader and a plan-generator/actor — they are not just “the plan-generating part” and “the plan-grading part.” The next sampled plan modification, the next internal-monologue-thought to have — these are influenced and steered by e.g. the niceness shard. If the next macrostep of reasoning is about e.g. hurting people, well — the niceness shard is activated, and will bid down on this.
The niceness shard isn’t just bidding over outcomes, it’s bidding on next thoughts (on my understanding of how this works). And so these thoughts would get bid down, and a thought that is painful to consider produces a slight negative reinforcement event. This means that violent plan-modifications are eventually not sampled at all in contexts where the niceness shard would bid them away.
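A toy sketch of that dynamic — the numbers, the learning rate, and the `niceness_bid` rule are all made up for illustration: the shard bids on candidate next thoughts, each negatively-bid thought loses a bit of sampling weight, and a thought whose weight hits zero is never sampled again.

```python
import random

def niceness_bid(thought):
    # Illustrative stand-in for the shard's bid on a candidate thought:
    # bid down thoughts about hurting people.
    return -1.0 if "hurt" in thought else 0.0

def sample_thought(weights, rng):
    thoughts = list(weights)
    return rng.choices(thoughts, weights=[weights[t] for t in thoughts], k=1)[0]

def think(weights, steps, rng, lr=0.25):
    for _ in range(steps):
        thought = sample_thought(weights, rng)
        # A negatively-bid (painful-to-consider) thought is a small
        # negative reinforcement event: its sampling weight shrinks,
        # bottoming out at zero -- after which it is never sampled again.
        weights[thought] = max(weights[thought] + lr * niceness_bid(thought), 0.0)
    return weights
```

Run long enough, the violent thought’s weight reaches zero while the nice thought’s weight is untouched — the plan-modification stops being generated at all, rather than being generated and then vetoed.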
So nice people aren’t just searching for “nice outcomes.” They’re nicely searching for nice outcomes.
(This breaks the infinite regress of saying “there’s a utility function over niceness, and preferences over how to think next thoughts about niceness, and meta-preferences about how to have preferences about next thoughts...” — eventually cognition must ground out in small computations which are not themselves utility functions or preferences or maximization!)
Thanks to discussions with Peli Grietzer about this idea of his (“praxis values”). Praxis values involve “doing X X-ingly.” I hope he publishes his thoughts here soon, because I’ve found them enlightening.
The niceness shard isn’t just bidding over outcomes, it’s bidding on next thoughts (on my understanding of how this works). And so these thoughts would get bid down
Seems similar to how I conceptualize this paper’s approach to controlling text generation models using gradients from classifiers. You can think of the niceness shard as implementing a classifier for “is this plan nice?”, and updating the latent planning state in directions that make the classifier more inclined to say “yes”.
The linked paper does something similar, but with a trained classifier and actual gradient descent, updating the LM’s token representations. Of particular note: the classifiers used in the paper are pretty weak (trained on ~500 examples) and not at all adversarially robust, yet the approach still works for controlling text generation.
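A toy analogue of that shape — not the paper’s actual method; the dimensions, weights, and the logistic “niceness classifier” are all invented here: a linear classifier scores a latent planning state, and we nudge the latent along the gradient of log P(nice) so the classifier becomes more inclined to say “yes.”

```python
import numpy as np

def p_nice(z, w, b):
    # Toy logistic "niceness classifier" over a latent planning state z.
    return 1.0 / (1.0 + np.exp(-(w @ z + b)))

def steer_latent(z, w, b, steps=50, step_size=0.5):
    # Nudge the latent along the gradient of log P(nice | z), so the
    # classifier becomes more inclined to say "yes".
    z = z.copy()
    for _ in range(steps):
        # For a logistic classifier, d log p / dz = (1 - p) * w.
        z += step_size * (1.0 - p_nice(z, w, b)) * w
    return z
```

The paper does this with a real classifier head and backprop through transformer activations; the sketch just shows the “classifier gradient steers the state” structure that the niceness-shard analogy points at.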
I wonder if inserting shards into an AI is really just that straightforward?
But I guess that instrumental convergence will still eventually lead to either
all shards acquiring more and more instrumental structure (neuronal weights within shards getting optimized for that), or
directly instrumental shards taking more and more weight overall.
One can see this in ordinary human adult development. The heuristics children use are simpler and more of the type “searching for nice things in nice ways,” or even seeing everything through a niceness lens, while adults have more pure strategies, e.g., planning as a shard of its own. Most humans just die before they reach convergence. And there are probably other aspects too. Enlightenment may be a state where pure shards become an option.