Subcortical reinforcement circuits, though, hail from a distinct informational world… and so have to reinforce computations “blindly,” relying only on simple sensory proxies.
This seems to be pointing in an interesting direction that I’d like to see expanded.
Because your subcortical reward circuitry was hardwired by your genome, it’s going to be quite bad at accurately assigning credit to shards.
I don’t know; I think of the brain as doing credit assignment pretty well, but we may have quite different definitions of good and bad. Is there an example you were thinking of? Cognitive biases in general?
if shard theory is true, meaningful partial alignment successes are possible
“if shard theory is true”—is this a question about human intelligence, deep RL agents, or the relationship between the two? How can the hypothesis be tested?
Even if the human shards only win a small fraction of the blended utility function, a small fraction of our lightcone is quite a lot
What’s to stop the human shards from being dominated and extinguished by the non-human shards? I.e., is there reason to expect equilibrium?
I don’t know; I think of the brain as doing credit assignment pretty well, but we may have quite different definitions of good and bad. Is there an example you were thinking of?
Say that the triggers for pleasure are hardwired. After a pleasurable event, how do only the computations that actually led to the pleasure (and not other computations that merely happened to be running) get strengthened? After all, the pleasure circuit is hardwired, and can’t reason causally about which thoughts led to which outcomes.
(I’m not currently confident that pleasure is exactly the same thing as reinforcement, but the two are probably closely related, and pleasure is a nice and concrete thing to discuss.)
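To make the worry concrete, here is a toy sketch (my own illustration, not anything from the post) of what “blind” reinforcement might look like if it works through something like eligibility traces: a hardwired scalar reward strengthens whatever computations were recently active, with no causal model of which thought actually produced the pleasure. The circuit names and constants below are made up for illustration.

```python
import random

class Circuit:
    """Stand-in for one computation running in the brain."""
    def __init__(self, name, strength=1.0):
        self.name = name
        self.strength = strength   # propensity to fire in the future
        self.eligibility = 0.0     # decaying trace of recent activity

TRACE_DECAY = 0.8     # how fast the eligibility trace fades
LEARNING_RATE = 0.1
N_STEPS = 1000

circuits = [Circuit("seek-sugar"), Circuit("daydream"), Circuit("look-at-caregiver")]

for _ in range(N_STEPS):
    total = sum(c.strength for c in circuits)
    for c in circuits:
        c.eligibility *= TRACE_DECAY
        # Each circuit fires with probability proportional to its strength.
        if random.random() < c.strength / total:
            c.eligibility += 1.0

    # Hardwired trigger: reward fires only when the sweet-taste proxy occurs,
    # which in this toy world only recent "seek-sugar" activity produces.
    sugar = next(c for c in circuits if c.name == "seek-sugar")
    reward = 1.0 if sugar.eligibility > 0.5 else 0.0

    # "Blind" credit assignment: every recently active circuit is strengthened
    # in proportion to its trace, including bystanders like "daydream" that
    # merely happened to be running when the reward arrived.
    for c in circuits:
        c.strength += LEARNING_RATE * reward * c.eligibility

for c in circuits:
    print(f"{c.name}: strength {c.strength:.2f}")
```

Run long enough, “seek-sugar” dominates because its activity reliably precedes the reward, but the bystander circuits also get strengthened whenever they happen to co-fire, which is the sense in which this kind of hardwired credit assignment is both workable and sloppy.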
What’s to stop the human shards from being dominated and extinguished by the non-human shards? I.e., is there reason to expect equilibrium?
Nothing except those shards fighting for their own interests and succeeding to some extent.
You probably have many contending values that you hang on to now, and would even be pretty careful with write access to your own values, for instrumental convergence reasons. If you mostly expect outcomes where one shard eats all the others, why do you have a complex balance of values rather than a single core value?
If you mostly expect outcomes where one shard eats all the others, why do you have a complex balance of values rather than a single core value?
There’s a further question: “How do people behave when they’re given more power over and understanding of their internal cognitive structures?” That could actually resolve as “people collapse onto one part of their values.” I just think it won’t resolve that way.