To be clear, I don’t think the exponential asymptotics specifically are obvious (sorry for implying that), but I also don’t think they’re all that load-bearing here. I intended more to gesture at the general cluster of reasons to expect “reward for proxy, get an agent which cares about the proxy”; there are lots of different sets of conditions, any of which would be sufficient for that result. Maybe we just train the agent for a long time on a wide variety of data. Maybe it turns out that SGD is surprisingly efficient, and usually finds a global optimum, so shards which don’t perfectly fit the proxy die out. Maybe the proxy is a more natural abstraction than the thing it was proxying for, and the dynamics between shards competing at decision-time are winner-take-all. Maybe dynamics between shards are winner-take-all for some other reason, and a shard which captures the proxy will always have at least a small selective advantage. Etc.
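As a toy illustration of the last two possibilities (winner-take-all dynamics plus a small selective advantage), here is a short Python sketch. The multiplicative update rule and the reinforcement rates are my own assumptions, chosen only to show how a tiny per-step edge compounds:

```python
# Toy model of "a small selective advantage compounds" (all numbers are
# illustrative assumptions, not measurements of real training dynamics).
# Two shards start equally strong; each episode multiplies a shard's strength
# by (1 + its reinforcement rate), and the shard that exactly captures the
# reward proxy gets a slightly higher rate.

proxy_shard = 1.0      # shard that perfectly fits the reward proxy
other_shard = 1.0      # shard that only approximately fits it

PROXY_RATE = 0.010     # hypothetical per-episode reinforcement for the proxy shard
OTHER_RATE = 0.009     # slightly lower rate for the imperfect shard

for _ in range(5000):
    proxy_shard *= 1 + PROXY_RATE
    other_shard *= 1 + OTHER_RATE

share = proxy_shard / (proxy_shard + other_shard)
print(f"proxy shard's share of total strength after 5000 episodes: {share:.3f}")
# Prints ~0.993: a ~0.1% relative edge per episode ends up with essentially
# all of the weight, i.e. winner-take-all dynamics from a tiny advantage.
```

Nothing here depends on the exact numbers; in this toy model any persistent edge gives the same qualitative outcome over a long enough training run.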
Point 3. Under my best guess of what you mean (which did end up being roughly right, about the exponential bit-divergence), I think your original comment seemed too optimistic about the original story going well given “perfect” labeling. This is one thing I meant by “I don’t understand why ‘perfect labeling’ would ensure your shard-formation counterarguments don’t hold.”
If the situation value-distribution is actually exponential in bit-divergence, I’d expect way less wiggle room on value shard formation, because that’s going to mean that way more situations are controlled by relatively few subshards (or maybe even just one). Possibly the agent just ends up with fewer terms in its reflectively stable utility function, because fewer shards/subshards activate during the values handshake.
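One way to spell out why an exponential shape concentrates control (a back-of-the-envelope argument I’m adding; the weighting below is an assumption, not something stated above): suppose subshard i’s influence on a situation falls off like 2^(-d_i) in its bit-divergence d_i, with the d_i distinct integers. Then the best-fitting subshard already carries at least half the total influence:

```latex
% Assumed toy weighting: subshard i gets influence w_i \propto 2^{-d_i},
% with the d_i distinct integers and d_{(1)} the smallest of them.
\frac{2^{-d_{(1)}}}{\sum_i 2^{-d_i}}
\;\ge\; \frac{2^{-d_{(1)}}}{\sum_{k=0}^{\infty} 2^{-(d_{(1)}+k)}}
\;=\; \frac{2^{-d_{(1)}}}{2 \cdot 2^{-d_{(1)}}}
\;=\; \frac{1}{2}
```

and the top handful carry nearly all of it, which is the “relatively few subshards (or maybe even just one)” picture.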
It sounds like the difference between one or a few shards dominating each decision, vs a large ensemble, is very central and cruxy to you. And I still don’t see why that matters, so maybe that’s the main place to focus.
You gestured at some intuitions about that in this comment (which I’m copying below to avoid scrolling to different parts of the thread-tree), and I’d be interested to see more of those intuitions extracted.
I think there’s something like “why are human values so ‘reasonable’, such that [TurnTrout inference alert!] someone can like coffee and another person won’t and that doesn’t mean they would extrapolate into bitter enemies until the end of Time?”, and the answer seems like it’s gonna be because they don’t have one criterion of Perfect Value that is exactly right over which they argmax, but rather they do embedded, reflective heuristic search guided by thousands of subshards (shiny objects, diamonds, gems, bright objects, objects, power, seeing diamonds, knowing you’re near a diamond, …), such that removing a single subshard does not catastrophically exit the regime of Perfect Value.
I think this is one proto-intuition why Goodhart-arguments seem Wrong to me, like they’re from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don’t think I’ve properly communicated my feelings in this comment, but hopefully it’s better than nothing)
I have multiple different disagreements with this, and I’m not sure which are relevant yet, so I’ll briefly state a few:
For the coffee/bitter enemies thing, this doesn’t seem to me like a phenomenon which has anything to do with shards; it’s just a matter of type-signatures. A person who “likes coffee” likes to drink coffee; they don’t particularly want to fill the universe with coffee, and they don’t particularly care whether anyone else likes to drink coffee (and nobody else cares whether they like to drink coffee), so there’s not really much reason for that preference to generate conflict. It’s not a disagreement over what-the-world-should-look-like; that’s not the type-signature of the preference.
Embedded, reflective heuristic search is not incompatible with argmaxing over one (approximate, implicit) value function; it’s just a particular family of distributed algorithms for argmaxing (see the sketch after these points).
It seems like, in humans, removing a single subshard does catastrophically exit the regime of value. For instance, there’s Eliezer’s argument from the sequences that just removing boredom results in a dystopia.
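To make the second point above concrete, here is a minimal sketch under assumptions of my own: the shard names, contexts, and weights are invented, and this is just one toy way that an ensemble of contextual influences can be read as argmaxing a single implicit value function.

```python
# Minimal sketch: a handful of contextual "shards" each score options only in
# contexts where they activate; the distributed procedure (sum the pulls, pick
# the option with the largest total) is exactly an argmax over one implicit,
# aggregate value function. Shard names, contexts, and weights are invented.

def coffee_shard(context, option):
    # Activates when tired; pulls toward drinking coffee.
    return 2.0 if context.get("tired") and option == "drink coffee" else 0.0

def frugality_shard(context, option):
    # Mild, context-independent pull away from spending money on coffee.
    return -1.0 if option == "drink coffee" else 0.5

SHARDS = [coffee_shard, frugality_shard]

def implicit_value(context, option):
    # The single (approximate, implicit) value function the ensemble argmaxes.
    return sum(shard(context, option) for shard in SHARDS)

def decide(context, options):
    return max(options, key=lambda option: implicit_value(context, option))

print(decide({"tired": True}, ["drink coffee", "drink water"]))   # drink coffee
print(decide({"tired": False}, ["drink coffee", "drink water"]))  # drink water
```

The point of the sketch is only that “many contextual influences, each nudging the choice” and “argmax over one summed value function” describe the same decision procedure.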
It sounds like the difference between one or a few shards dominating each decision, vs a large ensemble, is very central and cruxy to you. And I still don’t see why that matters, so maybe that’s the main place to focus.
The extremely basic intuition is that all else equal, the more interests present at a bargaining table, the greater the chance that some of the interests are aligned.
My values are also risk-averse (I’d much rather take a 100% chance of 10% of the lightcone than a 20% chance of 100% of the lightcone), and my best guess is that internal values handshakes are ~linear in “shard strength” after some cutoff where the shards are at all reflectively endorsed (my avoid-spiders shard might not appreciably shape my final reflectively stable values). So having more subshards seems like great news to me, all else equal, with more shard variety increasing the probability that part of the system is motivated the way I want it to be.
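Both quantitative intuitions in the two paragraphs above can be put into toy numbers; every parameter below (the per-shard alignment probability, the square-root utility) is an illustrative choice of mine, not something asserted in the comment.

```python
from math import sqrt

# 1. Bargaining-table intuition: if each shard independently has chance p of
#    being aligned with what I want, having more shards raises the chance that
#    at least one aligned interest is present at the table.
p = 0.05
for n in (1, 20, 100):
    print(f"{n:3d} shards -> P(at least one aligned) = {1 - (1 - p) ** n:.3f}")
# Prints roughly 0.050, 0.642, 0.994.

# 2. Risk aversion: a sure 10% of the lightcone vs a 20% shot at all of it.
#    The gamble wins on expected fraction (0.20 > 0.10), but any sufficiently
#    concave utility, e.g. u(x) = sqrt(x), prefers the certain outcome.
print(f"sure thing: {sqrt(0.10):.3f}")        # ~0.316
print(f"gamble:     {0.20 * sqrt(1.0):.3f}")  # 0.200
```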
(This isn’t fully expressing my intuition, here, but I figured I’d say at least a little something to your comment right now)
I’m not going to go into most of the rest now, but:
For the coffee/bitter enemies thing, this doesn’t seem to me like a phenomenon which has anything to do with shards; it’s just a matter of type-signatures. A person who “likes coffee” likes to drink coffee; they don’t particularly want to fill the universe with coffee, and they don’t particularly care whether anyone else likes to drink coffee (and nobody else cares whether they like to drink coffee), so there’s not really much reason for that preference to generate conflict. It’s not a disagreement over what-the-world-should-look-like; that’s not the type-signature of the preference.
I think that does have to do with shards. Liking to drink coffee is the result of a shard, a contextual influence on decision-making (the influence to drink coffee), which in particular activates in certain situations to pull me into a future in which I drank coffee.
I’m also fine considering C: “a person who is OK with other people drinking coffee” and anti-C: “a person with otherwise the same values but who isn’t OK with other people drinking coffee.” I think that the latter would inconvenience the former (to the extent that coffee was important to the former), but that they wouldn’t become bitter enemies, that anti-C wouldn’t kill the pro-coffee person merely because the value function was imperfectly aligned, and that the pro-coffee person would still derive substantial value from that universe.
Possibly the anti-coffee value would even be squashed by the rest of anti-C’s values, because it wasn’t reflectively endorsed by them. That’s another way in which I think anti-C can be “close enough” and things work out fine.