When I first read this comment, I incorrectly understood it to say something like “If you were actually trying, you’d have generated the exponential error model on your own; the fact that you didn’t shows that you aren’t properly thinking about old arguments.” I now don’t think that’s what you meant. I think I finally[1] understand what you did mean, and I think you misunderstood what my original comment was trying to say because I wrote poorly and stream-of-consciousness.
Most importantly, I wasn’t saying something like “‘errors’ can’t exist because outer/inner alignment isn’t my frame, ignore.” I meant to communicate the following points:
1. I don’t know what a “perfect” reward function is in the absence of outer alignment, else I would know how to solve diamond alignment. But I’m happy to just discuss deviations from a proposed labelling scheme. (This is probably what we were already discussing, so this wasn’t meant to be a devastating rejoinder or anything.)
2. I’m not sure what you mean by the “exponential” model you mentioned elsewhere, or why it would be a fatal flaw if true. Please say more? (Hopefully in a way which makes it clear why your argument behaves differently in the presence of errors, because that would be one way to make your arguments especially legible to how I’m currently thinking about the situation.)
3. Given my best guess at your model (the exponential error model), I think your original comment seems too optimistic about my specific story (sure seems like exponential weighting would probably just break it, label errors or no) but too pessimistic about the story template (why is it a fatal flaw that can’t be fixed with a bit of additional thinking?).
I meant to ask something like “I don’t fully understand what you’re arguing re: points 1 and 2 (but I have some guesses), and think I disagree about 3; please clarify?” But instead (e.g. by saying things like “my complaint is...”) I perhaps communicated something like “because I don’t understand 2 in my native frame, your argument sucks.” And you were like “Come on, you didn’t even try, you could have totally translated 2. Worrying that you apparently didn’t.”
I think that I left an off-the-cuff comment which might have been appropriate as a Discord message (with real-time clarification), but not as a LessWrong comment. Oops.
Elaborating points 1 and 3 above:
Point 1. In outer/inner, if you “perfectly label” reward events based on whether the agent approaches the diamond, you’re “done” as far as the outer alignment part goes. In order to make the agent actually care about approaching diamonds, we would then turn to inner alignment techniques / ideas. It might make sense to call this labelling “perfect” as far as specifying the outer objective for those scenarios (e.g. when that objective is optimized, the agent actually approaches the diamond).
But if we aren’t aiming for outer/inner alignment, and instead are just considering the (reward schedule) → (inner value composition) mapping, then I worry that my post’s original usage of “perfect” was misleading. On my current frame, a perfect reward schedule would be one which actually gets diamond-values into the agent. The schedule I posted is probably not the best way to do that, even if all goes as planned. I want to be careful not to assume the “perfection” of “+1 when it does in fact approach a real diamond which it can see”, even if I can’t currently point to better alternative reward schedules (e.g. “+x reward in some weird situation”). (This is what I was getting at with “I don’t understand what ‘perfect labeling’ is the thing to talk about, here.”)
What you probably meant by “errors” was “divergences from the reward function outlined in the original post.” This is totally reasonable and important to talk about, but at least I want to clarify for myself and other readers that this is what we’re talking about, and not assuming that my intended reward function was actually “perfect.” (Probably it’s fine to keep talking about “perfect labelling” as long as this point has been made explicit.)
Point 3. Under my best guess of what you mean (which did end up being roughly right, about the exponential bit-divergence), I think your original comment seemed too optimistic about the original story going well given “perfect” labelling. This is one thing I meant by “I don’t understand why ‘perfect labeling’ would ensure your shard-formation counterarguments don’t hold.”
If the situation value-distribution is actually exponential in bit-divergence, I’d expect way less wiggle room on value shard formation, because that’s going to mean that way more situations are controlled by relatively few subshards (or maybe even just one). Possibly the agent just ends up with fewer terms in its reflectively stable utility function, because fewer shards/subshards activate during the values handshake. (But I’m tentative about all this, haven’t sketched out a concrete failure scenario yet given exponential model! Just a hunch I remember having.)
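A toy sketch of the "less wiggle room" hunch above, under loudly stated assumptions: the two weighting rules, the halving-per-bit rate, and the ten subshards at 0 to 9 bits of divergence are all made up for illustration and are not a model anyone in this thread has specified. It only shows how much of the total influence the closest-fitting subshards hold under each rule.

```python
def normalized(weights):
    """Scale weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def linear_weights(divergences):
    # Hypothetical rule: influence falls off linearly with divergence.
    ceiling = max(divergences) + 1
    return normalized([ceiling - d for d in divergences])


def exponential_weights(divergences):
    # Hypothetical rule: influence halves for every extra bit of divergence.
    return normalized([2.0 ** (-d) for d in divergences])


def top_k_share(weights, k=1):
    """Fraction of total influence held by the k strongest subshards."""
    return sum(sorted(weights, reverse=True)[:k])


# Ten illustrative subshards sitting at 0..9 bits of divergence.
divergences = list(range(10))
lin = linear_weights(divergences)
exp = exponential_weights(divergences)

for k in (1, 3):
    print(f"top {k}: linear={top_k_share(lin, k):.2f}  "
          f"exponential={top_k_share(exp, k):.2f}")
```

In this toy setup the single best-fitting subshard holds roughly 18% of the influence under the linear rule and roughly 50% under the exponential one, which is the sense in which "way more situations are controlled by relatively few subshards."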
Again, it was very silly of me to expect my original comment to communicate these points. At the time of writing, I was trying to unpack some promising-seeming feelings and elaborate them over lunch.
My original guess at your complaint was “How could you possibly have not generated the exponential weight hypothesis on your own?”, and I was like what the heck, it’s a hypothesis, sure… but why should I have pinned down that one? What’s wrong with my “linear in error proportion for that kind of situation, exponential in ontology-distance at time of update” hypothesis, why doesn’t that count as a thing-to-have-generated? This was a big part of why I was initially so confused about your complaint.
And then several people said they thought your comment was importantly correct-seeming, and I was like “no way, how can everyone else already have such a developed opinion on exponential vs linear vs something-else here? Surely this is their first time considering the question? Why am I getting flak about not generating that particular hypothesis, how does that prove I’m ‘not trying’ in some important way?”
To be clear, I don’t think the exponential asymptotics specifically are obvious (sorry for implying that), but I also don’t think they’re all that load-bearing here. I intended more to gesture at the general cluster of reasons to expect “reward for proxy, get an agent which cares about the proxy”; there are lots of different sets of conditions, any of which would be sufficient for that result. Maybe we just train the agent for a long time with a wide variety of data. Maybe it turns out that SGD is surprisingly efficient, and usually finds a global optimum, so shards which don’t perfectly fit the proxy die. Maybe the proxy is a more natural abstraction than the thing it was proxying for, and the dynamics between shards competing at decision-time are winner-take-all. Maybe dynamics between shards are winner-take-all for some other reason, and a shard which captures the proxy will always have at least a small selective advantage. Etc.
Point 3. Under my best guess of what you mean (which did end up being roughly right, about the exponential bit-divergence), I think your original comment seemed too optimistic about the original story going well given “perfect” labelling. This is one thing I meant by “I don’t understand why ‘perfect labeling’ would ensure your shard-formation counterarguments don’t hold.”
If the situation value-distribution is actually exponential in bit-divergence, I’d expect way less wiggle room on value shard formation, because that’s going to mean that way more situations are controlled by relatively few subshards (or maybe even just one). Possibly the agent just ends up with fewer terms in its reflectively stable utility function, because fewer shards/subshards activate during the values handshake.
It sounds like the difference between one or a few shards dominating each decision, vs a large ensemble, is very central and cruxy to you. And I still don’t see why that matters, so maybe that’s the main place to focus.
You gestured at some intuitions about that in this comment (which I’m copying below to avoid scrolling to different parts of the thread-tree), and I’d be interested to see more of those intuitions extracted.
I think there’s something like “why are human values so ‘reasonable’, such that [TurnTrout inference alert!] someone can like coffee and another person won’t and that doesn’t mean they would extrapolate into bitter enemies until the end of Time?”, and the answer seems like it’s gonna be because they don’t have one criterion of Perfect Value that is exactly right over which they argmax, but rather they do embedded, reflective heuristic search guided by thousands of subshards (shiny objects, diamonds, gems, bright objects, objects, power, seeing diamonds, knowing you’re near a diamond, …), such that removing a single subshard does not catastrophically exit the regime of Perfect Value.
I think this is one proto-intuition why Goodhart-arguments seem Wrong to me, like they’re from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don’t think I’ve properly communicated my feelings in this comment, but hopefully it’s better than nothing)
I have multiple different disagreements with this, and I’m not sure which are relevant yet, so I’ll briefly state a few:
For the coffee/bitter enemies thing, this doesn’t seem to me like a phenomenon which has anything to do with shards, it’s just a matter of type-signatures. A person who “likes coffee” likes to drink coffee; they don’t particularly want to fill the universe with coffee, they don’t particularly care whether anyone else likes to drink coffee (and nobody else cares whether they like to drink coffee) so there’s not really much reason for that preference to generate conflict. It’s not a disagreement over what-the-world-should-look-like; that’s not the type-signature of the preference.
Embedded, reflective heuristic search is not incompatible with argmaxing over one (approximate, implicit) value function; it’s just a particular family of distributed algorithms for argmaxing.
It seems like, in humans, removing a single subshard does catastrophically exit the regime of value. For instance, there’s Eliezer’s argument from the sequences that just removing boredom results in a dystopia.
It sounds like the difference between one or a few shards dominating each decision, vs a large ensemble, is very central and cruxy to you. And I still don’t see why that matters, so maybe that’s the main place to focus.
The extremely basic intuition is that all else equal, the more interests present at a bargaining table, the greater the chance that some of the interests are aligned.
My values are also risk-averse (I’d much rather take a 100% chance of 10% of the lightcone than a 20% chance of 100% of the lightcone), and my best guess is that internal values handshakes are ~linear in “shard strength” after some cutoff where the shards are at all reflectively endorsed (my avoid-spiders shard might not appreciably shape my final reflectively stable values). So more subshards seems like great news to me, all else equal, with more shard variety increasing the probability that part of the system is motivated the way I want it to be.
(This isn’t fully expressing my intuition, here, but I figured I’d say at least a little something to your comment right now)
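The two intuitions above (more interests at the table, and risk-averse values) can be put into a small worked example. This is a sketch under assumptions that are not in the comment: each interest is treated as independently aligned with some probability `p`, and risk aversion is modelled with a square-root utility over the share of the lightcone. Both choices are illustrative stand-ins.

```python
import math


def p_at_least_one_aligned(p, n):
    """Chance that at least one of n interests is aligned,
    assuming (hypothetically) each is aligned independently with probability p."""
    return 1.0 - (1.0 - p) ** n


def expected_utility(lottery, utility):
    """lottery is a list of (probability, share_of_lightcone) pairs."""
    return sum(prob * utility(share) for prob, share in lottery)


sure_thing = [(1.0, 0.10)]              # 100% chance of 10% of the lightcone
gamble = [(0.20, 1.00), (0.80, 0.00)]   # 20% chance of 100% of the lightcone


def identity(x):
    return x


# More interests at the table -> higher chance that some interest is aligned.
for n in (1, 10, 100):
    print(n, round(p_at_least_one_aligned(0.05, n), 3))

# A risk-neutral evaluator prefers the gamble; a concave (risk-averse) one does not.
print("risk-neutral:", expected_utility(sure_thing, identity),
      expected_utility(gamble, identity))
print("sqrt utility:", expected_utility(sure_thing, math.sqrt),
      expected_utility(gamble, math.sqrt))
```

The gamble has the higher expected share (0.2 versus 0.1), so preferring the sure thing only makes sense for a concave utility. That is also why a higher chance that some shard is aligned counts as good news here, even if that shard ends up controlling only part of the outcome.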
I’m not going to go into most of the rest now, but:
For the coffee/bitter enemies thing, this doesn’t seem to me like a phenomenon which has anything to do with shards, it’s just a matter of type-signatures. A person who “likes coffee” likes to drink coffee; they don’t particularly want to fill the universe with coffee, they don’t particularly care whether anyone else likes to drink coffee (and nobody else cares whether they like to drink coffee) so there’s not really much reason for that preference to generate conflict. It’s not a disagreement over what-the-world-should-look-like; that’s not the type-signature of the preference.
I think that that does have to do with shards. Liking to drink coffee is the result of a shard, of a contextual influence on decision-making (the influence to drink coffee), and in particular activates in certain situations to pull me into a future in which I drank coffee.
I’m also fine considering “A person who is OK with other people drinking coffee” and anti-C: “a person with otherwise the same values but who isn’t OK with other people drinking coffee.” I think that the latter would inconvenience the former (to the extent that coffee was important to the former), but that they wouldn’t become bitter enemies, that anti-C wouldn’t kill the pro-coffee person because the value function was imperfectly aligned, that the pro-coffee person would still derive substantial value from that universe.
Possibly the anti-coffee value would even be squashed by the rest of anti-C’s values, because the anti-coffee value wasn’t reflectively endorsed by the rest of anti-C’s values. That’s another way in which I think anti-C can be “close enough” and things work out fine.