I think I have a complaint like “You seem to be comparing to a ‘perfect’ reward function, and lamenting how we will deviate from that. But in the absence of inner/outer alignment, that doesn’t make sense.”
I think this is close to our most core crux.
It seems to me that there are a bunch of standard arguments which you are ignoring because they’re formulated in an old frame that you’re trying to avoid. And those arguments in fact carry over just fine to your new frame if you put a little effort into thinking about the translation, but you’ve instead thrown the baby out with the bathwater without actually trying to make the arguments work in your new frame.
Like, if I have a reward signal that rewards X, then the old frame would say “alright, so the agent will optimize for X”. And you’re like “nope, that whole form of argument is invalid, hit ignore button”. But in fact it is usually very easy to take that argument and unpack it into something like “X has a short description in terms of natural abstractions, so starting from a base model and giving a feedback signal we should rapidly see some X-shards show up, and then the shards which best match X will be reinforced to exponentially higher weight (with respect to the bit-divergence between their proxy X’ and the actual X)”. And it seems like you are not even attempting to perform that translation, which I find very frustrating because I’m pretty sure you know this stuff plenty well to do it.
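A minimal sketch of the kind of weighting that argument describes, assuming (purely for illustration) a handful of proxies at made-up bit-divergences from the true X and a 2^-d weighting rule; none of the numbers are claims about actual SGD dynamics:

```python
# Hypothetical proxies, each tagged with a made-up bit-divergence from the "true" X.
# Both the divergences and the 2**-d weighting rule are illustrative assumptions.
proxies = {"X_exact": 0, "X_close": 1, "X_rough": 3, "X_loose": 6}

weights = {name: 2.0 ** -d for name, d in proxies.items()}
total = sum(weights.values())

for name, w in weights.items():
    print(f"{name}: {w / total:.1%} of total shard weight")
# Each extra bit of divergence halves a shard's share, so the best-matching
# proxy ends up with the large majority of the weight.
```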
It seems to me that there are a bunch of standard arguments which you are ignoring because they’re formulated in an old frame that you’re trying to avoid. And those arguments in fact carry over just fine to your new frame if you put a little effort into thinking about the translation, but you’ve instead thrown the baby out with the bathwater without actually trying to make the arguments work in your new frame.
I agree that we may need to be quite skillful in providing “good”/carefully considered reward signals on the data distribution actually fed to the AI. (I also think it’s possible we have substantial degrees of freedom there.) In this sense, we might need to give “robustly” good feedback.
However, one intuition which I hadn’t properly communicated was: to make OP’s story go well, we don’t need e.g. an outer objective which robustly grades every plan or sequence of events the AI could imagine, such that optimizing that objective globally produces good results. This isn’t just good reward signals on data distribution (e.g. real vs fake diamonds), this is non-upwards-error reward signals in all AI-imaginable situations, which seems thoroughly doomed to me. And this story avoids at least that problem, which I am relieved by. (And my current guess is that this “robust grading” problem doesn’t just reappear elsewhere, although I think there are still a range of other difficult problems remaining. See also my post Alignment allows “nonrobust” decision-influences and doesn’t require robust grading.)
And so I might have been saying “Hey isn’t this cool we can avoid the worst parts of Goodhart by exiting outer/inner as a frame” while thinking of the above intuition (but not communicating it explicitly, because I didn’t yet have sufficient clarity). But maybe you reacted “??? how does this avoid the need to reliably grade on-distribution situations, it’s totally nontrivial to do that and it seems quite probable that we have to.” Both seem true to me!
(I’m not saying this was the whole of our disagreement, but it seems like a relevant guess.)
When I first read this comment, I incorrectly understood it to say something like “If you were actually trying, you’d have generated the exponential error model on your own; the fact that you didn’t shows that you aren’t properly thinking about old arguments.” I now don’t think that’s what you meant. I think I finally[1] understand what you did mean, and I think you misunderstood what my original comment was trying to say because I wrote poorly and stream-of-consciousness.
Most importantly, I wasn’t saying something like “‘errors’ can’t exist because outer/inner alignment isn’t my frame, ignore.” I meant to communicate the following points:
I don’t know what a “perfect” reward function is in the absence of outer alignment, else I would know how to solve diamond alignment. But I’m happy to just discuss deviations from a proposed labelling scheme. (This is probably what we were already discussing, so this wasn’t meant to be a devastating rejoinder or anything.)
I’m not sure what you mean by the “exponential” model you mentioned elsewhere, or why it would be a fatal flaw if true. Please say more? (Hopefully in a way which makes it clear why your argument behaves differently in the presence of errors, because that would be one way to make your arguments especially legible to how I’m currently thinking about the situation.)
Given my best guess at your model (the exponential error model), I think your original comment seems too optimistic about my specific story (sure seems like exponential weighting would probably just break it, label errors or no) but too pessimistic about the story template (why is it a fatal flaw that can’t be fixed with a bit of additional thinking?).
I meant to ask something like “I don’t fully understand what you’re arguing re: points 1 and 2 (but I have some guesses), and think I disagree about 3; please clarify?” But instead (e.g. by saying things like “my complaint is...”) I perhaps communicated something like “because I don’t understand 2 in my native frame, your argument sucks.” And you were like “Come on, you didn’t even try, you could have totally translated 2. Worrying that you apparently didn’t.”
I think that I left an off-the-cuff comment which might have been appropriate as a Discord message (with real-time clarification), but not as a LessWrong comment. Oops.
Elaborating points 1 and 3 above:
Point 1. In outer/inner, if you “perfectly label” reward events based on whether the agent approaches the diamond, you’re “done” as far as the outer alignment part goes. In order to make the agent actually care about approaching diamonds, we would then turn to inner alignment techniques / ideas. It might make sense to call this labelling “perfect” as far as specifying the outer objective for those scenarios (e.g. when that objective is optimized, the agent actually approaches the diamond).
But if we aren’t aiming for outer/inner alignment, and instead are just considering the (reward schedule) → (inner value composition) mapping, then I worry that my post’s original usage of “perfect” was misleading. On my current frame, a perfect reward schedule would be one which actually gets diamond-values into the agent. The schedule I posted is probably not the best way to do that, even if all goes as planned. I want to be careful not to assume the “perfection” of “+1 when it does in fact approach a real diamond which it can see”, even if I can’t currently point to better alternative reward schedules (e.g. “+x reward in some weird situation”). (This is what I was getting at with “I don’t understand why ‘perfect labeling’ is the thing to talk about, here.”)
What you probably meant by “errors” was “divergences from the reward function outlined in the original post.” This is totally reasonable and important to talk about, but at least I want to clarify for myself and other readers that this is what we’re talking about, and not assuming that my intended reward function was actually “perfect.” (Probably it’s fine to keep talking about “perfect labelling” as long as this point has been made explicit.)
Point 3. Under my best guess of what you mean (which did end up being roughly right, about the exponential bit-divergence), I think your original comment seemed too optimistic about the original story going well given “perfect” labelling. This is one thing I meant by “I don’t understand why ‘perfect labeling’ would ensure your shard-formation counterarguments don’t hold.”
If the situation value-distribution is actually exponential in bit-divergence, I’d expect way less wiggle room on value shard formation, because that’s going to mean that way more situations are controlled by relatively few subshards (or maybe even just one). Possibly the agent just ends up with fewer terms in its reflectively stable utility function, because fewer shards/subshards activate during the values handshake. (But I’m tentative about all this, haven’t sketched out a concrete failure scenario yet given exponential model! Just a hunch I remember having.)
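A toy illustration of why the exponential model would leave so little wiggle room, comparing (with made-up numbers) exponential and linear falloff across ten hypothetical subshards:

```python
# Toy comparison: how concentrated is decision influence under exponential vs.
# linear falloff in divergence? All numbers here are illustrative assumptions.
divergences = range(10)  # ten subshards, increasingly far from the target

exp_weights = [2.0 ** -d for d in divergences]
lin_weights = [1.0 - d / 10 for d in divergences]

def top_share(weights, k=2):
    """Fraction of total influence held by the k strongest subshards."""
    return sum(sorted(weights, reverse=True)[:k]) / sum(weights)

print(f"exponential: top 2 subshards hold {top_share(exp_weights):.0%} of influence")
print(f"linear:      top 2 subshards hold {top_share(lin_weights):.0%} of influence")
# Under exponential falloff a couple of subshards dominate most decisions;
# under linear falloff influence is spread across many more of them.
```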
Again, it was very silly of me to expect my original comment to communicate these points. At the time of writing, I was trying to unpack some promising-seeming feelings and elaborate them over lunch.
My original guess at your complaint was “How could you possibly have not generated the exponential weight hypothesis on your own?”, and I was like what the heck, it’s a hypothesis, sure… but why should I have pinned down that one? What’s wrong with my “linear in error proportion for that kind of situation, exponential in ontology-distance at time of update” hypothesis, why doesn’t that count as a thing-to-have-generated? This was a big part of why I was initially so confused about your complaint.
And then several people said they thought your comment was importantly correct-seeming, and I was like “no way, how can everyone else already have such a developed opinion on exponential vs linear vs something-else here? Surely this is their first time considering the question? Why am I getting flak about not generating that particular hypothesis, how does that prove I’m ‘not trying’ in some important way?”
To be clear, I don’t think the exponential asymptotics specifically are obvious (sorry for implying that), but I also don’t think they’re all that load-bearing here. I intended more to gesture at the general cluster of reasons to expect “reward for proxy, get an agent which cares about the proxy”; there’s lots of different sets of conditions any of which would be sufficient for that result. Maybe we just train the agent for a long time with a wide variety of data. Maybe it turns out that SGD is surprisingly efficient, and usually finds a global optimum, so shards which don’t perfectly fit the proxy die. Maybe the proxy is a more natural abstraction than the thing it was proxying for, and the dynamics between shards competing at decision-time are winner-take-all. Maybe dynamics between shards are winner-take-all for some other reason, and a shard which captures the proxy will always have at least a small selective advantage. Etc.
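One way to picture the “small selective advantage” case in the last sentence: a toy compounding model (not a claim about actual SGD dynamics) in which the proxy-matching shard’s update is slightly larger every step:

```python
# Toy compounding model: the proxy-matching shard gets a slightly larger
# multiplicative update each step. The 2% edge and the multiplicative rule
# are illustrative assumptions.
proxy_shard, other_shard = 1.0, 1.0
for _ in range(500):
    proxy_shard *= 1.02  # small, persistent selective advantage
    other_shard *= 1.00  # no advantage

share = proxy_shard / (proxy_shard + other_shard)
print(f"after 500 updates the proxy shard holds {share:.2%} of the weight")
# A persistent small edge compounds geometrically, so shard competition ends up
# effectively winner-take-all even without any single decisive update.
```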
Point 3. Under my best guess of what you mean (which did end up being roughly right, about the exponential bit-divergence), I think your original comment seemed too optimistic about the original story going well given “perfect” labelling. This is one thing I meant by “I don’t understand why ‘perfect labeling’ would ensure your shard-formation counterarguments don’t hold.”
If the situation value-distribution is actually exponential in bit-divergence, I’d expect way less wiggle room on value shard formation, because that’s going to mean that way more situations are controlled by relatively few subshards (or maybe even just one). Possibly the agent just ends up with fewer terms in its reflectively stable utility function, because fewer shards/subshards activate during the values handshake.
It sounds like the difference between one or a few shards dominating each decision, vs a large ensemble, is very central and cruxy to you. And I still don’t see why that matters, so maybe that’s the main place to focus.
You gestured at some intuitions about that in this comment (which I’m copying below to avoid scrolling to different parts of the thread-tree), and I’d be interested to see more of those intuitions extracted.
I think there’s something like “why are human values so ‘reasonable’, such that [TurnTrout inference alert!] someone can like coffee and another person won’t and that doesn’t mean they would extrapolate into bitter enemies until the end of Time?”, and the answer seems like it’s gonna be because they don’t have one criterion of Perfect Value that is exactly right over which they argmax, but rather they do embedded, reflective heuristic search guided by thousands of subshards (shiny objects, diamonds, gems, bright objects, objects, power, seeing diamonds, knowing you’re near a diamond, …), such that removing a single subshard does not catastrophically exit the regime of Perfect Value.
I think this is one proto-intuition why Goodhart-arguments seem Wrong to me, like they’re from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don’t think I’ve properly communicated my feelings in this comment, but hopefully it’s better than nothing)
I have multiple different disagreements with this, and I’m not sure which are relevant yet, so I’ll briefly state a few:
For the coffee/bitter enemies thing, this doesn’t seem to me like a phenomenon which has anything to do with shards, it’s just a matter of type-signatures. A person who “likes coffee” likes to drink coffee; they don’t particularly want to fill the universe with coffee, they don’t particularly care whether anyone else likes to drink coffee (and nobody else cares whether they like to drink coffee) so there’s not really much reason for that preference to generate conflict. It’s not a disagreement over what-the-world-should-look-like; that’s not the type-signature of the preference.
Embedded, reflective heuristic search is not incompatible with argmaxing over one (approximate, implicit) value function; it’s just a particular family of distributed algorithms for argmaxing.
It seems like, in humans, removing a single subshard does catastrophically exit the regime of value. For instance, there’s Eliezer’s argument from the sequences that just removing boredom results in a dystopia.
It sounds like the difference between one or a few shards dominating each decision, vs a large ensemble, is very central and cruxy to you. And I still don’t see why that matters, so maybe that’s the main place to focus.
The extremely basic intuition is that all else equal, the more interests present at a bargaining table, the greater the chance that some of the interests are aligned.
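A bare-bones version of that intuition, assuming each interest independently has some small chance p of being aligned (both the independence and the particular p are simplifying assumptions):

```python
# P(at least one aligned interest) = 1 - (1 - p)**n, assuming independence.
p = 0.05  # illustrative per-interest chance of being aligned
for n in (1, 10, 50, 200):
    print(f"{n:>3} interests at the table: {1 - (1 - p) ** n:.1%} chance at least one is aligned")
```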
My values are also risk-averse (I’d much rather take a 100% chance of 10% of the lightcone than a 20% chance of 100% of the lightcone), and my best guess is that internal values handshakes are ~linear in “shard strength” after some cutoff where the shards are at all reflectively endorsed (my avoid-spiders shard might not appreciably shape my final reflectively stable values). So more subshards seems like great news to me, all else equal, with more shard variety increasing the probability that part of the system is motivated the way I want it to be.
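A concrete version of that risk-aversion preference, using a square-root utility over lightcone share as one illustrative concave choice (the particular utility function is an assumption, not anyone’s stated values):

```python
from math import sqrt

# Comparing "100% chance of 10% of the lightcone" vs "20% chance of 100%".
ev_sure, ev_gamble = 1.00 * 0.10, 0.20 * 1.00
eu_sure, eu_gamble = 1.00 * sqrt(0.10), 0.20 * sqrt(1.00)

print(f"expected raw share: sure={ev_sure:.2f}  gamble={ev_gamble:.2f}")  # the gamble looks better
print(f"expected utility:   sure={eu_sure:.2f}  gamble={eu_gamble:.2f}")  # the sure thing wins
```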
(This isn’t fully expressing my intuition, here, but I figured I’d say at least a little something to your comment right now)
I’m not going to go into most of the rest now, but:
For the coffee/bitter enemies thing, this doesn’t seem to me like a phenomenon which has anything to do with shards, it’s just a matter of type-signatures. A person who “likes coffee” likes to drink coffee; they don’t particularly want to fill the universe with coffee, they don’t particularly care whether anyone else likes to drink coffee (and nobody else cares whether they like to drink coffee) so there’s not really much reason for that preference to generate conflict. It’s not a disagreement over what-the-world-should-look-like; that’s not the type-signature of the preference.
I think that that does have to do with shards. Liking to drink coffee is the result of a shard, of a contextual influence on decision-making (the influence to drink coffee), which in particular activates in certain situations to pull me into a future in which I drank coffee.
I’m also fine considering C: “a person who is OK with other people drinking coffee” and anti-C: “a person with otherwise the same values but who isn’t OK with other people drinking coffee.” I think that the latter would inconvenience the former (to the extent that coffee was important to the former), but that they wouldn’t become bitter enemies, that anti-C wouldn’t kill the pro-coffee person because the value function was imperfectly aligned, that the pro-coffee person would still derive substantial value from that universe.
Possibly the anti-coffee value would even be squashed by the rest of anti-C’s values, because the anti-coffee value wasn’t reflectively endorsed by the rest of anti-C’s values. That’s another way in which I think anti-C can be “close enough” and things work out fine.
EDIT 2: The original comment was too harsh. I’ve struck the original below. Here is what I think I should have said:
I think you raise a valuable object-level point here, which I haven’t yet made up my mind on. That said, I think this meta-level commentary is unpleasant and mostly wrong. I’d appreciate if you wouldn’t speculate on my thought process like that, and would appreciate if you could edit the tone-relevant parts.
Warning: This comment, and your previous comment, violate my comment section guidelines: “Reign of terror // Be charitable.” You have made and publicly stated a range of unnecessary, unkind, and untrue inferences about my thinking process. You have also made non-obvious-to-me claims of questionable-to-me truth value, which you also treat as exceedingly obvious. Please edit these two comments to conform to my civility guidelines.
(EDIT: Thanks. I look forward to resuming object-level discussion!)
After more reflection, I now think that this moderation comment was too harsh. First, the parts I think I should have done differently:
(1) Realized that hardly anyone reads commenting guidelines anyways, let alone expects them to be enforced.
(2) Realized that it’s probably ambiguous what counts as “charitable” or not, even though (illusion of transparency) it felt so obvious to me that this counted as “not that.”
(3) Realized that predictably I would later consider the incident to be less upsetting than in the moment, and that John may not have been aware that I find this kind of situation unusually upsetting.
(4) Therefore, I should have said something like “I think you raise a valuable object-level point here, which I haven’t yet made up my mind on. That said, I think this meta-level commentary is unpleasant and mostly wrong. I’d appreciate if you wouldn’t speculate on my thought process like that, and would appreciate if you could edit the tone-relevant parts.”
I’m striking the original warning, putting in (4), and I encourage John to unredact his comments (but that’s up to him).
I’ve thought more about what my policy should be going forward. What kind of space do I want my comment section to be? First, I want to be able to say “This seems wrong, and here’s why”, and other people can say the same back to me, and one or more of us can end up at the truth faster. Second, it’s also important that people know that, going forward, engaging with me in (what feels to them like) good-faith will not be randomly slapped with a moderation warning because they annoyed me.
Third, I want to feel comfortable in my interactions in my comment section. My current plan is:
(1) If someone comments something which feels personally uncharitable to me (a rather rare occurrence, what with the hundreds of comments in the last year since this kind of situation last happened), then I’ll privately message them, explain my guidelines, and ask that they tweak tone / write more on the object-level / not do the particular thing.[1]
(2) If necessary, I’ll also write a soft-ask (like (4) above) as a comment.
(3) In cases where this is just getting ignored and the person is being antagonistic, I will indeed post a starker warning and then possibly just delete comments.
I had spoken with John privately before posting the warning comment. I think my main mistake was jumping to (3) instead of doing more of (1) and (2).
Oh, huh, I think this moderation action makes me substantially less likely to comment further on your posts, FWIW. It’s currently well within your rights to do so, and I am on the margin excited about more people moderating things, but I feel hesitant participating with the current level of norm-specification + enforcement.
I also turned my strong-upvote into a small-upvote, since I have less trust in the comment section surfacing counterarguments, which feels particularly sad for this post (e.g. I was planning to respond to your comment with examples of past arguments this post is ignoring, but am now unlikely to do so).
Again, I think that’s fine, but I think posts with idiosyncratic norm enforcement should get less exposure, or at least not be canonical references. Historically we’ve decided to not put posts on frontpage when they had particularly idiosyncratic norm enforcement. I think that’s the wrong call here, but not confident.
Sorry, I’m confused; for my own education, can you explain why these civility guidelines aren’t epistemically suicidal? Personally, I want people like John Wentworth to comment on my posts to tell me their inferences about my thinking process; moreover, controlling for quality, “unkind” inferences are better, because I learn more from people telling me what I’m doing wrong, than from people telling me what I’m already doing right. What am I missing? Please be unkind.