TurnTrout comments on A shot at the diamond-alignment problem

TurnTrout 7 Oct 2022 17:55 UTC
LW: 2 AF: 2
First: the agent probably has the concept of diamond from SSL+IL, but that’s different from concepts like producing diamond, approaching diamond (which in turn requires a self-concept or at least a concept of the avatar it’s controlling), etc. During training, those sorts of more-complex concepts are probably built up out of their components (like e.g. “production” and “diamond”); the actual goals or behaviors encoded in a shard have to be built up in whatever “internal language” the agent has from the SSL/IL training.
So the question isn’t “does the agent have the concept of diamond/label?”, the question is how short the relevant “sentences” are in terms of the concepts it has. Neither will be just one “word”.
This is already my model and was intended as part of my communicated reasoning. Why do you think it’s an error in my reasoning? You’ll notice I argued “If diamond”, and about hooking that diamond predicate into its approach-subroutines (learned via IL). (ETA: I don’t think you need a self-model to approach a diamond, or to “value” that in the appropriate sense. To value diamonds being near you, you can have representations of the space nearby, so you need a nearby representation, perhaps.)
label-errors
I think this is not the right term to use, and I think it might be skewing your analysis. This is not a supervised learning regime with exact gradients towards a fixed label. The question is what gets upweighted by the batch PG gradients, batching over the reward events. Let me exaggerate the kind of “error rates” I think you’re anticipating:
- Suppose I hit the reward 99% of the time for cut gems, and 90% of the time for uncut gems.
  - What’s supposed to go wrong? The agent somewhat more strongly steers towards cut gems?
- Suppose I’m grumpy for the first 5 minutes and only hit the reward button 95% as often as I should otherwise. What’s supposed to happen next?
(If these errors aren’t representative, can you please provide a concrete and plausible scenario?)
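To make the two exaggerated cases concrete, here is a minimal toy sketch, not a model of the actual training story. It assumes things the discussion above does not specify: each gem type gets its own independent "approach" logit, reward is 1 with some hit rate when the agent approaches and 0 otherwise, and the update is the exact expected policy gradient rather than a sampled batch. The names `train_approach_logit`, `hit_rates`, `STEPS`, and the learning rate are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_approach_logit(hit_rates, lr=0.5):
    """Return the 'approach' logit after one expected-gradient step per entry.

    hit_rates[t] is the (hypothetical) probability that approaching the gem
    is rewarded with 1 at step t; otherwise the reward is 0. Expected reward
    is hit_rate * sigmoid(theta), so its gradient in theta is
    hit_rate * p * (1 - p).
    """
    theta = 0.0  # start indifferent: 50% chance of approaching
    for hit_rate in hit_rates:
        p = sigmoid(theta)
        theta += lr * hit_rate * p * (1.0 - p)
    return theta


STEPS = 200
# Case 1: reward lands 99% of the time for cut gems, 90% for uncut gems.
theta_cut = train_approach_logit([0.99] * STEPS)
theta_uncut = train_approach_logit([0.90] * STEPS)
# Case 2: a "grumpy" overseer gives 95% of the usual hit rate for the
# first 20 steps only, then returns to the cut-gem rate.
theta_grumpy = train_approach_logit([0.99 * 0.95] * 20 + [0.99] * (STEPS - 20))

for name, theta in [("cut", theta_cut), ("uncut", theta_uncut), ("grumpy", theta_grumpy)]:
    print(f"{name:>6}: logit={theta:.3f}  P(approach)={sigmoid(theta):.4f}")
```

Under these assumptions all three logits end up strongly positive, with the cut-gem logit only slightly ahead of the uncut one and the grumpy run nearly indistinguishable from the cut-gem run. The sketch only covers missed rewards; it says nothing about reward events that should not have happened, and nothing about shared parameters between the two behaviors.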
- johnswentworth 7 Oct 2022 18:49 UTC
  LW: 6 AF: 6
  Let me exaggerate the kind of “error rates” I think you’re anticipating:
  Suppose I hit the reward 99% of the time for cut gems, and 90% of the time for uncut gems.
  What’s supposed to go wrong? The agent somewhat more strongly steers towards cut gems?
  Suppose I’m grumpy for the first 5 minutes and only hit the reward button 95% as often as I should otherwise. What’s supposed to happen next?
  (If these errors aren’t representative, can you please provide a concrete and plausible scenario?)
  Both of these examples are focused on one error type: the agent does not receive a reward in a situation which we like. That error type is, in general, not very dangerous.
  The error type which is dangerous is for an agent to receive a reward in a situation which we don’t like. For instance, receiving reward in a situation involving a convincing-looking fake diamond. And then a shard which hooks up its behavior to things-which-look-like-diamonds (which is probably at least as natural an abstraction as diamond) gets more weight relative to the diamond-shard, and so when those two shards disagree later the things-which-look-like-diamonds shard wins.
  Note that it would not be at all surprising for the AI to have a prior concept of real-diamonds-or-fake-diamonds-which-are-good-enough-to-fool-most-humans, because that is a cluster of stuff which behaves similarly in many places in the real world—e.g. they’re both used for similar jewelry.
  And sure, you try to kinda patch that by including some correctly-labelled things-which-look-like-diamonds in training, but that only works insofar as they’re sufficiently-obviously-not-diamond that the human labeller can tell (and depends on the ratio of correct to incorrect labels, etc).
  (Also, some moderately uncharitable psychologizing, and I apologize if it’s wrong: I find it suspicious that the examples of label errors you generated are both of the non-dangerous type. This is a place where I’d expect you to already have some intuition for what kind of errors are the dangerous ones, especially when you put on e.g. your Eliezer hat. That smells like a motivated search, or at least a failure to actually try to look for the problems with your argument.)
  - TurnTrout 15 Oct 2022 23:02 UTC
    LW: 6 AF: 6
    (Also, some moderately uncharitable psychologizing, and I apologize if it’s wrong: I find it suspicious that the examples of label errors you generated are both of the non-dangerous type. This is a place where I’d expect you to already have some intuition for what kind of errors are the dangerous ones, especially when you put on e.g. your Eliezer hat. That smells like a motivated search, or at least a failure to actually try to look for the problems with your argument.)
    I want to talk about several points related to this topic. I don’t mean to claim that you were making points directly related to all of the below bullet points. This just seems like a good time to look back and assess and see what’s going on for me internally, here. This seems like the obvious spot to leave the analysis.
    At the time of writing, I wasn’t particularly worried about the errors you brought up.
    I am a little more worried now in expectation, both under the currently low-credence worlds where I end up agreeing with your exponential argument, and in the ~linear hypothesis worlds, since I think I can still search harder for worrying examples which IMO neither of us have yet proposed. Therefore I’ll just get a little more pessimistic immediately, in the latter case.
    If I had been way more worried about “reward behavior we should have penalized”, I would have indeed just been less likely to raise the more worrying failure points, but not super less likely. I do assess myself as flawed, here, but not as that flawed.
    I think the typical outcome would be something like “TurnTrout starts typing a list full of weak flaws, notices a twinge of motivated reasoning, has half a minute of internal struggle and then types out the more worrisome errors, and, after a little more internal conflict, says that John has a good point and that he wants to think about it more.”
    I could definitely buy that I wouldn’t be that virtuous, though, and that I would need a bit of external nudging to consider the errors, or else a few more days on my own for the issue to get raised to cognitive-housekeeping. After that happened a few times, I’d notice the overall problem and come up with a plan to fix it.
    Obviously, I have at this point noticed (at least) my counterfactual mistake in the nearby world where I already agreed with you, and therefore have a plan to fix and remove that flaw.
    I think you are right in guessing that I could use more outer/inner heuristics to my advantage, that I am missing a few tools on my belt. Thanks for pointing that out.
    I don’t think that motivated cognition has caused me to catastrophically miss key considerations from e.g. “standard arguments” in a way which has predictably doomed key parts of my reasoning.
    Why I think this: I’ve spent a little while thinking about what the catastrophic error would be, conditional on it existing, and nothing’s coming up for the moment.
    I’d more expect there to be some sequence of slight ways I ignored important clues that other people gave, and where I motivatedly underupdated. But also this is a pretty general failure mode, and I think it’d be pretty silly to call a halt without any positive internal evidence that I actually have done this. (EDIT: In a specific situation which I remember and can correct, as opposed to having a vague sense that yeah I’ve probably done this several times in the last few months. I’ll just keep an eye out.)
    Rather, I think that if I spend three or so days typing up a document, and someone like John Wentworth thinks carefully about it, then that person will surface at least a few considerations I’d missed, more probably using tools not native to my current frame.
    I think a lot of the “Why didn’t you realize the ‘reward for proxy, get an agent which cares about the proxy’?” part is just that John and I just seem to have very different models of SGD dynamics, and that if I had his model, the reasoning which produced the post would have also produced the failure modes John has hypothesized.
    This feels “fine” in that that’s part of the point of sharing my ideas with other people—that smart people will surface new considerations or arguments. This feels “not fine” in the sense that I’d like to not miss considerations, of course.
    This also feels “fine” in that, yes, I wanted to get this essay out before never arrives, and usually I take too long to hit “publish”, and I’m still very happy with the essay overall. I’m fine with other people finding new considerations (e.g. the direct reward for diamond synthesis, or zooming in on how much perfect labelling is required).
    If it turns out there was some crucial existing argument which I did miss, I think I’ll go “huh” but not really be like “wow that hovered at the edge of my cognition but I denied it for motivated reasons.”
    I am way more worried about how much of my daily cognition is still socially motivated, and I do consider that to be a “stop drop and roll”-level fuckup on my part.
    I think there’s not just now-obvious things here like “I get very defensive in public settings in specific situations”, but a range of situations in which I subconsciously aim to persuade or justify my positions, instead of just explaining what I think and why, what I disagree with and why; that some subconscious parts of me look for ways to look good or win an argument; that I have rather low trust in certain ways and that makes it hard for me sometimes; etc.
    I think that I am above-average here, but I have very high standards for myself and consider my current skill in this area to be very inadequate.
    For the record: I welcome well-meaning private feedback on what I might be biased about or messing up. On the other hand, having the feedback be public just pushes some of my buttons in a way which makes the situation hard for me to handle. I aspire for this not to be the case about me. That aspiration is not yet realized.
    I’ve worked hard to make this analysis honest and not optimized to make me look good or less silly. Probably I’ve still failed at least a little. Possibly I’ve missed something important. But this is what I’ve got.
    - johnswentworth 16 Oct 2022 2:44 UTC
      LW: 4 AF: 4
      Kudos for writing all that out. Part of the reason I left that comment in the first place was because I thought “it’s Turner, if he’s actually motivatedly cognitating here he’ll notice once it’s pointed out”. (And, corollary: since you have the skill to notice when you are motivatedly cognitating, I believe you if you say you aren’t. For most people, I do not consider their claims about motivatedness of their own cognition to be much evidence one way or the other.) I do have a fairly high opinion of your skills in that department.
      For the record: I welcome well-meaning private feedback on what I might be biased about or messing up.
      Fair point, that part of my comment probably should have been private. Mea culpa for that.
  - TurnTrout 7 Oct 2022 19:24 UTC
    LW: 4 AF: 4
    This doesn’t seem dangerous to me. So the agent values both, and there was an event which differentially strengthened the looks-like-diamond shard (assuming the agent could tell the difference at a visual remove, during training), but there are lots of other reward events, many of which won’t really involve that shard (like video games where the agent collects diamonds, or text rpgs where the agent quests for lots of diamonds). (I’m not adding these now, I was imagining this kind of curriculum before, to be clear—see the “game” shard.)
    So maybe there’s a shard with predicates like “would be sensory-perceived by naive people to be a diamond” that gets hit by all of these, but I expect that shard to be relatively gradient starved and relatively complex in the requisite way → not a very substantial update. Not sure why that’s a big problem.
    But I’ll think more and see if I can’t salvage your argument in some form.
    some moderately uncharitable psychologizing
    I found this annoying.