TurnTrout comments on A shot at the diamond-alignment problem

TurnTrout 15 Oct 2022 23:02 UTC
LW: 6 AF: 6
(Also, some moderately uncharitable psychologizing, and I apologize if it’s wrong: I find it suspicious that the examples of label errors you generated are both of the non-dangerous type. This is a place where I’d expect you to already have some intuition for what kind of errors are the dangerous ones, especially when you put on e.g. your Eliezer hat. That smells like a motivated search, or at least a failure to actually try to look for the problems with your argument.)
I want to talk about several points related to this topic. I don’t mean to claim that you were making points directly related to all of the below bullet points. This just seems like a good time to look back and assess and see what’s going on for me internally, here. This seems like the obvious spot to leave the analysis.
- At the time of writing, I wasn’t particularly worried about the errors you brought up.
  - I am a little more worried now in expectation, both under the currently low-credence worlds where I end up agreeing with your exponential argument, and in the ~linear hypothesis worlds, since I think I can still search harder for worrying examples which IMO neither of us have yet proposed. Therefore I’ll just get a little more pessimistic immediately, in the latter case.
- If I had been way more worried about “reward behavior we should have penalized”, I would have indeed just been less likely to raise the more worrying failure points, but not super less likely. I do assess myself as flawed, here, but not as that flawed.
  - I think the typical outcome would be something like “TurnTrout starts typing a list full of weak flaws, notices a twinge of motivated reasoning, has half a minute of internal struggle and then types out the more worrisome errors, and, after a little more internal conflict, says that John has a good point and that he wants to think about it more.”
  - I could definitely buy that I wouldn’t be that virtuous, though, and that I would need a bit of external nudging to consider the errors, or else a few more days on my own for the issue to get raised to cognitive-housekeeping. After that happened a few times, I’d notice the overall problem and come up with a plan to fix it.
  - Obviously, I have at this point noticed (at least) my counterfactual mistake in the nearby world where I already agreed with you, and therefore have a plan to fix and remove that flaw.
- I think you are right in guessing that I could use more outer/inner heuristics to my advantage, that I am missing a few tools on my belt. Thanks for pointing that out.
- I don’t think that motivated cognition has caused me to catastrophically miss key considerations from e.g. “standard arguments” in a way which has predictably doomed key parts of my reasoning.
  - Why I think this: I’ve spent a little while thinking about what the catastrophic error would be, conditional on it existing, and nothing’s coming up for the moment.
    I’d more expect there to be some sequence of slight ways I ignored important clues that other people gave, and where I motivatedly underupdated. But also this is a pretty general failure mode, and I think it’d be pretty silly to call a halt without any positive internal evidence that I actually have done this. (EDIT: In a specific situation which I remember and can correct, as opposed to having a vague sense that yeah I’ve probably done this several times in the last few months. I’ll just keep an eye out.)
  - Rather, I think that if I spend three or so days typing up a document, and someone like John Wentworth thinks carefully about it, then that person will surface at least a few considerations I’d missed, more probably using tools not native to my current frame.
    I think a lot of the “Why didn’t you realize the ‘reward for proxy, get an agent which cares about the proxy’ failure mode?” part is that John and I seem to have very different models of SGD dynamics, and that if I had his model, the reasoning which produced the post would have also produced the failure modes John has hypothesized.
    This feels “fine” in that that’s part of the point of sharing my ideas with other people—that smart people will surface new considerations or arguments. This feels “not fine” in the sense that I’d like to not miss considerations, of course.
    This also feels “fine” in that, yes, I wanted to get this essay out before never arrives, and usually I take too long to hit “publish”, and I’m still very happy with the essay overall. I’m fine with other people finding new considerations (e.g. the direct reward for diamond synthesis, or zooming in on how much perfect labelling is required).
  - If it turns out there was some crucial existing argument which I did miss, I think I’ll go “huh” but not really be like “wow that hovered at the edge of my cognition but I denied it for motivated reasons.”
- I am way more worried about how much of my daily cognition is still socially motivated, and I do consider that to be a “stop drop and roll”-level fuckup on my part.
  - I think there’s not just now-obvious things here like “I get very defensive in public settings in specific situations”, but a range of situations in which I subconsciously aim to persuade or justify my positions, instead of just explaining what I think and why, what I disagree with and why; that some subconscious parts of me look for ways to look good or win an argument; that I have rather low trust in certain ways and that makes it hard for me sometimes; etc.
  - I think that I am above-average here, but I have very high standards for myself and consider my current skill in this area to be very inadequate.
- For the record: I welcome well-meaning private feedback on what I might be biased about or messing up. On the other hand, having the feedback be public just pushes some of my buttons in a way which makes the situation hard for me to handle. I aspire for this not to be the case about me. That aspiration is not yet realized.
- I’ve worked hard to make this analysis honest and not optimized to make me look good or less silly. Probably I’ve still failed at least a little. Possibly I’ve missed something important. But this is what I’ve got.
- johnswentworth 16 Oct 2022 2:44 UTC
  LW: 4 AF: 4
  Kudos for writing all that out. Part of the reason I left that comment in the first place was because I thought “it’s Turner, if he’s actually motivatedly cognitating here he’ll notice once it’s pointed out”. (And, corollary: since you have the skill to notice when you are motivatedly cognitating, I believe you if you say you aren’t. For most people, I do not consider their claims about motivatedness of their own cognition to be much evidence one way or the other.) I do have a fairly high opinion of your skills in that department.
  For the record: I welcome well-meaning private feedback on what I might be biased about or messing up.
  Fair point, that part of my comment probably should have been private. Mea culpa for that.