So the place that my brain reports it gets its own confidence from is from having done exercises that amount to self-play in the game I mentioned in a thread a little while back, which gives me a variety of intuitions about the rows in your table (where I’m like “doing science well requires CIS-ish stuff” and “the sort of corrigibility you learn in training doesn’t generalize how we want, b/c of the interactions w/ the CIS-ish stuff”)
(that plus the way that people who hope the game goes the other way seem to generally be arguing not from the ability to exhibit playthroughs that go some other way, but instead to be arguing from ignorance / “we just don’t know”)
I like this answer a lot; as someone whose intuition on this matter has never really produced conclusions as sharp/confident as Nate’s, I think this is the first time I’ve seen Nate’s models queried concretely enough to produce a clear story as to where his intuitions are coming from. (Kudos to the both of you for managing to turn some of these intuitions into words!)
Of course, I also agree with some of the things Holden says in his list of disagreements, which I personally would characterize as “just because I now have a sense of the story you’re telling yourself, doesn’t mean I trust that story without having stepped through it myself.” That’s pretty inevitable, and I think represents a large chunk of the remaining disagreement. My sense is that Nate views his attempts at self-play in his “game” as a form of “empirical” evidence, which is a somewhat unusual way of slicing things, but not obviously mistaken from where I currently stand: even if I don’t necessarily buy that Nate has stepped through things correctly, I can totally buy that playing around with toy examples is a great way to sharpen and refine intuitions (as has historically been the case in e.g. physics—Einstein being a very popular example).
This being the case, I think that one of the biggest blockers to Nate/MIRI’s views spreading more broadly (assuming that their views broadly are correct), in the absence of the empirical indicators Holden outlined in the last section, is for someone to just sit down and… do the exercise. (I could imagine a world where someone sat down with Nate/Eliezer, played the game in question, and left feeling like they hadn’t learned much—because e.g. Nate/Eliezer spent most of the time playing word games—but that isn’t my modal world, which is much more one in which their interlocutor leaves feeling like they’ve learned something, even if they haven’t fully updated towards complete endorsement of the MIRI-pilled view.)