It sounds to me like you parsed my statement “One obvious takeaway here is that I should give my list of warnings-about-working-with-me to anyone who asks to discuss their alignment ideas with me, rather than just researchers I’m starting a collaboration with.” as me saying something like “I hereby adopt the solemn responsibility of warning people in advance, in all cases”, whereas I was interpreting it as more like “here’s a next thing to try!”.
I agree it would have been better of me to give direct bulldozing-warnings explicitly to Vivek’s hires.
Here is the statement:
(One obvious takeaway here is that I should give my list of warnings-about-working-with-me to anyone who asks to discuss their alignment ideas with me, rather than just researchers I’m starting a collaboration with. Obvious in hindsight; sorry for not doing that in your case.)
I agree that this statement does not explicitly say whether you would make this a one-time change or a permanent one. However, the tone and phrasing (“Obvious in hindsight; sorry for not doing that in your case”) suggested that you had learned from the experience and were likely to apply this lesson going forward. The use of the word “obvious” (twice) indicates to me that you believed that warnings are a clear improvement.
Ultimately, Nate, you wrote it. But I read it, and I don’t really see the “one-time experiment” interpretation. It just doesn’t make sense to me that it was “obvious in hindsight” that you should… adopt this “next thing to try”..?
I did not intend it as a one-time experiment.
In the above, I did not intend “here’s a next thing to try!” to be read like “here’s my next one-time experiment!”, but rather like “here’s a thing to add to my list of plausible ways to avoid this error-mode in the future, as is a virtuous thing to attempt!” (by contrast with “I hereby adopt this as a solemn responsibility”, as I hypothesize you interpreted me instead).
Dumping recollections, on the model that you want more data here:
I intended it as a general thing to try going forward, in a “seems like a sensible thing to do” sort of way (rather than in a “adopting an obligation to ensure it definitely gets done” sort of way).
After sending the email, I visualized people reaching out to me and asking if I wanted to chat about alignment (as you had, and as feels like a recognizable Event in my mind), and visualized being like “sure but FYI if we’re gonna do the alignment chat then maybe read these notes first”, and ran through that in my head a few times, as is my method for adopting such triggers.
I then also wrote down a task to expand my old “flaws list” (which was a collection of handles that I used as a memory-aid for having the “ways this could suck” chat, which I had, to that point, been having only verbally) into a written document, which eventually became the communication handbook (there were other contributing factors to that process also).
An older and different trigger (of “you’re hiring someone to work with directly on alignment”) proceeded to fire when I hired Vivek (if memory serves), and (if memory serves) I went verbally through my flaws list.
Neither the new nor the old triggers fired in the case of Vivek hiring employees, as discussed elsewhere.
Thomas Kwa heard from a friend that I was drafting a handbook (chat logs say this occurred on Nov 30); it was still in a form I wasn’t terribly pleased with, so I said the friend could share a redacted version that contained the parts that I was happier with and that felt more relevant.
Around Jan 8, in an unrelated situation, I found myself in a series of conversations where I sent around the handbook and made use of it. I pushed it closer to completion on Jan 8-10 (according to the Google Doc’s history).
The results of that series of interactions, and of Vivek’s team’s (lack of) use of the handbook, caused me to update away from this method being all that helpful. In particular: nobody at any point invoked one of the affordances or asked for one of the alternative conversation modes. Those sorts of things did seem to help when I personally managed to notice building frustration and personally suggested that we switch modes (although lying on the ground, a friend’s suggestion, turned out to work better for others than switching to other conversation modes). This caused me to downgrade (in my head) the importance of ensuring that people had access to those resources.
I think that at some point around then I shared the fuller guide with Vivek’s team, but I didn’t quickly determine when from the chat logs. Sometime between Nov 30 and Feb 22, presumably.
It looks from my chat logs like I then finished the draft around Feb 22 (where I have a timestamp from me noting as much to a friend). I probably put it publicly on my website sometime around then (though I couldn’t easily find a timestamp), and shared it with Vivek’s team (if I hadn’t already).
The next two MIRI hires both mentioned to me that they’d read my communication handbook (and I did not anticipate spending a bunch of time with them, never mind on technical research), so neither of them triggered my “warn them” events and (for better or worse) I had them mentally filed away as “has seen the affordances list and the failure modes section”.
I appreciate the detail, thanks. In particular, I had wrongly assumed that the handbook had been written much earlier, such that even Vivek could have been shown it before deciding to work with you. This also makes more sense of your comments that “writing the handbook” was indicative of effort on your part, since our July interaction.
Overall, I retain my very serious concerns, which I will clarify in another comment, but am more in agreement with claims like “Nate has put in effort of some kind since the July chat.”
The next two MIRI hires both mentioned to me that they’d read my communication handbook
Noting that at least one of them read the handbook because I warned them and told them to go ask around about interacting with you, to make sure they knew what they were getting into.
I warned the immediately-next person.