The model was something like: Nate and Eliezer have a mindset that’s good for both capabilities and alignment, and so if we talked to other alignment researchers about our work, the mindset would diffuse into the alignment community, and thence to OpenAI, where it would speed up capabilities. I think we didn’t have enough evidence to believe this, and should have shared more.
What evidence were you working off of? This is an extraordinary thing to believe.
First, I should note that Nate is the one who most believed this; not sharing ideas that came from Nate was a precondition of working with him. [edit: this wasn’t demanded by Nate except in a couple of cases, but in practice we preferred to get Nate’s input because his models were different from ours.]
With that out of the way, it doesn’t seem super implausible to us that the mindset is useful, given that MIRI had previously invented out-of-the-box things like logical induction and logical decision theory, and that many of us feel like we learned a lot over the past year. On an inside view I have much better ideas than I did a year ago, although it’s unclear how much to attribute to the Nate mindset. It’s very unclear what reference class to use for estimating this; I’d maybe guess at the base rate of mindsets transferring from niche fields to large fields and adjust from there. We spent way too much time discussing how to estimate it. Let’s say there’s a 15%* chance of usefulness.
As for whether the mindset would diffuse conditional on it being useful, this seems pretty plausible, maybe 15% if we’re careful and 50% if we talk to lots of people but don’t publish? Scientific fields are pretty good at spreading useful ideas.
So I think the whole proposition is unlikely but not “extraordinary”: multiplying through (15% useful × 15-50% diffusion) gives maybe 2.5-7.5%. Previously I was more confident in some of the methods, so I would have given 45% for usefulness, making p(danger) 7-22.5%. The scenario we were worried about was one where our team had a low probability of big success (e.g. due to inexperience), but sharing ideas would cause a fixed 15% chance of big capabilities externalities regardless of success. The project could easily become negative in expected value (-EV) this way.
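To make the arithmetic explicit, here is a minimal sketch in Python. The probabilities mirror the rough figures above; the success probability, success value, and externality cost in the second half are hypothetical stand-ins chosen purely to illustrate how such a project can flip to negative expected value, not figures from the original discussion.

```python
# Rough sketch of the expected-value reasoning above.
# The probabilities mirror the comment's rough figures; the payoff
# magnitudes are purely illustrative stand-ins.

p_useful = 0.15           # chance the mindset is genuinely useful for capabilities
p_diffuse_careful = 0.15  # chance it diffuses if we're careful
p_diffuse_loose = 0.50    # chance it diffuses if we talk widely but don't publish

# p(danger) = p(useful) * p(diffusion | useful)
print(p_useful * p_diffuse_careful)  # ~0.0225 (careful end of the range)
print(p_useful * p_diffuse_loose)    # ~0.075  (looser end of the range)

# Why a low-success project can go negative in expectation
# (all values below are hypothetical, in arbitrary units):
p_big_success = 0.05    # low chance the research pans out, e.g. due to inexperience
value_success = 100     # value of a big alignment success
p_externality = 0.15    # fixed chance of big capabilities externalities from sharing
cost_externality = 60   # cost of that externality

ev = p_big_success * value_success - p_externality * cost_externality
print(ev)  # negative here (~-4), despite a real upside
```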
Some other thoughts:
Nate trusts his inside view more than any of our attempts to construct an argument legible to me, which I think distorted our attempts to discuss this.
It’s hard to tell 10% chances from 1% chances for propositions like this, which is also one of the problems with working on long-shot, hopefully high-EV projects.
Part of why I wanted the project to become less private as it went on is that we generated actual research directions, and would only have to share object-level content to get feedback on our ideas.
* Every number in this comment is very rough
Interesting, thanks!
This isn’t quite how I’d frame the question.
[edit: My understanding is that] Eliezer and Nate believe this. I think it’s quite reasonable for other people to be skeptical of it.
Nate and Eliezer can choose to only work closely/mentor people who opt into some kind of confidentiality clause about it. People who are skeptical or don’t think it’s worth the costs can choose not to opt into it.
I have heard a few people talk about MIRI confidentiality norms being harmful to them in various ways, so I do also think it’s quite reasonable for people to be more cautious about opting into working with Nate or Eliezer if they don’t think it’s worth the cost.
Presumably, Nate/Eliezer aren’t willing to talk much about this precisely because they think it’d leak capabilities. You might think they’re wrong, or that they haven’t justified that, but the people who have a stake in this are the people who are deciding whether to work with them. (I think there’s also a question of “should Eliezer/Nate have a reputation as people who have a mindset that’s good for alignment and capabilities that’d be bad to leak?”, and I’d say the answer should be “not any more so than what you can detect from their public writings, and from your personal chains of trust with people who have worked closely with them.”)
I do think this leaves some problems. I have heard about the MIRI confidentiality norms being fairly paralyzing for some people in important ways. But something about Muireall’s comment felt like the wrong frame to me.
(I am pretty uncomfortable with all the “Nate / Eliezer” going on here. Let’s at least let people’s misunderstandings of me be limited to me personally, and not bleed over into Eliezer!)
(In terms of the allegedly extraordinary belief, I recommend keeping in mind jimrandomh’s note on Fork Hazards. I have probability mass on the hypothesis that I have ideas that could speed up capabilities if I put my mind to it, which is a very different state of affairs from being confident that any of my ideas works. Most ideas don’t work!)
(Separately, the infosharing agreement that I set up with Vivek—as was perhaps not successfully relayed to the rest of the team, though I tried to express this to the whole team on various occasions—was one where they owe their privacy obligations to Vivek and his own best judgements, not to me.)
That’s useful additional information, thanks.
I made a slight edit to my previous comment to make my epistemic state more clear.
Fwiw, I feel like I have a pretty crisp sense of “Nate’s and Eliezer’s communication styles are actually pretty different” (I noticed myself writing out a similar comment about communication styles under the Turntrout thread that initially said “Nate and Eliezer” a lot, and then decided that comment didn’t make sense to publish as-is), but I don’t actually have much of a sense of the difference between Nate, Eliezer, and MIRI-as-a-whole with regards to “the mindset” and “confidentiality norms”.
Sure. I only meant to use Thomas’s frame, where it sounds like Thomas did originally accept Nate’s model on some evidence, but now feels it wasn’t enough evidence. What was originally persuasive enough to opt in? I haven’t followed all Nate’s or Eliezer’s public writing, so I’d be plenty interested in an answer that draws only from what someone can detect from their public writing. I don’t mean to demand evidence from behind the confidentiality screen, even if that’s the main kind of evidence that exists.
Separately, I am skeptical and a little confused as to what this could even look like, but that’s not what I meant to express in my comment.