Dario/Anthropic-leadership are at least reasonably earnestly trying to do good things within their worldview
I think as stated this is probably true of the large majority of people, including e.g. the large majority of the most historically harmful people. “Worldviews” sometimes reflect underlying beliefs that lead people to choose actions, but they can of course also be formed post-hoc, to justify whatever choices they wished to make.
In some cases, one can gain evidence about which sort of “worldview” a person has, e.g. by checking it for coherence. But this isn’t really possible to do with Dario’s views on alignment, since to my knowledge, excepting the Concrete Problems paper, he has never actually written anything about the alignment problem.[1] Given this, I think it’s reasonable to guess that he does not have a coherent set of views which he’s neglected to mention, so much as the more human-typical “set of post-hoc justifications.”
(In contrast, he discusses misuse regularly, and ~invariably changes the subject from alignment to misuse in interviews, in a way which does strike me as reflecting some non-trivial cognition.)
[1] Counterexamples welcome! I’ve searched a good bit and could not find anything, but it’s possible I missed something.
I think I… agree denotationally and (lean towards) disagree connotationally? (like, it seems like this is implying “and because he doesn’t seem like he obviously has coherent views on alignment-in-particular, it’s not worth arguing the object level?”)
(to be clear, I don’t super expect this post to affect Dario’s decision-making models, esp. directly. I do have at least some hope for Anthropic employees to engage with these sorts of models/arguments, and my sense from talking to them is that a lot of the LW-flavored arguments have often missed their cruxes)
No, I agree it’s worth arguing the object level. I just disagree that Dario seems to be “reasonably earnestly trying to do good things,” and I think this object-level consideration seems relevant (e.g., insofar as you take Anthropic’s safety strategy to rely on the good judgement of their staff).
I think (moderately likely, though not super confident) it makes more sense to model Dario as:
“a person who actually is quite worried about misuse, and is making significant strategic decisions around that (and doesn’t believe alignment is that hard)”
than as “a generic CEO who’s just generally following incentives and spinning narrative post-hoc rationalizations.”
Yeah, I buy that he cares about misuse. But I wouldn’t quite use the word “believe,” personally, about his acting as though alignment is easy—I think if he had actual models or arguments suggesting that, he probably would have mentioned them by now.
I don’t particularly disagree with the first half, but your second sentence isn’t really a crux for me on the first part.