It feels to me like lots of alignment folk ~only make negative updates. For example, “Bing Chat is evidence of misalignment”, but also “ChatGPT is not evidence of alignment.” (I don’t know that there is in fact a single person who believes both, but my straw-models of a few people believe both.)
For what it’s worth, as one of the people who believes “ChatGPT is not evidence of alignment-of-the-type-that-matters”, I don’t believe “Bing Chat is evidence of misalignment-of-the-type-that-matters”.
I believe the alignment of the outward behavior of simulacra is only very tenuously related to the alignment of the underlying AI, so both things provide ~no data on that (in a similar way to how our ability or inability to control the weather is entirely unrelated to alignment).
(I at least believe the latter but not the former. I know a few people who updated downwards on the societal response because of Bing Chat, because if a system looks that legibly scary and we still just YOLO it, then that means there is little hope of companies being responsible here, but none because they thought it was evidence of alignment being hard, I think?)
I dunno, my p(doom) over time looks pretty much like a random walk to me: 60% mid 2020, down to 50% in early 2022, 85% mid 2022, down to 80% in early 2023, down to 65% now.
Psst, look at the calibration on this guy
I did not update towards misalignment at all on Bing Chat. I also do not think ChatGPT is (strong) evidence of alignment. I generally think anyone who already takes alignment seriously at all should not update on Bing Chat, except perhaps in the department of “do things like Bing Chat, which do not actually provide evidence for misalignment, cause shifts in public opinion?”
I think a lot of alignment folk have made positive updates in response to the societal response to AI xrisk.
This is probably different than what you’re pointing at (like maybe your claim is more like “Lots of alignment folks only make negative updates when responding to technical AI developments” or something like that).
That said, I don’t think the examples you give are especially compelling. I think the following position is quite reasonable (and I think fairly common):
Bing Chat provides evidence that some frontier AI companies will fail at alignment even on relatively “easy” problems that we know how to solve with existing techniques. Also, as Habryka mentioned, it’s evidence that the underlying competitive pressures will make some companies “YOLO” and take excessive risk. This doesn’t affect the absolute difficulty of alignment, but it affects the probability that Earth will actually align AGI.
ChatGPT provides evidence that we can steer the behavior of current large language models. People who predicted that it would be hard to align large language models should update. IMO, many people seem to have made mild updates here, but not strong ones, because they (IMO correctly) claim that their threat models never had strong predictions about the kinds of systems we’re currently seeing and instead predicted that we wouldn’t see major alignment problems until we get smarter systems (e.g., systems with situational awareness and more coherent goals).
(My “Alex sim” (which is not particularly strong) says that maybe these people are just post-hoc rationalizing: if you had asked them in 2015 how likely we would be to be able to control modern LLMs, they would’ve been (a) wrong and (b) wrong in an important way, since their model of how hard it would be to control modern LLMs is deeply interconnected with their model of why it would be hard to control AGI/superintelligence. Personally, I’m pretty sympathetic to the point that many models of why alignment of AGI/superintelligence would be hard seem relatively disconnected from any predictions about modern LLMs, such that only “small/mild” updates seem appropriate for people who hold those models.)
For the record, I updated on ChatGPT. I think that the classic example of imagining telling an AI to get a coffee and it pushes a kid out of the way isn’t so much of a concern any more. So the remaining concerns seem to be inner alignment + outer alignment far outside normal human experience + value lock-in.
I’ve noticed that for many people (including myself), their subjective P(doom) stays surprisingly constant over time. And I’ve wondered if there’s something like “conservation of subjective P(doom)”—if you become more optimistic about some part of AI going better, then you tend to become more pessimistic about some other part, such that your P(doom) stays constant. I’m like 50% confident that I myself do something like this.
(ETA: Of course, there are good reasons subjective P(doom) might remain constant, e.g. if most of your uncertainty is about the difficulty of the underlying alignment problem and you don’t think we’ve been learning much about that.)
(Updating a bit because of these responses—thanks, everyone, for responding! I still believe the first sentence, albeit a tad less strongly.)
A lot of the people around me (e.g. who I speak to ~weekly) seem to be responsive both to news and to new insights, adapting both their priorities and their level of optimism[1]. I think you’re right about some people. I don’t know what ‘lots of alignment folk’ means, and I’ve not considered the topic of other-people’s-update-rates-and-biases much.
For me, most changes route via governance.
I have made mainly very positive updates on governance in the last ~year, in part from public things and in part from private interactions.
I’ve also made negative (evidential) updates based on the recent OpenAI kerfuffle (more weak evidence that Sam+OpenAI is misaligned; more evidence that org oversight doesn’t work well), though I think the causal fallout remains TBC.
Seemingly-mindkilled discourse on East-West competition provided me some negative updates, but recent signs of life from govts at e.g. the UK Safety Summit have undone those for now, maybe even going the other way.
I’ve adapted my own priorities in light of all of these (and I think this adaptation is much more important than what my P(doom) does).
Beyond their second-order impact on the Overton window etc., I have made very few object-level updates based on public research and deployments since 2020. Nothing has been especially surprising.
From deeper study and personal insights, I’ve made some negative updates based on a better appreciation of multi-agent challenges since 2021 when I started to think they were neglected.
I could say other stuff about personal research/insights but they mainly change what I do/prioritise/say, not how pessimistic I am.
I’ve often thought that P(doom) is basically a distraction, and that what matters is how news and insights affect your priorities. Nevertheless, I presumably have a (revealed) P(doom) at some level of resolution.