Fair enough! Okay: how do your views on timelines and takeoffs compare to the average OpenAI employee’s? Is there an active internal debate going on at OpenAI about takeoff speeds and timelines? If so, has that debate been shifting over time, and if so, how? When Sam says it will be a “very slow takeoff” does that strike you as ridiculous/misguided/plausible? Why do you think your opinion and Sam’s opinion diverge so strongly? And in the end, do you think opinions such as Sam’s (as the leader of OpenAI) even matter? Or do you think things like how fast this all goes are largely out of individuals’ hands at this point?
I don’t know enough to say, and even if I did I probably wouldn’t want to speak on behalf of everyone else without getting approval first. I haven’t talked about it with higher-ups yet. I’d love to get into the weeds with them and work through e.g. the takeoffspeeds.com model and see if there’s anything we disagree about and drill down to cruxes. On vibes it seems like we probably disagree big time but I’m not completely sure.
I think Sam says he used to be much more fast-takeoff-y than he is now? Not sure.
As for what I consider implausible, I’d say: start with takeoffspeeds.com and fiddle around with all the settings there. If it’s basically impossible to torture the model into giving curves that look similar to your view, your view is implausible and the burden of proof is on you to give a different model that yields the curves you expect.
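If it helps, here’s the flavor of exercise I have in mind. This is emphatically not the takeoffspeeds.com model itself, just a toy feedback loop with made-up parameters, meant to illustrate what “fiddle with the settings and see what curves come out” amounts to:

```python
# Toy illustration only -- NOT the takeoffspeeds.com model. It just captures the
# kind of feedback loop such models formalize: AI automates a growing share of
# AI R&D, which accelerates further progress. Every parameter here is made up.
import math

def toy_takeoff(years=30, dt=0.1,
                base_growth=0.4,      # research progress per year before automation
                automation_gain=3.0,  # maximum speed-up from full automation
                midpoint=10.0,        # capability level at which half of R&D is automated
                steepness=0.5):
    capability, t, trajectory = 1.0, 0.0, []
    while t < years:
        automated = 1 / (1 + math.exp(-steepness * (capability - midpoint)))
        growth = base_growth * (1 + automation_gain * automated)
        capability *= 1 + growth * dt
        t += dt
        trajectory.append((round(t, 1), round(capability, 2)))
    return trajectory

# Fiddling with automation_gain, midpoint, or steepness is the analogue of moving
# the sliders: a fast-takeoff view needs the curve to go nearly vertical once
# automation kicks in, while a slow-takeoff view needs it to stay smooth throughout.
for t, c in toy_takeoff()[::30]:
    print(t, c)
```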
Thank you for the further answers!
I’m broadly persuaded by your timelines, especially as they’ve been playing out so far, and I think my own parameters would probably land somewhere between the “aggressive” and “best guess” settings, closer to the aggressive end.
So I suppose a better question might be, how do you feel your own views are received within OpenAI? And given your own prediction of faster timelines and takeoffs, what do you view as your most critical work at OpenAI during this crunch time?
I haven’t exactly been going around preaching my views, though I haven’t been hiding them either. I hope the research I do will feed up the chain into better decisions at the top. Most of my time is spent working on dangerous capability evals (similar to what ARC did) these days. I’m actually tentatively optimistic that the plan they feed into (society coordinates to not build AIs with sufficiently dangerous capabilities, until more alignment & security work has been done) might actually succeed. Overall my p(doom) is at something like 80%, because I think the situation humanity is in is pretty dire, but I can at least see a speck of hope, a shoreline to swim towards so to speak.
Thanks again! I think your perspective is super valuable.
I agree that the ARC eval, and further ones like it, are one of my biggest sources of hope that we’ll get a clear sign to truly pause capabilities work before it’s potentially too late. So thank you as well for doing that work!
I have this cognitive dissonance where I feel like my p(doom) should be more in line with yours, but for some reason I have some weird faith in us to pull through here somehow. Not perfectly, not without damage, but somehow... I don’t know exactly why, though. Maybe the societal response over the last week or two gives me hope that all of this will be taken very seriously?
I think the reason I don’t align with Daniel Kokotajlo’s p(doom) comes down to two things:
1. I suspect alignment, while not easy, is solvable, and I’m very impressed by the progress on core alignment problems to date. In particular, Pretraining from Human Feedback did very well on solving certain problems of alignment.
2. I have non-trivial probability mass on the outcome where we have very little dignity and little to no progress on alignment by the time superhuman AI systems show up, and yet our civilization survives. In essence, I believe there’s a non-trivial probability of success with superhuman AI without dignity in the alignment problem.
Combine these two and my max p(doom) is 10%.
Nice! Thanks for sharing.
Seems like 2 is our crux; I definitely agree that alignment is solvable & that there has been some substantial progress on core alignment problems so far. (Though I really wouldn’t classify pretraining from human feedback as substantial progress on core problems—I’d be interested to hear more about this from you, wanna explain what core problem it’s helping solve?)
For our main crux, if you want to discuss now (fair if you don’t) I’d be interested to hear a concrete story (or sketch of a story) that you think is plausible in which we survive despite little dignity and basically no alignment progress before superhuman AI systems come up.
The major problem it solves is outer alignment, which is essentially the question of which goals we should give our AI; in particular, as you get more data, the AI gets more aligned.
This is crucial for aligning superhuman AI.
The PII task also touches on something very important for AI safety: can we prevent instrumentally convergent behavior, like information gathering, when it conflicts with human values? The answer there is a tentative yes, given that the AI takes less personally identifiable information as it is trained on more data.
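To make that a bit more concrete, my understanding is that the paper’s strongest technique is conditional training: score pretraining segments with a reward function (e.g. a PII classifier) and prepend a control token, then sample from the model conditioned on the “good” token. Here is a minimal sketch; the scorer, token names, and threshold are my own illustrative stand-ins rather than the paper’s exact setup:

```python
# Rough sketch of the conditional-training idea behind Pretraining from Human
# Feedback. The reward function, control tokens, and threshold below are toy
# stand-ins, not the paper's actual implementation.
from typing import Callable, List

def tag_segments(segments: List[str],
                 reward: Callable[[str], float],
                 threshold: float = 0.0,
                 good: str = "<|good|>",
                 bad: str = "<|bad|>") -> List[str]:
    """Prepend a control token to each pretraining segment based on its reward."""
    return [(good if reward(seg) >= threshold else bad) + seg for seg in segments]

def pii_reward(segment: str) -> float:
    """Toy stand-in for a PII detector: penalize anything that looks like an email."""
    return -1.0 if "@" in segment else 1.0

corpus = [
    "Contact me at jane.doe@example.com for the draft.",
    "The experiment finished after three hours of training.",
]
tagged = tag_segments(corpus, pii_reward)
print(tagged)
# Train the language model on `tagged` as usual; at sampling time, condition on
# "<|good|>" so the model imitates the low-PII part of the distribution. The
# claim above is that this kind of objective keeps PII leakage low (and
# decreasing) as the amount of training data scales up.
```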
A major part of this comes from Holden Karnofsky’s “success without dignity” post, which I’ll link below. The core of that story is that we may have a dumpster fire on our hands with AI safety, but that doesn’t mean we can’t succeed. It’s possible that the AI alignment problem is really easy, such that some method just works; conditional on alignment being easy, even though the world gets quite a bit more dangerous due to technology, a world with a large proportion of aligned AIs and a small proportion of misaligned AIs is probably one where humanity endures. It will be weird and dangerous, but probably not existential.
The story is linked below:
https://www.lesswrong.com/posts/jwhcXmigv2LTrbBiB/success-without-dignity-a-nearcasting-story-of-avoiding#comments
Thanks! Huh, I really don’t see how this solves outer alignment. I thought outer alignment was already mostly solved by some combination of imitation + amplification + focusing on intermediate narrow tasks like designing brain scanners… but yeah I guess I should go read the paper more carefully?
I don’t see how pretraining from human feedback has the property “as you get more data, the AI gets more aligned.” Why doesn’t it suffer from all the classic problems of inner alignment / deception / playing the training game / etc.?
I love Holden’s post but I give something like 10% credence that we get success via that route, rather than the 80% or so that it seems like you give, in order to get a mere 10% p(doom). The main crux for me here is timelines; I think things are going to happen too quickly for us to implement all the things Holden talks about in that post. Even though he describes them as pretty basic. Note that there are a ton of even more basic safety things that current AI labs aren’t implementing because they don’t think the problem is serious & they are under competitive pressure to race.
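To spell out the arithmetic behind those numbers, here’s a minimal back-of-the-envelope sketch; the decomposition and the exact inputs are illustrative assumptions on my part, not credences anyone stated beyond the figures above:

```python
# Back-of-the-envelope only; the decomposition and inputs are illustrative
# assumptions, apart from the ~10% / ~80% / p(doom) figures discussed above.

def p_doom(p_aligned_in_time: float, p_success_without_dignity: float) -> float:
    """p(doom) if survival requires either solving alignment in time, or else
    muddling through via the 'success without dignity' route."""
    p_survive = p_aligned_in_time + (1 - p_aligned_in_time) * p_success_without_dignity
    return 1 - p_survive

# With low credence in both routes (inputs are illustrative), p(doom) lands near 80%:
print(round(p_doom(0.10, 0.10), 2))   # 0.81

# To get a "mere 10%" p(doom) without much credence in alignment being solved in
# time, the dignity-free route has to carry something like 80-90% on its own:
print(round(p_doom(0.10, 0.89), 2))   # 0.1
```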