I haven’t exactly been going around preaching my views, though I haven’t been hiding them either. I hope the research I do will feed up the chain into better decisions at the top. These days, most of my time is spent working on dangerous capability evals (similar to what ARC did). I’m tentatively optimistic that the plan they feed into (society coordinates to not build AIs with sufficiently dangerous capabilities until more alignment & security work has been done) might actually succeed. Overall my p(doom) is at something like 80%, because I think the situation humanity is in is pretty dire, but I can at least see a speck of hope, a shoreline to swim towards, so to speak.
Thanks again! I think your perspective is super valuable.
I agree that the ARC evals, and further work like them, are one of my biggest sources of hope that we’ll get a clear sign to truly pause capabilities work before it’s potentially too late. So thank you, in addition, for doing that work!
I have this cognitive dissonance where I feel like my p(doom) should be more in line with yours, but for some reason I have some weird faith in us to pull through here somehow. Not perfectly, not without damage, but somehow... I don’t know why, though, exactly. Maybe the societal response over the last week or two gives me hope that all of this will be taken very seriously?
I think the reason I don’t align with Daniel Kokotajlo’s p(doom) comes down to 2 things:
1. I suspect that alignment, while not easy, is solvable, and I’m very impressed by the progress on core alignment problems to date. In particular, Pretraining from Human Feedback did very well on certain alignment problems.
2. I have non-trivial probability mass on the outcome where we have very little dignity and little to no progress on alignment by the time superhuman AI systems come up, and yet our civilization survives. In essence, I believe there’s a non-trivial probability of success with superhuman AI even without dignity on the alignment problem.
Combine these two and my max p(doom) is 10%.
Nice! Thanks for sharing.
Seems like 2 is our crux; I definitely agree that alignment is solvable & that there has been some substantial progress on core alignment problems so far. (Though I really wouldn’t classify pretraining from human feedback as substantial progress on core problems—I’d be interested to hear more about this from you, wanna explain what core problem it’s helping solve?)
For our main crux, if you want to discuss now (fair if you don’t) I’d be interested to hear a concrete story (or sketch of a story) that you think is plausible in which we survive despite little dignity and basically no alignment progress before superhuman AI systems come up.
The major problem that was solved is outer alignment, which is essentially the question of which goals we should pick for our AI. In particular, as you get more data, the AI gets more aligned.
This is crucial for aligning superhuman AI.
The PII task also touches on something very important for AI safety: can we prevent instrumentally convergent goals if they are aligned with human values? The answer is a tentative yes, given that the AI takes less personally identifiable information as it scales with more data.
A major part of this comes from Holden Karnofsky’s “Success without dignity” post, which I’ll link below. The core of that story is that we may have a dumpster fire on our hands with AI safety, but that doesn’t mean we can’t succeed. It’s possible that the AI alignment problem is really easy, such that some method just works. Conditioning on alignment being easy, the world still gets quite a bit more dangerous due to technology, but a large proportion of aligned AIs against a small proportion of misaligned AIs is probably a scenario where humanity endures. It will be weird and dangerous, but probably not existential.
The story is linked below:
https://www.lesswrong.com/posts/jwhcXmigv2LTrbBiB/success-without-dignity-a-nearcasting-story-of-avoiding#comments
Thanks! Huh, I really don’t see how this solves outer alignment. I thought outer alignment was already mostly solved by some combination of imitation + amplification + focusing on intermediate narrow tasks like designing brain scanners… but yeah I guess I should go read the paper more carefully?
I don’t see how pretraining from human feedback has the property “as you get more data, the AI gets more aligned.” Why doesn’t it suffer from all the classic problems of inner alignment / deception / playing the training game / etc.?
I love Holden’s post, but I give something like 10% credence that we get success via that route, rather than the 80% or so that it seems like you give in order to get a mere 10% p(doom). The main crux for me here is timelines; I think things are going to happen too quickly for us to implement all the things Holden talks about in that post, even though he describes them as pretty basic. Note that there are a ton of even more basic safety things that current AI labs aren’t implementing, because they don’t think the problem is serious & they are under competitive pressure to race.