(Though I really wouldn’t classify pretraining from human feedback as substantial progress on core problems—I’d be interested to hear more about this from you, wanna explain what core problem it’s helping solve?)
The major problem that was solved is outer alignment, which is essentially the question of which goals we should give our AI; in particular, as you get more data, the AI gets more aligned.
This is crucial for aligning superhuman AI.
The PII task also touches on something very important for AI safety: can we prevent instrumentally convergent goals in AIs that are aligned with human values? The answer to that is a tentative yes, given that the AI takes less personally identifiable information as it scales with more data.
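(Concretely, one of the objectives studied in that paper, conditional training, looks roughly like the sketch below. The control tokens and the rule-based PII detector are illustrative stand-ins rather than details lifted from the paper; the point is just that every additional pretraining document carries a feedback signal, which is why more data means more alignment pressure.)

```python
# Rough sketch of conditional training ("pretraining from human feedback"):
# score each pretraining segment, prepend a control token reflecting the score,
# and train the LM on the tagged text. At inference time you condition on
# <|good|> so the model imitates the well-behaved part of the distribution.
# The detector and token names below are illustrative, not from the paper.
import re

GOOD, BAD = "<|good|>", "<|bad|>"

# Toy PII detector: flags email addresses and phone-number-like digit runs.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # email addresses
    re.compile(r"\+?\d[\d\s().-]{7,}\d"),     # phone-number-like strings
]

def contains_pii(segment: str) -> bool:
    return any(p.search(segment) for p in PII_PATTERNS)

def tag_segment(segment: str) -> str:
    """Prepend a control token so the LM learns to separate the two regimes."""
    return (BAD if contains_pii(segment) else GOOD) + segment

def build_training_corpus(segments):
    """Every additional document contributes feedback signal: the more
    pretraining data, the more examples of the <|good|> regime to imitate."""
    return [tag_segment(s) for s in segments]

if __name__ == "__main__":
    corpus = [
        "The meeting notes are attached.",
        "Contact me at jane.doe@example.com or +1 555 123 4567.",
    ]
    for line in build_training_corpus(corpus):
        print(line)
```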
For our main crux, if you want to discuss it now (fair if you don’t), I’d be interested to hear a concrete story (or sketch of a story) that you think is plausible in which we survive despite little dignity and basically no alignment progress before superhuman AI systems arrive.
A major part of this comes from Holden Karnofsky’s “success without dignity” post, which I’ll link below. The core of that story is that we may have a dumpster fire on our hands with AI safety, but that doesn’t mean we can’t succeed. It’s possible that the AI alignment problem is really easy, such that some method just works. Conditioning on alignment being easy, even as the world gets quite a bit more dangerous due to technology, a large proportion of aligned AIs versus a small proportion of misaligned AIs is probably a scenario where humanity endures. It will be weird and dangerous, but probably not existential.

The story is linked here:
https://www.lesswrong.com/posts/jwhcXmigv2LTrbBiB/success-without-dignity-a-nearcasting-story-of-avoiding#comments
Thanks! Huh, I really don’t see how this solves outer alignment. I thought outer alignment was already mostly solved by some combination of imitation + amplification + focusing on intermediate narrow tasks like designing brain scanners… but yeah I guess I should go read the paper more carefully?
I don’t see how pretraining from human feedback has the property “as you get more data, the AI gets more aligned.” Why doesn’t it suffer from all the classic problems of inner alignment / deception / playing the training game / etc.?
I love Holden’s post, but I give something like 10% credence that we get success via that route, rather than the 80% or so that it seems like you give in order to get a mere 10% p(doom). The main crux for me here is timelines; I think things are going to happen too quickly for us to implement all the things Holden talks about in that post, even though he describes them as pretty basic. Note that there are a ton of even more basic safety things that current AI labs aren’t implementing because they don’t think the problem is serious & they are under competitive pressure to race.
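(Rough arithmetic behind those numbers, under the simplifying assumption, mine, not necessarily anyone's actual model, that survival probability just adds across disjoint routes:)

```python
# Toy decomposition, purely for illustration:
# p(doom) = 1 - [p(Holden's route works) + p(success via any other route)].

def p_doom(p_route: float, p_other_routes: float) -> float:
    return 1.0 - (p_route + p_other_routes)

# Claiming only ~10% doom while most hope runs through this route implies
# putting roughly 80% on the route working:
print(p_doom(p_route=0.80, p_other_routes=0.10))  # ~0.10

# With ~10% credence on the route instead, the same decomposition implies
# a much higher p(doom):
print(p_doom(p_route=0.10, p_other_routes=0.10))  # ~0.80
```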