Daniel Kokotajlo comments on What 2026 looks like

Daniel Kokotajlo 4 Apr 2023 22:22 UTC
2 points
Thanks! Huh, I really don’t see how this solves outer alignment. I thought outer alignment was already mostly solved by some combination of imitation + amplification + focusing on intermediate narrow tasks like designing brain scanners… but yeah I guess I should go read the paper more carefully?

I don’t see how pretraining from human feedback has the property “as you get more data, the AI gets more aligned.” Why doesn’t it suffer from all the classic problems of inner alignment / deception / playing the training game / etc.?

I love Holden’s post, but I give something like 10% credence that we get success via that route, rather than the 80% or so that you seem to give in order to get a mere 10% p(doom). The main crux for me here is timelines; I think things are going to happen too quickly for us to implement all the things Holden talks about in that post, even though he describes them as pretty basic. Note that there are a ton of even more basic safety things that current AI labs aren’t implementing, because they don’t think the problem is serious and they are under competitive pressure to race.