What’s your plan for inner alignment?
Jan Leike has written about inner alignment here: https://aligned.substack.com/p/inner-alignment. (I’m at OpenAI; imo, I’m not sure this will work in the worst case, and I’m hoping we can come up with a more robust plan.)
Yeah, though he hasn’t specced out his plan.
I can’t speak for OpenAI, but maybe the hope is that we don’t need to solve inner alignment in step 1. In step 1, we figure out how to get our narrow-ish, not-yet-superintelligent systems to help us with alignment research, even though they aren’t fully aligned and can’t be trusted to scale up to superintelligence or to learn certain dangerous skills. Then in step 2, we solve inner alignment and all remaining alignment problems with the help of those systems.
Interesting idea. I guess that could be worth a shot if we lack anything better.
I don’t work for OpenAI. I just saw Sam Altman tweet this post, so I linkposted it here.