We should be thinking through the problems with that!
It’s not the worst idea in the world!
EDIT: note that the obvious failure mode here is “but then you retrain the language model as part of your RL loop, and language loses its meaning, and then it does something evil and then everyone dies.” So everyone still has to not do that! But this makes me think that ~human-level alignment in controlled circumstances might not be impossible.
It doesn’t change the fact that if anyone thinks it would be fun to fine-tune the whole model on an RL objective, we’d still lose. So we have to do global coordination ASAP. (It does not seem at all likely that Yudkowskian pivotal acts can be achieved solely through the logic learned from getting ~perfect LLM accuracy on the entire internet, since all proposed pivotal acts require new concepts no one has ever discussed.)
Yeah, don’t do RL on it, but instead use it to make money for you (ethically) and at the same time ask it to think about how to create a safe/aligned superintelligent AGI. You may still need a big enough lead (to prevent others who do RL from outcompeting you) or global coordination, but it doesn’t seem obviously impossible.
Pretty much. I also think this plausibly buys off the actors who are currently really excited about AGI. They can make silly money with such a system without the RL part—why not do that for a while, while mutually enforcing the “nobody kill everyone” provisions?
Maybe!