I think AI systems should be designed in such a way as to avoid being susceptible to sign flips (as Eliezer argues in the post you linked), but I also suspect that kind of robustness is likely to emerge naturally in the course of developing these systems. While a sign flip may occur in some local area, you'd have to have basically no checksums on the process for the result of a sign-flipped reward function to end up in control.
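To make "checksums on the process" concrete, here is a minimal toy sketch (purely illustrative, not any real training stack; the reward model is a stand-in I made up) of how a single stray minus sign flips a reward function, and how a cheap sanity check on known-good versus known-bad samples catches it before anything downstream relies on the mis-optimized policy:

```python
# Toy sketch of a sign-flip bug and a "checksum" that catches it.
# Everything here is hypothetical and simplified for illustration.

def human_preference_score(text: str) -> float:
    """Stand-in for a learned reward model: higher = more preferred."""
    return float(len(set(text.split())))  # toy proxy, purely illustrative

def buggy_reward(text: str) -> float:
    # The sign flip: one stray minus turns "maximize preference"
    # into "maximize dispreference".
    return -human_preference_score(text)

def reward_sanity_check(reward_fn, known_good: str, known_bad: str) -> bool:
    """A cheap checksum on the process: a reward function that scores a
    known-good sample below a known-bad one is almost certainly broken."""
    return reward_fn(known_good) > reward_fn(known_bad)

if __name__ == "__main__":
    good = "a helpful and varied answer with many distinct words"
    bad = "bad bad bad bad bad"
    assert reward_sanity_check(human_preference_score, good, bad)
    # The flipped reward fails the same check, so training would be halted
    # long before the mis-optimized result ends up "in control".
    assert not reward_sanity_check(buggy_reward, good, bad)
    print("sanity checks behaved as expected")
```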
What do you think the difference would be between an AGI’s reward function and that of GPT-2 during the error it experienced?
One difference is the gap between training time and deployment, as others have mentioned. The other is that I’m skeptical there will be a singleton AI that was simply trained via reinforcement learning.
Like, we’re going to train a single neural network end-to-end on running the world? And just hand over the economy to it? I don’t think that’s how it’s going to go. There will be interlocking, increasingly powerful systems. See: Arguments about fast takeoff.