According to your internal model of the problem of AI safety, what are the main axes of disagreement researchers have?
The three that first come to mind:
1. How much are ML systems likely to learn to "look good to the person training them" in a way that will generalize scarily to novel test-time situations, vs. learning to straightforwardly do what we are trying to train them to do?
2. How much alien knowledge are ML systems likely to have? Will humans be able to basically understand what they are doing with some effort, or will it quickly become completely beyond us?
3. How much time will we have to adapt gradually as AI systems improve, and how fast will we be able to adapt? How similar will the problems that arise be to the ones we can anticipate now?