To this, the deep-learning-has-alignment-implications proponent replies: “But simple small-scale tasks don’t require maximizing a coherent preference ordering over world-states. We can already hook up an LLM to a robot and have it obey natural-language commands in a reasonable way.”
To which you might reply, “Fine, cute trick, but that doesn’t help with the real alignment problem, which is that eventually someone will invent a powerful optimizer with a coherent preference ordering over world-states, which will kill us.”
To which the other might reply, “Okay, I agree that we don’t know how to align an arbitrarily powerful optimizer with a coherent preference ordering over world-states, but if your theory predicts that we can’t aim AI systems at low-impact tasks via training, you have to be getting something wrong, because people are absolutely doing that right now, by treating it as a mundane engineering problem in the current paradigm.”
To which you might reply, “We predict that the mundane engineering approach will break down once the systems are powerful enough to come up with plans that humans can’t supervise”?
It’s unlikely that any realistic AI will be perfectly coherent, or have exact preferences over world-states. The first is roughly equivalent to the Frame Problem; the second is defeated by embeddedness.
The obvious question here is to what degree you need new techniques, versus merely training new models with the same techniques, as you scale up current approaches.
One of the virtues of the deep learning paradigm is that you can usually test things at small scale (where the models are not, and will never be, especially smart), and there’s a smooth range of scaling regimes in between across which results tend to generalize.
If you need fundamentally different techniques at different scales, and the large-scale techniques do not work at intermediate and small scales, then you might have a problem. If the same techniques that work at small and medium scales also work at large scales, then engineering remains tractable even as algorithmic advances obsolete old approaches.
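As a concrete illustration of that workflow, here is a minimal sketch of the standard scaling-law extrapolation move: fit a power law to losses measured on small runs and check whether a larger run lands near the prediction. The parameter counts and loss values below are made-up placeholders, not measurements from any real training run.

```python
import numpy as np

# Hypothetical (parameter count, validation loss) pairs from small-scale runs;
# the numbers are placeholders, not real measurements.
params = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses = np.array([4.10, 3.72, 3.35, 3.04, 2.78])

# Fit loss ~= a * params**(-b) by linear regression in log-log space.
slope, intercept = np.polyfit(np.log(params), np.log(losses), 1)

# If the scaling regime really is smooth, the extrapolated loss for a 10x
# larger model should be close to what that model actually achieves.
predicted = np.exp(intercept + slope * np.log(1e9))
print(f"predicted loss at 1e9 params: {predicted:.3f}")
```

If the extrapolation holds, the “same techniques, bigger models” story gets some support; if larger runs systematically diverge from the small-scale fit, that’s evidence the regimes aren’t as smooth as hoped.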