Alignment by Simulation?
I’ve heard of an alignment plan that is a variation of ‘simulate top alignment researchers’ with an LLM. Usually the poor alignment researcher in question is Paul.
This strikes me as deeply unserious, and I am confused about why it has gained so much traction.
That AI-assisted alignment is coming (indeed, is already here!) is undeniable. But simulating a human even somewhat accurately from text data is a crazy sci-fi ability, probably not even physically possible. The plan seems to ascribe nearly magical abilities to LLMs.
Predicting a partially observable process is fundamentally hard. Even in very simple cases this blows up: there are generative (partially observable) models with just two hidden states, such as the simple nonunifilar source, that require an infinite number of predictive states to predict optimally. In more generic cases one expects this to be far worse.
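To make this concrete, here is a minimal sketch (the two-state HMM below is a hypothetical example in the spirit of the simple nonunifilar source; its transition probabilities are illustrative, not taken from any paper). It enumerates the belief states over the hidden states that an optimal predictor must distinguish, and the count keeps growing with the observation horizon even though the generator itself has only two states:

```python
import numpy as np

# Hypothetical 2-state nonunifilar HMM (illustrative parameters).
# T[s][i, j] = P(next state = j, emit symbol = s | current state = i)
T = {
    0: np.array([[0.0, 0.0],
                 [0.5, 0.0]]),  # only state B emits a 0, then resets to A
    1: np.array([[0.5, 0.5],
                 [0.0, 0.5]]),  # a 1 leaves the hidden state ambiguous
}

def belief_update(b, s):
    """Bayes-update the belief over hidden states after observing symbol s."""
    unnorm = b @ T[s]
    total = unnorm.sum()
    if total == 0.0:          # symbol s is impossible under belief b
        return None
    return unnorm / total

def reachable_beliefs(n, start=np.array([0.5, 0.5])):
    """Enumerate every belief state reachable within n observations."""
    frontier = {tuple(start)}
    seen = set(frontier)
    for _ in range(n):
        nxt = set()
        for b in frontier:
            for s in (0, 1):
                b2 = belief_update(np.array(b), s)
                if b2 is not None:
                    nxt.add(tuple(np.round(b2, 12)))
        frontier = nxt - seen
        seen |= frontier
    return seen

for n in (5, 10, 20, 40):
    print(n, len(reachable_beliefs(n)))
# The count grows without bound: the predictor needs a fresh belief
# state for every run-length of 1s since the last 0.
```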
Errors compound over time (or continuation length): even a tiny amount of per-step noise would throw the simulation off its true trajectory.
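A back-of-the-envelope calculation makes the compounding vivid (the fidelity numbers here are made up for illustration): if each simulated token matches the true process independently with probability p, then an n-token continuation stays entirely on track with probability p^n, which collapses fast.

```python
# Illustrative per-token fidelities p and continuation lengths n:
# the on-track probability p**n decays exponentially in n.
for p in (0.999, 0.9999):
    for n in (1_000, 10_000, 100_000):
        print(f"p={p}, n={n}: on-track probability {p**n:.2e}")
```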
Okay, maybe people just mean that GPT-N will roughly know what Paul would be looking at. I think this is plausible in very broad brush strokes, but it seems misleading to call it ‘simulation’.