Thumbs up for trying to think of novel approaches to solving the alignment problem.
"Every time the model does something that harms the utility function of the dumber models, it gets a loss function."
A few confusions:
By “it gets a loss function”, did you mean “it gets negative reward”?
If yes, doesn’t this plan consist entirely of reinforcement learning? How does this “emulate Evolution”?
What exactly does the quoted sentence mean? Does the smarter model (S) receive RL signals proportional to… changes in the dumber agents' (D's) total utility? (See the sketch below for the reading I have in mind.)
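To be concrete, here is a minimal sketch (in Python) of the reading I have in mind. Everything in it is my own guess at the setup, not something from your post: the environment `env`, the smarter model `S` with an `act` method, and above all the assumption that each dumber agent exposes a `utility(state)` function are hypothetical stand-ins.

```python
# Hypothetical sketch of one reading of the proposal (my guess, not the
# author's actual setup). Assumes each dumber agent d exposes a utility
# function d.utility(state) -- itself a big assumption, as noted below.

def reward_for_S(state_before, state_after, dumber_agents):
    """Per-step reward for the smarter model S: the change in the dumber
    agents' total utility caused by S's action (negative if S harmed them)."""
    total_before = sum(d.utility(state_before) for d in dumber_agents)
    total_after = sum(d.utility(state_after) for d in dumber_agents)
    return total_after - total_before

def rollout(env, S, dumber_agents, horizon=100):
    """Collect (state, action, reward) transitions to train S with ordinary RL."""
    state = env.reset()
    trajectory = []
    for _ in range(horizon):
        action = S.act(state)
        next_state = env.step(action)
        r = reward_for_S(state, next_state, dumber_agents)
        trajectory.append((state, action, r))
        state = next_state
    return trajectory  # fed to whatever RL algorithm updates S
```

If that is roughly the intended scheme, it is plain reinforcement learning with a particular reward function, which is why I don't see where the "emulating Evolution" part comes in.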
Some problems, off the top of my head:
GPT-like models don’t have utility functions.
Even if they did, mechinterp is nowhere near advanced enough to reveal models' utility functions.
Humans don’t have utility functions. It’s unclear how this would generalize to human-alignment.
It’s very much unclear what policy S would end up learning in this RL setup. It’s even less clear how that policy would generalize outside of training.
If S is given reward proportional to (changes in) D's utility, then basically we're just training S with D's utility function. I.e., just training some arbitrary RL policy/agent. Not much to do with alignment, AFAICT (see the toy example after this list). [1]
If S is instead given reward for things like {taking actions that lead to obtaining information about D’s utility function}, then… we’re training an RL policy/agent on proxies to “alignment”. I expect that kind of approach to break down badly (due to Goodhart) when S becomes highly capable.
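To spell out the point about reward proportional to changes in D's utility: per-step rewards of that form telescope, so S's (undiscounted) return over an episode is just D's final utility minus D's initial utility. In other words, S is being trained directly to maximize D's utility function. A toy numerical illustration (utility values made up for the example):

```python
# Toy illustration: per-step rewards defined as changes in D's utility
# telescope into "final utility minus initial utility".
U_d_over_time = [5.0, 4.0, 7.0, 6.5, 9.0]  # made-up values of D's utility

step_rewards = [after - before
                for before, after in zip(U_d_over_time, U_d_over_time[1:])]
total_return = sum(step_rewards)

assert abs(total_return - (U_d_over_time[-1] - U_d_over_time[0])) < 1e-9
print(step_rewards)  # [-1.0, 3.0, -0.5, 2.5]
print(total_return)  # 4.0 == 9.0 - 5.0

# So maximizing S's return == maximizing D's end-state utility: standard RL
# on D's utility function, with no extra "alignment" ingredient. Proxy
# rewards (the second case above) avoid this collapse but invite Goodharting
# once S becomes highly capable.
```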
I don’t know how you arrived at this plan, but I’m guessing it involved reasoning with highly abstract and vague concepts. You might be interested in (among others) these tools/techniques:
https://www.lesswrong.com/posts/GKfPL6LQFgB49FEnv/replace-the-symbol-with-the-substance
https://www.lesswrong.com/posts/JcpzFpPBSmzuksmWM/the-5-second-level
[1] Except maybe if you somehow managed to have the entire simulation be a very accurate model of the real world, and the Ds be very accurate models of humans. But that’s not remotely realistic; and it would still be subject to Goodhart.