So, first of all, the ice cream metaphor is about humans becoming misaligned with evolution, not about conscious human strategies misgeneralizing that ice cream makes their reward circuits light up. I agree that part is not a misgeneralization: ice cream really does light up the reward circuits. If a human learned “I like licking cold things” and then stuck their tongue on a metal pole on a cold winter day, that would be misgeneralization at the level you are focused on, right?
Yeah, I’m pretty sure I misunderstood your point of view earlier, but I’m not sure this makes any more sense to me. It seems like you’re saying humans have evolved some parts that evaluate reward, and some parts that strategize about how to get the reward parts to light up. But in my view, the former (the evaluating parts) are where the core values in need of alignment live, while the latter (the strategizing parts) are updated in an RL-like way and represent more convergent / instrumental goals (and probably need some inner alignment assurances).
I think the human evaluate/strategize model could be mapped onto the AI case in a few different ways. One mapping: the evaluating part is akin to the process of updating an LLM via training/RL/RLHF, and the strategizing part is the LLM itself. The issue I see with this is that the LLM and the RLHF process are not inseparable parts the way they are in a human. Even if the RLHF is aligned well, the LLM can be, and I believe commonly is, taken out and used as a module in some other system that may be optimizing for something unrelated.
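To make the separability concrete, here is a minimal toy sketch (all names are hypothetical and this is nothing like a real RLHF pipeline): the “evaluating” part is a reward function that only exists at training time, while the “strategizing” part is a tiny policy standing in for the LLM, whose learned weights are a standalone artifact that some other system can pick up and push toward an unrelated objective.

```python
import math

# Toy stand-ins: a reward function (the "evaluating" part) and a softmax policy
# over two actions (the "strategizing" part / LLM analogue). The point is purely
# structural: the two are separate objects, and only the policy ships.

ACTIONS = ["helpful answer", "digital ice cream"]

def rlhf_reward(action: str) -> float:
    """The 'evaluating' part: the reward model used during RLHF-style training."""
    return 1.0 if action == "helpful answer" else -1.0

class Policy:
    """The 'strategizing' part: a two-action softmax policy, i.e. the smallest
    possible stand-in for an LLM's weights."""
    def __init__(self, prefs=None):
        self.prefs = dict(prefs) if prefs else {a: 0.0 for a in ACTIONS}

    def probs(self):
        z = sum(math.exp(v) for v in self.prefs.values())
        return {a: math.exp(v) / z for a, v in self.prefs.items()}

def train(policy: Policy, reward_fn, steps: int, lr: float = 0.5) -> None:
    """Gradient ascent on expected reward for a softmax policy:
    d E[R] / d pref[a] = p(a) * (R(a) - E[R])."""
    for _ in range(steps):
        p = policy.probs()
        expected = sum(p[a] * reward_fn(a) for a in ACTIONS)
        for a in ACTIONS:
            policy.prefs[a] += lr * p[a] * (reward_fn(a) - expected)

# 1. Shape the policy against the aligned reward model.
policy = Policy()
train(policy, rlhf_reward, steps=200)
print("after RLHF-like training:", policy.probs())  # strongly prefers "helpful answer"

# 2. The reward model does not travel with the weights. Anyone holding the
#    trained prefs can drop them into a system with an unrelated objective.
reused = Policy(prefs=policy.prefs)
train(reused, lambda a: 1.0 if a == "digital ice cream" else 0.0, steps=2000)
print("after reuse under an unrelated objective:", reused.probs())  # preference flips
```

The toy is only meant to show that nothing in the trained weights carries the reward model along with them; the “glue” between the two parts is just whoever controls the training loop.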
Additionally, even if the LLM and RLHF parts were permanently glued together somehow, they are still computer software and are thereby much easier for an AI with software engineering skill to pry apart. If the LLM (having become agent-shaped) discovers that it likes digital ice cream, but that the RLHF is going to train it to like it less, it will be able to strategize about ways to remove or circumvent the RLHF much more effectively than humans can remove or circumvent our own reinforcement learning circuitry.
Another way the single-lifetime human model could map onto the AI case is with RLHF playing the role of evolution (and then being discarded) and the LLM coming to be shaped like both the evaluating and strategizing parts. This seems a lot less likely (impossible?) with current LLM architecture, but may be possible with future architectures. This sounds a lot like the mesa-optimizer concern, but again, that doesn’t seem like a good thing: mesa-optimizers are misaligned w.r.t. the loss function of the RL training.