seem very implausible when considered in the context of the human learning process (could a human’s visual cortex become “deceptively aligned” to the objective of modeling their visual field?).
I think it would probably be strange for the visual field to do this. But I think it’s not that uncommon for other parts of the brain to do this: the higher-level, most abstract / “psychological” parts that have a sense of how things will affect their relevance to future decision-making. I think there are lots of self-perpetuating narratives that it might be fair to call ‘deceptively aligned’ when they’re maladaptive. The idea of metacognitive blindspots also seems related.
I believe the human visual cortex is actually the more relevant comparison point for estimating the level of danger we face due to mesaoptimization. Its training process is more similar to the self-supervised / offline way in which we train (base) LLMs. In contrast, the ‘most abstract / “psychological”’ parts are more entangled in future decision-making. They’re more “online”, with greater ability to influence their future training data.
I think it’s not too controversial that online learning processes can have self-reinforcing loops in them. Crucially, however, such loops rely on being able to influence the externally visible data collection process, rather than being invisibly baked into the prior. They are thus much more amenable to being addressed with scalable oversight approaches.