I think rule-following may be fairly natural for minds to learn, much more so than other safety properties, and I think it might be worth more research into this direction.
Some pieces of weak evidence:
1. Most humans seem to learn and follow broad rules, even when they are capable of more detailed reasoning
2. Current LLMs seem to follow rules fairly well (and training often incentivizes them to learn broad heuristics rather than detailed, reflective reasoning)
While I do expect rule-following to become harder to instill as AIs become smarter, I think that if we are sufficiently careful, it may well scale to human-level AGIs. I think trying to align reflective, utilitarian-style AIs is probably really hard, as these agents are much more prone to small unavoidable alignment failures (like slight misspecification or misgeneralization) causing large shifts in behavior. Conversely, if we try our best to instill specific simpler rules, and then train these rules to take precedence over consequentialist reasoning whenever possible, this seems a lot safer.
I also think there is a bunch of tractable, empirical research that we can do right now about how to best do this.
Another reason this could be misleading is that it’s possible that sampling from the model pre-mode-collapse performs better at high k even if RL did teach the model new capabilities. In general, taking a large number of samples from a slightly worse model before mode collapse often outperforms taking a large number of samples from a better model after mode collapse — even if the RL model is actually more powerful (in the sense that its best reasoning pathways are better than the original model’s best reasoning pathways). In other words, the RL model can both be much smarter and still perform worse on these evaluations due to diversity loss. But it’s possible that if you’re, say, trying to train a model to accomplish a difficult/novel reasoning task, it’s much better to start with the smarter RL model than the more diverse original model.