I applaud the work; detailed thinking about control efforts is one useful alignment strategy.
I find the assumption of no time-awareness in the AI to be unrealistic. Allowing continuous learning is going to be extremely tempting to anyone developing or deploying proto-AGI. Episodic memory (including across sessions) is fairly easy to implement, and can add capabilities as well as save compute costs. Working out useful knowledge and strategies will probably take nontrivial compute, and throwing away that knowledge after each session will seem very wasteful and will limit usefulness.
I agree re time-awareness, with two caveats:
The kind of mechanism you listed probably only allows the AIs to have a rough idea of what time it is.
We can keep some instances of our AI largely unaware of the time by restricting their between-episode memory. For example, we might do this for the instances responsible for untrusted monitoring, to reduce collusion.