Random thought: Perhaps you could carefully engineer gradient starvation in order to “avoid generalizing” and defeat the Discrete modes of prediction example. You’d only need to delay it until reflection, then the AI can solve the successor AI problem.
In general: hack our way towards getting value-preserving reflectivity before values drift from “Diamonds” → “What’s labeled as a diamond by humans”. (Replacing with “Telling the truth”, and “What the human thinks is true” respectively).
Random thought: Perhaps you could carefully engineer gradient starvation in order to “avoid generalizing” and defeat the Discrete modes of prediction example. You’d only need to delay it until reflection, then the AI can solve the successor AI problem.
In general: hack our way towards getting value-preserving reflectivity before values drift from “Diamonds” → “What’s labeled as a diamond by humans”. (Replacing with “Telling the truth”, and “What the human thinks is true” respectively).