Generally agree with that intuition, but as others point out, the real Truman Show for AI is simboxing (which is related, but focuses more on knowledge containment to avoid deception issues during evaluations).
Davidad is going for more formal safety, where you can mostly automatically verify all the knowledge in the agent's world model, presumably verify prediction rollouts, verify that the selected actions correspond to futures satisfying the bargaining solution, etc. The problem is that this translates into a heavy set of constraints on the AGI architecture.
LOVES is instead built on the assumption that alignment cannot strongly dictate the architecture, since the AGI architecture is determined by the competitive requirements of efficient intelligence/inference (and so is most likely a large brain-like ANN). We then search for the hyperparams and minimal tweaks that maximize alignment on top of that (ANN) architecture.
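To make the second step concrete, here is a minimal sketch of that kind of search, assuming a fixed architecture whose alignment-relevant knobs are tuned against some sandboxed evaluation. The search space, `train_agent`, and `alignment_score` are hypothetical placeholders for illustration, not anything from LOVES itself:

```python
# Hypothetical sketch: random search over alignment-relevant hyperparameters
# of a fixed (brain-like ANN) architecture. The architecture is held constant;
# only knobs like intrinsic-reward weights or value-learning rates are tuned.
import random

# Illustrative search space (names are made up for this sketch).
SEARCH_SPACE = {
    "empathy_reward_weight": (0.0, 1.0),
    "value_learning_rate": (1e-5, 1e-3),
    "curiosity_weight": (0.0, 0.5),
}

def sample_config():
    # Draw one candidate setting of the alignment hyperparameters.
    return {k: random.uniform(lo, hi) for k, (lo, hi) in SEARCH_SPACE.items()}

def train_agent(config):
    # Placeholder: train the fixed ANN architecture with these knobs.
    return config

def alignment_score(agent):
    # Placeholder: evaluate the trained agent in a contained sim
    # (e.g. a simbox) and return a scalar alignment metric.
    return random.random()

best_config, best_score = None, float("-inf")
for _ in range(100):
    config = sample_config()
    score = alignment_score(train_agent(config))
    if score > best_score:
        best_config, best_score = config, score

print(best_config, best_score)
```

In practice the outer loop would be something smarter than random search (population-based or Bayesian), but the point is only that the architecture stays fixed while the alignment knobs are what get optimized.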