My intuition is that a simulation such as the one being proposed would take far longer to develop than the timeline outlined in this post. I’d posit that the timeline would be closer to 60 years than 6.
Also, a suggestion for tl;dr: The Truman Show for AI.
Agreed.
Davidad seems to be aiming for what I’d call infeasible rigor, presumably in the hope of getting something that would make me more than 95% confident of success.
I expect we could get to 80% confidence with this basic approach, by weakening the expected precision of the world model, and evaluating the AI on a variety of simulated worlds, to demonstrate that the AI’s alignment is not too sensitive to the choice of worlds. Something along the lines of the simulations in Jake Cannell’s LOVE in a simbox.
Is 80% confidence the best we can achieve? I don’t know.
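To make the kind of multi-world evaluation I have in mind concrete, here is a minimal sketch. It is purely illustrative: the interfaces (`world.rollout`, `alignment_score`) and the thresholds are hypothetical, not part of davidad's proposal or of LOVE in a simbox.

```python
# Hypothetical sketch: evaluate one candidate AI policy across several
# coarse simulated worlds and check how sensitive its behaviour is to
# the choice of world. All names and thresholds are made up.

from statistics import mean, pstdev

def evaluate_across_worlds(policy, worlds, alignment_score, episodes=100):
    """Return the per-world average alignment score for `policy`."""
    results = {}
    for name, world in worlds.items():
        scores = []
        for _ in range(episodes):
            trajectory = world.rollout(policy)          # simulate one episode
            scores.append(alignment_score(trajectory))  # judge the outcome
        results[name] = mean(scores)
    return results

def is_robustly_aligned(results, min_score=0.9, max_spread=0.05):
    """Pass only if the policy scores well in every world and the scores
    do not vary much with the choice of world."""
    scores = list(results.values())
    return min(scores) >= min_score and pstdev(scores) <= max_spread
```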
My understanding is that the world model is more like a very coarse projection of the world than a simulation.
It’s not the case that the AGI has to be fooled into thinking the simulation is real like in the Truman Show (I like the name though!).
Davidad only tries to achieve ‘safety’, not alignment. Indeed, the AI may be fully unaligned.
The proposal is different from simulation proposals like Jacob Cannell’s LOVE in a simbox, where one tries to align the values of the AI.
In Davidad’s proposal, the actual AGI is physically boxed and cannot interact with the world except by proposing policies inside this world model (which get formally checked in the second stage).
One way of thinking about it is that Davidad’s proposal is really an elaborate boxing protocol, but there are multiple boxes here (roughly sketched in code after the list):
The physical Faraday cage that houses the hardware
The interface constraint that constrains the AI to output only into the formal world model
The formal cage that is achieved by verifying the behaviour through mathmagic.
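One way to picture how the three boxes compose, as a rough sketch. Every identifier here is hypothetical; this is my reading of the layering, not code from the proposal.

```python
# Illustrative sketch of the layered boxing: the AGI never acts on the
# real world directly; it only emits candidate policies into a formal
# world model, and only formally verified policies are ever deployed.
# All names are hypothetical.

def run_boxed_agi(agi, world_model, safety_spec, deploy):
    # Box 1: physical containment -- the AGI runs on air-gapped hardware
    # inside a Faraday cage, with no channel to the outside world.
    candidate_policies = agi.propose_policies(world_model)

    for policy in candidate_policies:
        # Box 2: interface constraint -- the only output channel is a
        # policy expressed in the world model's formal language.
        if not world_model.is_well_formed(policy):
            continue

        # Box 3: the formal cage -- check that every trajectory the
        # policy induces in the world model satisfies the safety
        # specification ("mathmagic").
        if world_model.verify(policy, safety_spec):
            deploy(policy)   # only verified policies ever leave the box
            return policy

    return None  # nothing passed verification; nothing leaves the box
```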
Although the technical challenges seem daunting, especially on such short timelines, this is not where I am most skeptical. The key problem, as with all boxing proposals, is more of a governance and coordination problem.
Generally agree with that intuition, but as others point out, the real Truman Show for AI is simboxing (which is related, but focuses more on knowledge containment to avoid deception issues during evaluations).
Davidad is going for more formal safety, where you can mostly automatically verify all the knowledge in the agent’s world model, presumably verify prediction rollouts, verify the selected actions correspond to futures that satisfy the bargaining solution, etc. The problem is that this translates into a heavy set of constraints on the AGI architecture.
LOVES is instead built on the assumption that alignment cannot strongly dictate the architecture, since the AGI architecture is determined by the competitive requirements of efficient intelligence/inference (and so is most likely a large brain-like ANN). We then search for the hyperparams and minimal tweaks which maximize alignment on top of that (ANN) architecture.
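To make the contrast with the formal-verification route concrete, here is a toy sketch of that search loop as I read it: the architecture is held fixed, and we only search over hyperparameters and small tweaks, scoring alignment inside sandboxed simulations. The names and the scoring interface are hypothetical, not Jacob's actual code.

```python
# Toy sketch of a LOVES-style search: the (brain-like ANN) architecture
# is fixed by competitiveness, and we search only over hyperparameters
# and small tweaks, scoring alignment in sealed simbox worlds.
# All names are hypothetical.

import random

def search_alignment_tweaks(build_agent, simboxes, alignment_score,
                            candidate_hyperparams, trials=50):
    """Return the hyperparameter setting whose agent scores best on
    alignment across a set of simulated sandbox worlds."""
    best, best_score = None, float("-inf")
    for _ in range(trials):
        hp = random.choice(candidate_hyperparams)   # naive random search
        agent = build_agent(**hp)                   # same architecture every time
        score = min(alignment_score(agent, box) for box in simboxes)
        if score > best_score:                      # judge by the worst-case simbox
            best, best_score = hp, score
    return best, best_score
```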
I don’t have the same intuition for the timeline, but I really like the tl;dr suggestion!