Davidad seems to be aiming for what I’d call infeasible rigor, presumably in hope of getting something that would make me more than 95% confident of success.
I expect we could get to 80% confidence with this basic approach by weakening the expected precision of the world model and evaluating the AI on a variety of simulated worlds, to demonstrate that the AI’s alignment is not too sensitive to the choice of worlds, something along the lines of the simulations in Jake Cannell’s LOVE in a simbox.
Is 80% confidence the best we can achieve? I don’t know.