The maximise p=P(cauldron full) constrained by p<1−ϵ approach has really weird failure modes. This is how I think it would go: take over the world, using a method that has a <ϵ chance of failing; build giant computers to calculate the exact chance of your takeover succeeding; then build a random bucket filler to make the probability work out. I.e., if ϵ=3%, the AI does its best to take over the world, and once it succeeds it calculates that its plan had a 2% chance of failure. So it builds a bucket filler that has a 97/98 chance of working. This policy leaves the chance of the cauldron being filled at exactly 97%.
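For concreteness, here is the arithmetic of that scenario as a tiny Python sketch (the numbers are just the ones from the example above):

```python
# Sketch of the arithmetic in the failure mode above (illustrative only).
epsilon = 0.03          # the cap: P(cauldron full) must stay below 1 - epsilon = 0.97
p_takeover = 0.98       # the takeover plan succeeded; in hindsight it had a 2% failure chance
target = 1 - epsilon    # the probability the AI wants to land on exactly

# Choose a bucket filler whose success probability makes the total work out:
# p_takeover * p_filler = target  =>  p_filler = target / p_takeover = 97/98
p_filler = target / p_takeover
print(p_filler)                  # ~0.9898, i.e. 97/98
print(p_takeover * p_filler)     # 0.97 (up to float error)
```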
The idea (as I understood it from the post) isn’t to maximize P(goal) constrained by P(goal)<1−ϵ. The idea is to select actions in a way which maximizes P(goal|action), unless multiple actions achieve P(goal|action)>1−ϵ, in which case you choose between those randomly. Or the same but for policies rather than actions. Or the same but for plans rather than actions. Something in that general area.
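A minimal sketch of that decision rule as I read it (the function name and the action-to-probability dictionary interface are just for illustration, not anything from the post):

```python
import random

def choose_action(success_prob, epsilon):
    """Pick an action given success_prob: action -> P(goal | action).

    If several actions clear the 1 - epsilon threshold, pick uniformly
    at random among them; otherwise fall back to plain argmax.
    """
    satisficing = [a for a, p in success_prob.items() if p > 1 - epsilon]
    if len(satisficing) > 1:
        return random.choice(satisficing)
    return max(success_prob, key=success_prob.get)

# Example: with epsilon = 0.05, both "fill_bucket" and "take_over_world"
# clear the threshold, so the agent picks between them at random rather
# than always choosing the highest-probability (and most extreme) option.
probs = {"do_nothing": 0.0, "fill_bucket": 0.96, "take_over_world": 0.999}
print(choose_action(probs, epsilon=0.05))
```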
One general concern is tiling—it seems very possible that such a system would build a successor-agent with the limitations taken off, since this is an easy way to achieve its objective, especially in an environment in which one AGI has (apparently) already been constructed.
Another problem with this solution to AI safety is that we can’t apply it to, say, a big neural network. The NN may internally perform search-based planning to achieve high performance on a task, but we don’t have access to that internal search, so we can’t modify it to prevent the NN from “trying too hard”.
We could, of course, apply it to NN training, making sure to stop training before the NN becomes so capable that we have safety concerns. But the big problem there is knowing where to stop. If you train the NN enough to be highly capable, it may already be implementing the kind of internal search we would be concerned about, which suggests the stopping point has to come before NNs can develop internal search at all. Unfortunately we’ve already seen NNs learn to search (granted, with some prodding to do so—I believe there have been other examples, but I didn’t find the one I remembered). So this constraint has already been violated, and it would presumably be really difficult to convince people to reel things back in.
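In code, the “stop training before it gets dangerous” idea amounts to something like the following sketch, where `capability_score` is a hypothetical proxy measurement I’m assuming we have; the whole difficulty is that we don’t know how to set the threshold, or whether any such proxy would fire before internal search appears:

```python
# Illustrative sketch only: halt training once a (hypothetical) capability
# proxy crosses a threshold. Choosing the threshold, and trusting the proxy
# to trigger before search-like planning emerges, is the hard part.
def train_with_capability_cutoff(model, data_loader, train_step, capability_score,
                                 max_steps=100_000, capability_threshold=0.9):
    for step, batch in zip(range(max_steps), data_loader):
        train_step(model, batch)
        if capability_score(model) >= capability_threshold:
            # Stop before (we hope) internal search develops.
            break
    return model
```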