Is the intention here that any number of Quirrell points is less good than the self-destruct option is bad? In other words, is there no N such that I would accept a fair coin flip between walking away with a score of N and self-destructing?
I’d like to say that it doesn’t matter, because if you ever pick something that may be unsafe, Quirrell probably chose it because he can make it self-destruct. But of course it is possible that, as a bounded rationalist, you find a program that seems safe without being entirely sure, and after taking Quirrell’s motivation into account you’d assign 50% probability to it being unsafe. So yeah, your question could matter, and in order to get the kind of solution I want, you’d have to make the assumption you propose.
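To make that assumption concrete, here is a minimal sketch in Python; the specific utility numbers and the cap on how valuable points can be are placeholders of mine, not anything specified by the scenario:

```python
# A minimal sketch (not from the original exchange) of the assumption under
# discussion: no number of points makes a fair coin flip against
# self-destruction worth taking. All utilities here are placeholders.

U_DESTROY = -1e12        # hypothetical disutility of the self-destruct outcome
POINT_UTILITY_CAP = 1e6  # hypothetical bound on how good points can ever get

def utility_of_points(n):
    """Bounded utility of walking away with a score of n."""
    return min(n, POINT_UTILITY_CAP)

def coin_flip_acceptable(n):
    """Fair coin flip between scoring n and self-destructing, compared to
    walking away now with utility 0."""
    return 0.5 * utility_of_points(n) + 0.5 * U_DESTROY > 0

# With a bounded point utility and a sufficiently bad self-destruct outcome,
# no n makes the flip acceptable, which is the assumption proposed above.
assert not any(coin_flip_acceptable(n) for n in (10, 10**6, 10**9, 10**18))
```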
Furthermore, I might find myself in a situation where I was highly confident that Quirrell could not produce a self-destruct in fewer than n moves, but was uncertain of the result after that.
This seems to me much like probabilistic reasoning in a game like Chess or Go. There are moves that make things more or less complicated. There are moves that are clearly stable, and equally clearly sufficient or insufficient. Sometimes you take the unstable move, not knowing whether it’s good or bad, because the stable moves are insufficient. Sometimes you simply can’t read far enough into the future, and you have to hope that your opponent left you this option because they couldn’t either.
It seems to me to be relevant not only what Quirrell’s motivations are, but how much smarter he is. When playing Go, you don’t take the unstable moves against a stronger player. You trade your lead from the handicap stones for a more stable board position, and you keep making that trade until the game ends, and you hope you traded well enough—not that you got the good end of the trade, because you didn’t need to.
In the real world… What certainty of safety do you need on a strong AI that will stop all the suffering in the world, when screwing up will end the human race? Is one chance in a thousand good enough? How about one chance in 10^9? Certainly that question has an answer, and the answer isn’t complete certainty.
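As a rough illustration of why the answer is some finite threshold rather than certainty, here is a back-of-the-envelope sketch; the utility figures are placeholders I am assuming, not claims from this comment:

```python
# Back-of-the-envelope sketch of the question above. The utilities are
# hypothetical placeholders, chosen only to show that the acceptable
# failure probability is some finite threshold, not zero.

U_SUCCESS    = 1.0e9    # hypothetical value of stopping all the suffering
U_EXTINCTION = -1.0e11  # hypothetical disvalue of ending the human race
U_STATUS_QUO = 0.0      # keep waiting, change nothing for now

def deploy_is_better(p_failure):
    """Does the expected utility of deploying beat the status quo?"""
    return (1 - p_failure) * U_SUCCESS + p_failure * U_EXTINCTION > U_STATUS_QUO

# Break-even failure probability: deploying wins iff p_failure is below this.
break_even = U_SUCCESS / (U_SUCCESS - U_EXTINCTION)
print(break_even)              # ~0.0099 with these placeholder numbers
print(deploy_is_better(1e-3))  # one chance in a thousand -> True here
print(deploy_is_better(1e-9))  # one chance in 10^9       -> True here
```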
That said… I’m not sure allowing any sort of reasoning with uncertainty over future modifications helps anything here. In fact, I think it makes the problem harder.
It seems to me to be relevant not only what Quirrell’s motivations are, but how much smarter he is.
Right, I’m assuming he’s insanely much smarter.
Certainly that question has an answer, and the answer isn’t complete certainty. That said… I’m not sure allowing any sort of reasoning with uncertainty over future modifications helps anything here. In fact, I think it makes the problem harder.
I agree completely. The real problem we have to solve is much harder than the toy scenario in this post; the point of the toy scenario was to help focus on one particular aspect of the problem.