True enough, but once we step outside of the thought experiment and take a look at the idea it is intended to represent, “button gets pressed” translates into “humanity gets convinced to accept the machine’s proposal”. Since the AI-analogue device has no motives or desires save to model the universe as perfectly as possible, P(A bit flips in the AI that leads to it convincing a human panel to do something bad) necessarily drops below P(A bit flips anywhere that leads to a human panel deciding to do something bad) and is discountable for the same reason why we ignore hypothesises like “Maybe a cosmic ray flipped a bit to make it do that?” when figuring out the source of computer errors in general.
P(A bit flips in the AI that leads to it convincing a human panel to do something bad) is always less than P(A bit flips anywhere that leads to a human panel deciding to do something bad), (the former is a subset of the latter).
The point of the cosmic ray statement is not so much that that might actually happen, but is just demonstrating that the Outcome-Pump-2.0-universe doesn’t necessarily result in a positive outcome, just that it is a universe that has had the “Outcome” accepted, and also that the Outcome being accepted doesn’t imply that the universe is one we like.
True enough, but once we step outside of the thought experiment and take a look at the idea it is intended to represent, “button gets pressed” translates into “humanity gets convinced to accept the machine’s proposal”. Since the AI-analogue device has no motives or desires save to model the universe as perfectly as possible, P(A bit flips in the AI that leads to it convincing a human panel to do something bad) necessarily drops below P(A bit flips anywhere that leads to a human panel deciding to do something bad) and is discountable for the same reason why we ignore hypothesises like “Maybe a cosmic ray flipped a bit to make it do that?” when figuring out the source of computer errors in general.
P(A bit flips in the AI that leads to it convincing a human panel to do something bad) is always less than P(A bit flips anywhere that leads to a human panel deciding to do something bad), (the former is a subset of the latter).
The point of the cosmic ray statement is not so much that that might actually happen, but is just demonstrating that the Outcome-Pump-2.0-universe doesn’t necessarily result in a positive outcome, just that it is a universe that has had the “Outcome” accepted, and also that the Outcome being accepted doesn’t imply that the universe is one we like.