Curated. It’s nice to see a return to the problems of yore, and I think this is a nice incremental proposal. Bringing in causal counterfactuals seems like a neat trick (with lots of problems, as discussed in the post and the comments), and so does bringing in some bargaining theory.
I have lots of confusions and questions, like
so one general strategy the proposal fits into is “experiment with simpler utility functions (or other goal structures) to figure things out, and rely on corrigibility to make sure that we don’t die in the process of experimenting”
doesn’t make sense to me yet, as it seems easy for the utility functions / belief states to all prefer killing humans quickly, even if the humans don’t affect the shutdown button exactly. Or the aside on bargaining with non-causally-counterfacting agents. But they’re confusions and questions that afford some mulling, which is pretty cool!