I don’t think this approach is going to generalize to alignment, because controlling agents at this level of detail requires giving exponentially many subagents veto power, which means that even a small probability of veto from any individual agent leads to a near-certain veto from some agent. That said, this plausibly solves the shutdown problem.
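To make the veto-compounding point concrete, here is a toy calculation (my own illustrative numbers, not anything from the proposal): if each of N subagents independently vetoes with small probability p, the chance that no one vetoes is (1 − p)^N, which collapses toward zero once N is exponentially large.

```python
# Toy illustration of veto compounding (hypothetical numbers):
# each subagent vetoes independently with small probability p;
# with exponentially many subagents, some veto is near-certain.
p = 1e-3          # hypothetical per-agent veto probability
N = 2 ** 20       # "exponentially many" subagents, one per controlled detail
prob_no_veto = (1 - p) ** N   # (1-p)^N underflows to ~0 for this N
prob_some_veto = 1 - prob_no_veto
print(prob_some_veto)  # effectively 1.0
```

So even a one-in-a-thousand per-agent veto rate makes the conjunction "no agent vetoes" astronomically unlikely at this scale.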
I understand that this is not the goal, but I thought it would be relevant to consider anyway, if the hope is to build on top of this.
But in the continuous limit, the subagents become similar to each other at the same rate as they become more numerous. It seems intuitive to me that with a little grinding you could get a decision-making procedure whose policy is an optimum of an integral over “subagents” who bet on the button being pushed at different times, so that the whole system changes behavior upon an arbitrarily-timed press of the button.
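A minimal sketch of what that integral could look like (entirely my own toy construction, not the post's machinery): discretize press times, give each "subagent" a hypothetical utility u(a, t) that depends on the press time t it bets on, and pick the policy parameter a that optimizes the weighted sum approximating the integral.

```python
# Toy sketch: policy as the optimum of an integral over press-time subagents.
# All utilities and weights here are hypothetical placeholders.
times = [i / 100 for i in range(101)]   # discretized button-press times in [0, 1]
weight = 1 / len(times)                  # uniform prior over press times

def u(a, t):
    # Hypothetical per-subagent utility: the subagent betting on a press at
    # time t prefers the policy parameter a to track t.
    return -abs(a - t)

def aggregate(a):
    # Riemann-sum approximation of the integral over subagents.
    return sum(weight * u(a, t) for t in times)

best = max(times, key=aggregate)  # → 0.5, the compromise policy "on average"
```

The point is only structural: the optimum responds to the whole distribution over press times rather than to any single designated moment, which is what would let the system react to an arbitrarily-timed press.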
Except I think in continuous time you probably lose the guarantees about the system not manipulating humans to press/not press the button. Unless maybe each subagent believes the button can only be pressed exactly at its chosen time. But this highlights that maybe all of these counterfactuals give rise to really weird worlds, which in turn give rise to weird behavior.
I could buy something like this with the continuous time limit.
I just mean if you want to extend this to cover things outside of the shutdown problem. Like you might want to request the AI to build you a fusion power plant, or cook you a chocolate cake, or make a company that sells pottery, or similar. You could have some way of generating a utility function for each possibility, and then generate subagents for all of them, but if you do this you’ve got an exponentially large conjunction.