The natural generalization is then to have one subagent for each time at which the button could first be pressed (including one for “button is never pressed”, i.e. the button is first pressed at t=∞). So subagent ∞ maximizes E[u_1 | do(∀t: button_t = unpressed), observations], and for every other time T, subagent T maximizes E[u_2 | do(∀t<T: button_t = unpressed, button_T = pressed), observations]. The same arguments from above then carry over, as do the shortcomings (discussed in the next section).
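To make the indexing concrete, here is a minimal sketch that only enumerates the subagents and the counterfactual each one conditions on. It does not compute any expectation or do any optimization. The function names, the finite `horizon` cutoff, the choice to start time at t=0, and the string labels `"u1"` / `"u2"` are all assumptions made for illustration; the proposal itself does not fix any of them.

```python
import math


def intervention(first_press, horizon):
    """Return the do()-assignment for the subagent indexed by first_press.

    first_press is an integer time in 0..horizon-1, or math.inf for the
    "button is never pressed" subagent. horizon is a hypothetical finite
    cutoff used only so the assignment can be written out explicitly.
    """
    if first_press == math.inf:
        # Subagent infinity: the button is unpressed at every time step.
        return {t: "unpressed" for t in range(horizon)}
    # Subagent T: unpressed at all t < T, pressed at exactly t = T.
    do = {t: "unpressed" for t in range(first_press)}
    do[first_press] = "pressed"
    return do


def subagents(horizon):
    """Map each possible first-press time to (utility label, intervention)."""
    # The never-pressed subagent is the only one that maximizes u1.
    agents = {math.inf: ("u1", intervention(math.inf, horizon))}
    # Every finite first-press time gets its own subagent maximizing u2.
    for T in range(horizon):
        agents[T] = ("u2", intervention(T, horizon))
    return agents
```

With a horizon of 3 this yields four subagents: one for each of T = 0, 1, 2, plus the never-pressed one. Note that subagent T's intervention says nothing about the button after time T, matching the do() expression in the text.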
Can you explain how this relates to Elliot Thornley’s proposal? It’s pattern matching in my brain but I don’t know the technical details.
I don’t know. Thornley’s proposal got me thinking about subagents as a tool for corrigibility, but I never understood his properties well enough to say how his subagents relate to the counterfactual-optimizing agents in this proposal.