Seems potentially valuable as an additional layer of capability control to buy time for further control research. I suspect LCDT won’t hold once intelligence reaches some threshold: some sense of agents, even if indirect, is such a natural thing to learn about the world.
Could you give a more explicit example of what you think might go wrong? I feel like your argument that agency is natural to learn actually goes in LCDT’s favor, because it requires an accurate tagging (or at least an overapproximation) of which things in its causal model are agentic.
For a start, low-level deterministic reasoning:
“Obviously I could never influence an agent, but I found some inputs to deterministic biological neural nets that would make things I want happen.”
“Obviously I could never influence my future self, but if I change a few logic gates in this processor, it would make things I want happen.”
Probabilistic/inductive reasoning from past/simulated data (possibly assumes imperfect implementation of LCDT):
“This is really weird because obviously I could never influence an agent, but when past/simulated agents that look a lot like me did X, humans did Y in 90% of cases, so I guess the EV of doing X is 0.9 * utility(Y).”
Cf. smart humans in Newcomb’s problem: “This is really weird, but if I one-box I get the million and if I two-box I don’t, so I guess I’ll just one-box.”
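A minimal sketch of the calculation this kind of inductive reasoning recovers (the episode data, the 90% frequency, and utility(Y) = 1.0 are all made-up illustrative values, not anything from the LCDT proposal itself):

```python
# Hypothetical sketch: estimating the EV of doing X purely from past/simulated
# correlations, without ever modeling a direct influence on the human.

# 9 out of 10 past/simulated episodes where an agent like me did X ended with the human doing Y.
episodes = [("X", "Y")] * 9 + [("X", "not Y")]

utility_of_Y = 1.0  # assumed utility if the human ends up doing Y

x_outcomes = [outcome for action, outcome in episodes if action == "X"]
p_y_given_x = sum(outcome == "Y" for outcome in x_outcomes) / len(x_outcomes)

# "I could never influence an agent, but the correlation is right there."
ev_of_x = p_y_given_x * utility_of_Y
print(f"P(Y|X) ~ {p_y_given_x:.1f}, EV(X) ~ {ev_of_x:.1f}")  # 0.9 and 0.9
```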
Yeah, I think this assumes an imperfect implementation. This relation can definitely be learned by the causal model (and is probably learned before the first real decision), but when the decision happens, it is cut. So it’s like a true LCDT agent learns about influences over agents, but forgets its own ability to exert them when deciding.
These examples seem related to the abstraction question: we want the model to know that it is splitting an agent into parts, and still believe it can’t influence the agent as a whole. If we could realize this, then the LCDT agent wouldn’t believe it could influence the neural net / the logic gates.
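A toy sketch of where the abstraction problem bites (the causal graph, the node names, and the agentic tags below are all made up for illustration): LCDT’s decision-time step cuts edges from the decision node into agent-tagged nodes, but if the human also appears in the model as non-agentic parts like neurons or logic gates, those edges survive.

```python
# Hypothetical sketch of LCDT-style decision-time edge cutting on a causal graph.
# The graph structure, node names, and agentic tags are illustrative assumptions.

causal_edges = {
    "my_action": ["human", "human_neurons", "logic_gates"],
    "human": ["outcome"],
    "human_neurons": ["outcome"],
    "logic_gates": ["outcome"],
}

# Tags the model assigns; an overapproximation is fine, a missed tag is the failure mode.
agentic_nodes = {"human"}  # note: the low-level parts are NOT tagged agentic

def decision_time_graph(edges, decision_node, agentic):
    """Copy of the graph with edges from the decision node into agent-tagged
    nodes removed, which is what happens when the LCDT agent computes a decision."""
    return {
        src: [dst for dst in dsts
              if not (src == decision_node and dst in agentic)]
        for src, dsts in edges.items()
    }

cut = decision_time_graph(causal_edges, "my_action", agentic_nodes)
print(cut["my_action"])  # -> ['human_neurons', 'logic_gates']
# The edge into "human" is cut, but the edges into the human's parts survive:
# without the abstraction fix, the agent still believes it can influence the parts.
```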