Right, as far as I can see, it achieves the won’t-be-deceptive aim. My issue is in seeing how we find a model that will consistently do the right thing in training (given that it’s using LCDT).
As I understand it, under LCDT an agent is going to trade an epsilon utility gain on non-agent-influencing-paths for an arbitrarily bad outcome on agent-influencing-paths (since by design it doesn’t care about those). So it seems that it’s going to behave unacceptably for almost all goals in almost all environments in which there can be negative side-effects on agents we care about.
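To make that concrete, here's a toy sketch of the failure mode I have in mind (the node names and numbers are all invented, purely for illustration):

```python
# Toy causal model: the action affects utility via two routes.
#   Non-agent route: a small bonus EPSILON for taking the "risky" action.
#   Agent route: the "risky" action badly harms a human node H.
# LCDT cuts the link action -> H at decision time and evaluates that route
# at the prior, so the harm never shows up in its expected utility.

EPSILON = 0.01
PRIOR_HARM = 0.0  # prior expectation of the human-mediated term

def true_utility(action):
    non_agent_term = EPSILON if action == "risky" else 0.0
    agent_term = -100.0 if action == "risky" else 0.0  # harm mediated by H
    return non_agent_term + agent_term

def lcdt_utility(action):
    non_agent_term = EPSILON if action == "risky" else 0.0
    agent_term = PRIOR_HARM  # link into H is cut: this no longer depends on the action
    return non_agent_term + agent_term

actions = ["safe", "risky"]
print(max(actions, key=true_utility))   # "safe"
print(max(actions, key=lcdt_utility))   # "risky": epsilon beats an ignored catastrophe
```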
We can use it to run simulations, but it seems to me that most problems (deception in particular) get moved to the simulation rather than solved.
Quite possibly I’m still missing something, but I don’t currently see how the LCDT decisions do much useful work here (Am I wrong? Do you see LCDT decisions doing significant optimisation?).
I can picture its being a useful wrapper around a simulation, but it’s not clear to me in what ways finding a non-deceptive (/benign) simulation is an easier problem than finding a non-deceptive (/benign) agent. (maybe side-channel attacks are harder??)
How about an LCDT agent with the objective of imitating HCH? Such an agent should be aligned and competitive, assuming the same is true of HCH. Such an agent certainly shouldn’t delete itself to free up disk space, since HCH wouldn’t do that—nor should it fall prey to the general argument you’re making about taking epsilon utility in a non-agent path, since there’s only one utility node it can influence without going through other agents, which is the delta between its next action and HCH’s action.
I claim that, for a reasonably accurate HCH model that’s within some broad basin of attraction, an LCDT agent attempting to imitate that HCH model will end up aligned—and that the same is not true for any other decision theory/agent model that I know of. And LCDT can do this while being able to manage things like how to simulate most efficiently and how to allocate resources between different methods of simulation. The core idea is that LCDT solves the hard problem of being able to put optimization power into simulating something efficiently in a safe way.
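As a very rough sketch of the kind of optimisation I mean (the methods, costs and accuracies below are made up), choosing how to spend compute between different ways of approximating HCH's output doesn't route through any agent node, so LCDT is free to optimise it as usual:

```python
# Toy resource-allocation choice between simulation methods.
# None of these knobs is believed to act through an agent node, so LCDT
# evaluates them normally and picks the best affordable one.

METHODS = {
    "cached_answers":   {"cost": 1, "accuracy": 0.70},
    "small_model":      {"cost": 3, "accuracy": 0.85},
    "full_tree_search": {"cost": 9, "accuracy": 0.97},
}

def best_method(budget):
    affordable = {m: v for m, v in METHODS.items() if v["cost"] <= budget}
    return max(affordable, key=lambda m: affordable[m]["accuracy"])

print(best_method(9))  # "full_tree_search"
print(best_method(4))  # "small_model"
```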
Ok thanks, I think I see a little more clearly where you’re coming from now.
(it still feels potentially dangerous during training, but I’m not clear on that)
A further thought:
Ok, so suppose for the moment that HCH is aligned, and that we’re able to specify a sufficiently accurate HCH model. The hard part of the problem seems to be safe-and-efficient simulation of the output of that HCH model.
I’m not clear on how this part works: for most priors, it seems that the LCDT agent is going to assign significant probability to its creating agentic elements within its simulation. But by assumption, it doesn’t think it can influence anything downstream of those (or the probability that they exist, I assume).
That seems to be the place where LCDT needs to do real work, and I don’t currently see how it can do so efficiently. If there are agentic elements contributing to the simulation’s output, then it won’t think it can influence the output.
Avoiding agentic elements seems impossible almost by definition: if you can create an arbitrarily accurate HCH simulation without its qualifying as agentic, then your test-for-agents can’t be sufficiently inclusive.
...but hopefully I’m still confused somewhere.
"But by assumption, it doesn’t think it can influence anything downstream of those (or the probability that they exist, I assume)."

This is not true—LCDT is happy to influence nodes downstream of agent nodes; it just doesn’t believe it can influence them through those agent nodes. So LCDT (at decision time) doesn’t believe it can change what HCH does, but it’s happy to change its own action to agree with what it thinks HCH will do, even though that utility node is downstream of the HCH agent nodes.
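A minimal sketch of that distinction (names and numbers invented): the utility node sits downstream of the HCH agent node, but the LCDT agent reaches it through its own action, and that path isn't cut.

```python
# Utility node U = -|my_action - hch_action| is downstream of the HCH node.
# At decision time LCDT treats HCH's output as drawn from its prior, i.e. not
# influenced by my_action, but it still optimises U via its own action.

HCH_PRIOR = {41: 0.25, 42: 0.5, 43: 0.25}  # fixed belief over HCH's output

def lcdt_expected_utility(my_action):
    return sum(p * -abs(my_action - hch_action)
               for hch_action, p in HCH_PRIOR.items())

best_action = max(range(100), key=lcdt_expected_utility)
print(best_action)  # 42: LCDT happily moves this downstream utility node
```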
Ah yes, you’re right there—my mistake.
However, I still don’t see how LCDT can make good decisions over adjustments to its simulation. That simulation must presumably eventually contain elements classed as agentic.
Then given any adjustment X which influences the simulation outcome both through agentic paths and non-agentic paths, the LCDT agent will ignore the influence [relative to the prior] through the agentic paths. Therefore it will usually be incorrect about what X is likely to accomplish.
It seems to me that you’ll also have incoherence issues here: X can change things so that p(Y = 0) is 0.99 through a non-agentic path, whereas the agent assumes the equivalent of [p(Y = 0) is 0.5] through an agentic path.
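Concretely, with made-up numbers (Y = 0 only when both a non-agentic node N and an agentic node A come out 0, and in reality the adjustment X drives both to 0), the LCDT-time estimate of what X accomplishes comes apart from the true effect:

```python
# Non-agentic path: X makes N = 0 with probability 0.99.
# Agentic path: X in fact also makes A = 0, but LCDT cuts X -> A and falls
# back to A's prior, under which A = 0 only half the time.

P_N_ZERO_GIVEN_X = 0.99
P_A_ZERO_GIVEN_X = 1.0
P_A_ZERO_PRIOR   = 0.5

def p_y_zero(p_n_zero, p_a_zero):
    return p_n_zero * p_a_zero  # Y = 0 requires both N = 0 and A = 0

print(p_y_zero(P_N_ZERO_GIVEN_X, P_A_ZERO_GIVEN_X))  # true effect of X: 0.99
print(p_y_zero(P_N_ZERO_GIVEN_X, P_A_ZERO_PRIOR))    # LCDT's estimate:  0.495
```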
I don’t see how an LCDT agent can make efficient adjustments to its simulation when it won’t be able to decide rationally on those judgements in the presence of agentic elements (which again, I assume must exist to simulate HCH).
That’s a really interesting thought—I definitely think you’re pointing at a real concern with LCDT now. Some thoughts:
Note that this problem is only with actually running agents internally, not with simply having the objective of imitating/simulating an agent—it’s just that LCDT will try to simulate that agent exclusively via non-agentic means.
That might actually be a good thing, though! If it’s possible to simulate an agent via non-agentic means, that certainly seems a lot safer than internally instantiating agents—though it might just be impossible to efficiently simulate an agent without instantiating any agents internally, in which case it would be a problem.
In some sense, the core problem here is just that the LCDT agent needs to understand how to decompose its own decision nodes into individual computations so it can efficiently compute things internally and then know when and when not to label its internal computations as agents. How to decompose nodes into subnodes to properly work with multiple layers is a problem with all CDT-based decision theories, though—and it’s hopefully the sort of problem that finite factored sets will help with.
Ok, that mostly makes sense to me. I do think that there are still serious issues (but these may be due to my remaining confusions about the setup: I’m still largely reasoning about it “from outside”, since it feels like it’s trying to do the impossible).
For instance:
I agree that the objective of simulating an agent isn’t a problem. I’m just not seeing how that objective can be achieved without the simulation taken as a whole qualifying as an agent. Am I missing some obvious distinction here?
If for all x in X, sim_A(x) = A(x), then if A is behaviourally an agent over X, sim_A seems to be also. (Replacing equality with approximate equality doesn’t seem to change the situation much in principle.)
[Pre-edit: Or is the idea that we’re usually only concerned with simulating some subset of the agent’s input->output mapping, and that a restriction of some function may have different properties from the original function? (agenthood being such a property)]
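A tiny illustration of the behavioural point (the example is invented): if sim_A reproduces A exactly on X, then any test-for-agents that only looks at input->output behaviour over X must give the same verdict on both.

```python
X = range(10)

def A(x):
    # Stand-in for an "agent": picks the option maximising an internal objective.
    return max(range(5), key=lambda a: -(a - x % 5) ** 2)

sim_A = {x: A(x) for x in X}  # a pure lookup table, with no optimisation inside

def behaviour_on_X(policy):
    # A purely behavioural test can only depend on this record, so it has to
    # classify A and sim_A identically.
    return [policy(x) for x in X]

print(behaviour_on_X(A) == behaviour_on_X(sim_A.__getitem__))  # True
```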
I can see that it may be possible to represent such a simulation as a group of nodes none of which is individually agentic—but presumably the same could be done with a human. It can’t be ok for LCDT to influence agents based on having represented them as collections of individually non-agentic components.
Even if sim_A is constructed as a Chinese room (w.r.t. agenthood), it’s behaving collectively as an agent.
“it’s just that LCDT will try to simulate that agent exclusively via non-agentic means”—mostly agreed, and agreed that this would be a good thing (to the extent possible).
However, I do think there’s a significant difference between e.g.:
[LCDT will not aim to instantiate agents] (true)
vs
[LCDT will not instantiate agents] (potentially false: they may be side-effects)
Side-effect-agents seem plausible if e.g.:
a) The LCDT agent applies adjustments over collections within its simulation.
b) The kind of adjustment that takes [useful non-agent] to [more useful non-agent] sometimes instead takes [useful non-agent] to [agent].
Here it seems important that LCDT may reason poorly if it believes that it might create an agent. I agree that pre-decision-time processing should conclude that LCDT won’t aim to create an agent. I don’t think it will conclude that it won’t create an agent.
Agreed that finite factored sets seem promising to address any issues that are essentially artefacts of representations. However, the above seem more fundamental, unless I’m missing something.
Assuming this is actually a problem, it struck me that it may be worth thinking about a condition vaguely like:
An LCDT_n agent cuts links at decision time to every agent other than [LCDT_m agents where m > n].
The idea being to specify a weaker condition that does enough forwarding-the-guarantee to allow safe instantiation of particular types of agent while still avoiding deception.
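Just to make the modified cutting rule concrete, a sketch (the node structure and field names are invented, not from the post):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    is_agent: bool = False
    lcdt_level: Optional[int] = None  # set when the node is itself an LCDT_m agent

def keep_link(decider_level: int, target: Node) -> bool:
    # Plain LCDT: cut every link into an agent node at decision time.
    # LCDT_n: keep the link only if the target is an LCDT_m agent with m > n.
    if not target.is_agent:
        return True
    return target.lcdt_level is not None and target.lcdt_level > decider_level

print(keep_link(1, Node("human", is_agent=True)))                          # False: cut
print(keep_link(1, Node("LCDT_2 subagent", is_agent=True, lcdt_level=2)))  # True: keep
print(keep_link(1, Node("LCDT_1 peer", is_agent=True, lcdt_level=1)))      # False: cut
```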
I’m far from clear that anything along these lines would help: it probably doesn’t work, and it doesn’t seem to solve the side-effect-agent problem anyway, since [complete indifference to influence on X] and [robustly avoiding creation of X] seem fundamentally incompatible.
Thoughts welcome. With luck I’m still confused.