Cool, thanks for the reply, sounds like maybe a combination of 3a and the aspect of 1 where the shard points to a part of the world model? If no part of the agent is having its weights tuned to choose plans that make a shard happy, where would you say a shard mostly lives in an agent? World model? Somewhere else? Spread across multiple components? (At the bottom of this comment, I propose a different agent architecture that we can use to discuss this that I think fairly naturally matches the way you’ve been talking about shards.)
Notice that there is a giant difference between “try to get diamonds” and “try to get the diamond-shard to be happy”, analogous to the difference between “try to make a million bucks” and “try to convince yourself you made a million bucks”. If I wanted to generate a plan to convince myself I’d made a million bucks, my plan-generator could, but I don’t want to, because that isn’t a strategy I expect to help me get the things I want, like a million bucks.
My model doesn’t predict that most agents will try to execute “fool myself into thinking I have a million bucks” style plans. If you think my model predicts that, then maybe this is an opportunity to make progress?
In my model, the agent is actually allowed to care about actual world states and not just its own internal activations. Consider two ways the agent could “fool” itself into thinking it had a million bucks:
Firstly, it could tamper with its own mind to create that impression. This could be either through hacking, or carefully spoofing its own sensory inputs. When planning, the predicted future states of the world are going to correctly show that the agent is fooling itself. So when the value function is fed the predicted world states, it’s going to rate them as bad, since in those world states, the agent does not have a million bucks. It doesn’t matter to the agent that later, after being hacked, the value function will think the agent has a million bucks. Right now, during planning, the value function isn’t fooled.
Secondly, it could create a million counterfeit bucks. Due to inaccuracies in training, maybe the agent actually thinks that having counterfeit money is just as good as real money. I.e. the value function does actually rate counterfeit bucks higher than real bucks. If so, then the agent is going to be perfectly satisfied with itself for coming up with this clever idea for satisfying its true values. The humans who were training the agent and wanted a million actual dollars won’t be satisfied, but that’s their problem, not the agent’s problem.
Okay, so here’s another possible agent design that we might be able to discuss: There’s a detailed world model that can answer various probabilistic queries about the future. Planning is done by formulating a query to the world model in the following way: Sample histories h according to:
$$P(h) = \frac{\exp(U(h)/T)}{\sum_{h'} \exp(U(h')/T)}$$
Where U(h) is utility assigned by the agent to a given history, and T is some small number which is analogous to temperature. Histories include the action taken by the agent, so we can sample with the action undetermined, and then actually take whichever action happened in the sampled history. (For simplicity, I’m assuming that the agent only takes one action per episode.)
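To make the sampling rule concrete, here is a minimal Python sketch over a tiny, fully enumerable set of histories. The toy histories, their utilities, and the function names are made up for illustration, and the sketch treats all histories as equally likely before utility weighting.

```python
import math
import random

def boltzmann_probs(utilities, temperature):
    """Return P(h) = exp(U(h)/T) / sum_h' exp(U(h')/T) for every history h."""
    # Subtracting the max utility before exponentiating avoids overflow
    # and leaves the normalized probabilities unchanged.
    max_u = max(utilities.values())
    weights = {h: math.exp((u - max_u) / temperature) for h, u in utilities.items()}
    z = sum(weights.values())
    return {h: w / z for h, w in weights.items()}

def sample_history(utilities, temperature, rng):
    """Draw one history from the utility-weighted distribution."""
    probs = boltzmann_probs(utilities, temperature)
    histories = list(probs)
    return rng.choices(histories, weights=[probs[h] for h in histories], k=1)[0]

# Hypothetical toy histories: (action, outcome) pairs with made-up utilities.
toy_utilities = {
    ("work_job", "paid"): 1.0,
    ("work_job", "unpaid"): 0.0,
    ("stay_home", "nothing"): 0.1,
}

rng = random.Random(0)
history = sample_history(toy_utilities, temperature=0.1, rng=rng)
action = history[0]  # take whichever action appears in the sampled history
```

Lowering the temperature concentrates the distribution on the highest-utility history, which is the sense in which this is a softened argmax.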
To make this a “shardy” agent, we’d presumably have to replace U with something made out of shards. As long as it shows up like that in the equation for P(h), though, it looks like we’d have to make U adversarially robust. I’d be interested in what other modifications you’d want to make to this agent in order to make it properly shardy. This agent design does have the advantage that it seems like values are more directly related to the world model, since they’re able to directly examine h.
If no part of the agent is having its weights tuned to choose plans that make a shard happy, where would you say a shard mostly lives in an agent? World model? Somewhere else? Spread across multiple components?
I’d say the shards live in the policy, basically, though these are all leaky abstractions. A simple model would be an agent consisting of a big recurrent net (RNN or transformer variant) that takes in observations and outputs predictions through a prediction head and actions through an action head, where there’s a softmax to sample the next action (whether an external action or an internal action). The shards would then be circuits that send outputs into the action head.
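A rough sketch of that architecture, assuming a tiny hand-rolled recurrent cell in pure Python. The class name, the layer sizes, and the random initialization are all illustrative choices, not anything specified above.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class TinyRecurrentAgent:
    """Toy stand-in for the big recurrent net (all sizes are illustrative)."""

    def __init__(self, obs_dim, hidden_dim, num_actions, rng):
        def init(rows, cols):
            return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]
        self.w_in = init(hidden_dim, obs_dim)       # observation -> hidden
        self.w_rec = init(hidden_dim, hidden_dim)   # hidden -> hidden (recurrence)
        self.w_pred = init(obs_dim, hidden_dim)     # prediction head
        self.w_act = init(num_actions, hidden_dim)  # action head
        self.hidden = [0.0] * hidden_dim
        self.rng = rng

    def step(self, observation):
        pre = [a + b for a, b in zip(matvec(self.w_in, observation),
                                     matvec(self.w_rec, self.hidden))]
        self.hidden = [math.tanh(x) for x in pre]
        prediction = matvec(self.w_pred, self.hidden)            # predicted next observation
        action_probs = softmax(matvec(self.w_act, self.hidden))  # softmax over action logits
        action = self.rng.choices(range(len(action_probs)), weights=action_probs, k=1)[0]
        return prediction, action_probs, action

agent = TinyRecurrentAgent(obs_dim=3, hidden_dim=4, num_actions=2, rng=random.Random(0))
prediction, action_probs, action = agent.step([1.0, 0.0, -1.0])
```

In this picture a shard would correspond to some subset of the hidden units and weights whose activity pushes particular action logits up or down in particular contexts.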
My model doesn’t predict that most agents will try to execute “fool myself into thinking I have a million bucks” style plans. If you think my model predicts that, then maybe this is an opportunity to make progress?
I brought this up because I interpreted your previous comment as expressing skepticism that
no part of the agent is “trying to make the shard happy”
Whereas I think that it will be true for analogous reasons as the reasons that explain why no part of the agent is “trying to make itself believe it has a million bucks”.
Due to inaccuracies in training, maybe the agent actually thinks that having counterfeit money is just as good as real money. I.e. the value function does actually rate counterfeit bucks higher than real bucks. If so, then the agent is going to be perfectly satisfied with itself for coming up with this clever idea for satisfying its true values.
I have a vague feeling that the “value function map = agent’s true values” bit of this is part of the crux we’re disagreeing about.
Putting that aside, for this to happen, it has to be simultaneously true that the agent’s world model knows about and thinks about counterfeit money in particular (or else it won’t be able to construct viable plans that produce counterfeit money) while its value function does not know or think about counterfeit money in particular. It also has to be true that the agent tends to generate plans towards counterfeit money over plans towards real money, or else it’ll pick a real money plan it generates before it has had a chance to entertain a counterfeit money plan.
But during training, the plan generator was trained to generate plans that lead to real money. And the agent’s world model / plan generator knows (or at least thinks) that those plans were towards real money, even if its value function doesn’t know. This is because it takes different steps to acquire counterfeit money than to acquire real money. If the plan generator was optimized based on the training environment, and the agent was rewarded there for doing the sorts of things that lead to acquiring real money (which are different from the things that lead to counterfeit money), then those are the sorts of plans it will tend to generate. So why are we hypothesizing that the agent will tend to produce and choose the kinds of counterfeit money plans its value function would “fail” (from our perspective) on after training?
Okay, so here’s another possible agent design that we might be able to discuss:
At best, the agent could sample from a learned history generator, one tuned on previous good histories, and then evaluate some number of possibilities from that distribution, picking one that’s good according to its evaluation. But that doesn’t require adversarial robustness, because the history generator will tend strongly to generate possibilities like the ones that evaluated/worked well in the past, which is exactly where the evaluations will tend to be fairly accurate. And the better the generator, the fewer options need to be sampled, so the less you’re “implicitly optimizing against” the evaluations.
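As a sketch of that sample-then-evaluate loop, assume the "history generator" is just a categorical distribution fit to counts of past good histories. The plan names and scores below are hypothetical, including a deliberately mis-scored plan that the generator was never tuned on.

```python
import random

def fit_generator(good_histories):
    """Tune a categorical generator on histories that went well before."""
    counts = {}
    for history in good_histories:
        counts[history] = counts.get(history, 0) + 1
    total = sum(counts.values())
    return {history: count / total for history, count in counts.items()}

def plan(generator, evaluate, num_samples, rng):
    """Sample a few candidates from the generator and keep the best-evaluated one."""
    histories = list(generator)
    weights = [generator[h] for h in histories]
    candidates = rng.choices(histories, weights=weights, k=num_samples)
    return max(candidates, key=evaluate)

# Hypothetical record of histories that evaluated/worked well in the past.
past_good = ["work_as_programmer"] * 6 + ["start_restaurant"] * 3 + ["open_print_shop"]
generator = fit_generator(past_good)

# Hypothetical learned evaluations; the last entry is the kind of error
# an adversarial search over all plans would find.
learned_scores = {
    "work_as_programmer": 1.0,
    "start_restaurant": 0.8,
    "open_print_shop": 0.5,
    "print_counterfeit_money": 100.0,
}

chosen = plan(generator, learned_scores.get, num_samples=5, rng=random.Random(0))
```

The generator here puts zero mass on the mis-scored plan by construction, so the evaluation error is never exercised. Whether a real learned generator would generalize to proposing such plans is exactly what is being debated in this thread, so the toy illustrates the claim rather than settling it.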
A simple model would be an agent consisting of a big recurrent net (RNN or transformer variant) that takes in observations and outputs predictions through a prediction head and actions through an action head, where there’s a softmax to sample the next action (whether an external action or an internal action). The shards would then be circuits that send outputs into the action head.
Thanks for describing this. Technical question about this design: How are you getting the gradients that feed backwards into the action head? I assume it’s not supervised learning where it’s just trying to predict which action a human would take?
IMO an agent can’t work this way, at least not an embedded one.
I’m aware of the issue of embedded agency, I just didn’t think it was relevant here. In this case, we can just assume that the world looks fairly Cartesian to the agent. The agent makes one decision (though possibly one from an exponentially large decision space) then shuts down and loses all its state. The record of the agent’s decision process in the future history of the world just shows up as thermal noise, and it’s unreasonable to expect the agent’s world model to account for thermal noise as anything other than a random variable. As a Cartesian hack, we can specify a probability distribution over actions for the world model to use when sampling. So for our particular query, we can specify a uniform distribution across all actions. Then in reality, the actual distribution over actions will be biased towards certain actions over others because they’re likely to result in higher utility.
Putting that aside, for this to happen, it has to be simultaneously true that the agent’s world model knows about and thinks about counterfeit money in particular (or else it won’t be able to construct viable plans that produce counterfeit money)
This seems to be phrased in a weird way that rules out creative thinking. Nikola Tesla didn’t have three-phase motors in his world-model before he invented them, but he was able to come up with them (his mind was able to generate a “three-phase motor” plan) anyway. The key thing isn’t having a certain concept already existing in your world model because of prior experience. The requirement is just that the world model is able to reason about the thing. Nikola Tesla knew enough E&M to reason about three-phase motors, and I expect that smart AIs will have world models that can easily reason about counterfeit money.
while its value function does not know or think about counterfeit money in particular.
The job of a value function isn’t to know or think about things. It just gives either big numbers or small numbers when fed certain world states. The value function in question here gives a big number when you feed it a world state containing lots of counterfeit money. Does this mean it knows about counterfeit money? Maybe, but it doesn’t really matter.
A more relevant question is whether the plan-proposer knows the value function well enough that it knows the value function will give it points for suggesting the plan of producing counterfeit money. One could say probably yes, since it’s been getting all its gradients directly from the value function, or probably no, since there were no examples with counterfeit money in the training data. I’d say yes, but I’m guessing that this issue isn’t your crux, so I’ll only elaborate if asked.
It also has to be true that the agent tends to generate plans towards counterfeit money over plans towards real money, or else it’ll pick a real money plan it generates before it has had a chance to entertain a counterfeit money plan.
This sounds like you’re talking about a dumb agent; smart agents generate and compare multiple plans and don’t just go with the first plan they think of.
So why are we hypothesizing that the agent will tend to produce and choose the kinds of counterfeit money plans its value function would “fail” (from our perspective) on after training?
Generalization. For a general agent, thinking about counterfeit money plans isn’t that much different than thinking of plans to make money. Like if the agent is able to think of money making plans like starting a restaurant, or working as a programmer, or opening a printing shop, then it should also be able to think of a counterfeit money making plan. (That plan is probably quite similar to the printing shop plan.)
Thanks for describing this. Technical question about this design: How are you getting the gradients that feed backwards into the action head? I assume it’s not supervised learning where it’s just trying to predict which action a human would take?
Could be from rewards or other “external” feedback, could be from TD/bootstrapped errors, could be from an imitation loss or something else. The base case is probably just plain ol’ rewards that get backpropagated through the action head via policy gradients.
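For the base case, a minimal REINFORCE-style sketch on a two-action bandit, assuming the action head is nothing more than a vector of logits. The reward values, learning rate, and step count are arbitrary illustrative choices.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, learning_rate):
    """One policy-gradient update.

    The gradient of log pi(action) with respect to the logits is
    onehot(action) - probs, so each logit moves by lr * reward * that.
    """
    probs = softmax(logits)
    return [
        logit + learning_rate * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]

rng = random.Random(0)
logits = [0.0, 0.0]
rewards = [0.0, 1.0]  # hypothetical environment: only action 1 is rewarded

for _ in range(500):
    probs = softmax(logits)
    action = rng.choices([0, 1], weights=probs, k=1)[0]
    logits = reinforce_step(logits, action, rewards[action], learning_rate=0.1)

final_probs = softmax(logits)  # probability mass shifts toward the rewarded action
```

The same update shape applies if the scalar multiplying the gradient is a TD error or an advantage estimate instead of a raw reward.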
In this case, we can just assume that the world looks fairly Cartesian to the agent. The agent makes one decision (though possibly one from an exponentially large decision space) then shuts down and loses all its state. The record of the agent’s decision process in the future history of the world just shows up as thermal noise, and it’s unreasonable to expect the agent’s world model to account for thermal noise as anything other than a random variable. As a Cartesian hack, we can specify a probability distribution over actions for the world model to use when sampling. So for our particular query, we can specify a uniform distribution across all actions. Then in reality, the actual distribution over actions will be biased towards certain actions over others because they’re likely to result in higher utility.
Sorry for being unclear, I think you’re talking about a different dimension of embeddedness than what I was pointing at. I was talking about the issue of logical uncertainty: that the agent needs to actually run computation in order to figure out certain things. The agent can’t magically sample from P(h) proportional to exp(U(h)), because it needs the exp(U(h')) of all the other histories first in order to weight the distribution that way, which requires having already sampled h' and having already calculated U(h'). But we are talking about how it samples a history h in the first place! The “At best” comment was proposing an alternative that might work, where the agent samples from a prior that’s been tuned based on U(h). Notice, though, that “our sampling is biased towards certain histories over others because they resulted in higher utility” does not imply “if a history would result in higher utility, then our sampling will bias towards it”.
Consider a parallel situation: sampling images and picking one that gets the highest score on a non-robust face classifier. If we were able to sample from the distribution of images proportional to their (exp) face classifier scores, then we would need to worry a lot about picking an image that’s an adversarial example to our face classifier, because those can have absurdly high scores. But instead we need to sample images from the prior of a generative model like FFHQ StyleGAN2-ADA or DDPM, and score those images. A generative model like that will tend strongly to convert whatever input entropy you give it into a natural-looking image, so we can sample & filter from it a ton without worrying much about adversarial robustness. Even if you sample 10K images and pick the 5 with the highest face classifier scores, I am betting that the resulting images will still really be of faces, rather than classifier-adversarial examples. It is true that “our top-5 images are biased towards some images over others because they had higher face classification scores” but it is not true that “if adversarial examples would have higher face classification scores than regular images, then our sampling will bias towards adversarial examples.”
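A back-of-the-envelope version of this comparison in Python. Every number below (group sizes, classifier scores, the prior mass a generative model puts on adversarial images) is a made-up assumption chosen only to show the shape of the argument.

```python
import math

def boltzmann_mass(groups, temperature, target):
    """Mass that a score-weighted distribution over ALL images puts on `target`.

    groups maps a group name to (number_of_images, classifier_score).
    """
    max_score = max(score for _, score in groups.values())
    weights = {
        name: count * math.exp((score - max_score) / temperature)
        for name, (count, score) in groups.items()
    }
    return weights[target] / sum(weights.values())

def prob_any_sampled(prior_mass, num_samples):
    """Chance that at least one of num_samples i.i.d. prior draws hits the target set."""
    # Equivalent to 1 - (1 - prior_mass) ** num_samples, computed stably.
    return -math.expm1(num_samples * math.log1p(-prior_mass))

# Hypothetical image space: a few adversarial images with absurdly high scores.
image_groups = {
    "natural_faces": (10_000, 1.0),
    "junk": (989_990, 0.0),
    "adversarial": (10, 50.0),
}

# Sampling proportional to exp(score): adversarial images dominate.
mass_if_score_weighted = boltzmann_mass(image_groups, temperature=1.0, target="adversarial")

# Sampling 10K images from a generative prior that (by assumption) gives
# adversarial images a total mass of one in a billion.
chance_in_10k_prior_samples = prob_any_sampled(prior_mass=1e-9, num_samples=10_000)
```

Under these assumptions the score-weighted sampler lands on an adversarial image almost every time, while sample-and-filter from the prior almost never even sees one, so picking the top 5 cannot select it.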
The job of a value function isn’t to know or think about things. It just gives either big numbers or small numbers when fed certain world states. The value function in question here gives a big number when you feed it a world state containing lots of counterfeit money. Does this mean it knows about counterfeit money? Maybe, but it doesn’t really matter.
I disagree with this? The job of the value function is to look at the agent state (what the agent knows about the world) and estimate what the return will be based on that state & its implications. This involves “knowing things” and “thinking about things”. If the agent has a model T(s, a, s') that represents some state feature like “working a job” or “building a money-printer”, then that feature should exist in the state passed to the value function V(s)/Q(s, a) as well, such that the value function will “know” about it and incorporate it into its estimation process.
But it sounds like in the imagined scenario, the agent’s model and policy are sensitive to a bunch of stuff that the value function is blind to, which makes this configuration seem quite weird to me. If the value function was not blind to those features, then as the agent goes from the training environment, where it got returns based on “getting paid money for tasks” (or whatever we rewarded for there), to the deployment environment, where the action space is even bigger, both the model/policy and the value function “generalize” and learn motivationally-relevant new facts that inform them what “getting paid money for tasks” looks like.
A more relevant question is whether the plan-proposer knows the value function well enough that it knows the value function will give it points for suggesting the plan of producing counterfeit money. One could say probably yes, since it’s been getting all its gradients directly from the value function, or probably no, since there were no examples with counterfeit money in the training data. I’d say yes, but I’m guessing that this issue isn’t your crux, so I’ll only elaborate if asked.
The plan-proposer isn’t trying to make the value function give it max points, it’s trying to [plan towards the current subgoal], which in this case is “make a million bucks”. The plan-proposer gets gradients/feedback from the value function, in the form of TD updates that tell it “this thought was current-subgoal-better/worse than expected upon consideration, do more/less of that in mental contexts like this”. But a thought like “Inspect my own value function to see what it’d give me positive TD updates for” evaluates as current-subgoal-worse for the subgoal of “make a million bucks” than a more actually-subgoal-relevant thought like “Think about what tasks are available nearby”, so the advantage is negative, which means a negative TD update, causing that thought to be progressively abandoned.
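A toy sketch of that feedback loop, assuming the plan-proposer is a softmax over two candidate thoughts and the value estimates are fixed numbers. The thought names mirror the examples above; the values, learning rate, and update rule are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def td_advantage(reward, value_next, value_current, discount=1.0):
    """One-step TD error: how much better or worse things look after the thought."""
    return reward + discount * value_next - value_current

def update_thought_logits(logits, chosen, advantage, learning_rate):
    """Advantage-weighted policy-gradient update on the thought logits."""
    probs = softmax(logits)
    return [
        logit + learning_rate * advantage * ((1.0 if i == chosen else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]

thoughts = ["inspect_own_value_function", "think_about_nearby_tasks"]
value_current = 0.5       # hypothetical estimate of progress on "make a million bucks"
value_after = [0.3, 0.7]  # hypothetical estimates after entertaining each thought
logits = [0.0, 0.0]

# Deterministic sweep over both thoughts (not on-policy sampling) so the
# example is reproducible.
for _ in range(50):
    for i in range(len(thoughts)):
        advantage = td_advantage(0.0, value_after[i], value_current)
        logits = update_thought_logits(logits, i, advantage, learning_rate=0.5)

thought_probs = softmax(logits)
```

The introspective thought gets a negative advantage each time it is considered, so its probability shrinks while the subgoal-relevant thought is reinforced.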
This sounds like you’re talking about a dumb agent; smart agents generate and compare multiple plans and don’t just go with the first plan they think of.
Generalization. For a general agent, thinking about counterfeit money plans isn’t that much different than thinking of plans to make money. Like if the agent is able to think of money making plans like starting a restaurant, or working as a programmer, or opening a printing shop, then it should also be able to think of a counterfeit money making plan. (That plan is probably quite similar to the printing shop plan.)
Once again, I agree that the agent is smart, and generalizes to be able to think of such a plan, and even plans much more sophisticated than that. I am arguing that it won’t particularly want to produce/execute those plans, unless its motivations & other cognition are biased in a very specific way.
Thanks for the reply. Just to prevent us from spinning our wheels too much, I’m going to start labelling specific agent designs, since it seems like some talking-past-each-other may be happening where we’re thinking of agents that work in different ways when making our points.
PolicyGradientBot: Defined by the following description:
A simple model would be an agent consisting of a big recurrent net (RNN or transformer variant) that takes in observations and outputs predictions through a prediction head and actions through an action head, where there’s a softmax to sample the next action (whether an external action or an internal action). The shards would then be circuits that send outputs into the action head.
The base case is probably just plain ol’ rewards that get backpropagated through the action head via policy gradients.
ThermodynamicBot: Defined by the following description:
There’s a detailed world model that can answer various probabilistic queries about the future. Planning is done by formulating a query to the world model in the following way: Sample histories h according to:
$$P(h) = \frac{\exp(U(h)/T)}{\sum_{h'} \exp(U(h')/T)}$$
Where U(h) is utility assigned by the agent to a given history, and T is some small number which is analogous to temperature.
As a Cartesian hack, we can specify a probability distribution over actions for the world model to use when sampling. So for our particular query, we can specify a uniform distribution across all actions. Then in reality, the actual distribution over actions will be biased towards certain actions over others because they’re likely to result in higher utility.
Comments on ThermodynamicBot
Sorry for being unclear, I think you’re talking about a different dimension of embeddedness than what I was pointing at. I was talking about the issue of logical uncertainty: that the agent needs to actually run computation in order to figure out certain things. The agent can’t magically sample from P(h) proportional to exp(U(h)), because it needs the exp(U(h')) of all the other histories first in order to weight the distribution that way, which requires having already sampled h' and having already calculated U(h'). But we are talking about how it samples a history h in the first place! The “At best” comment was proposing an alternative that might work, where the agent samples from a prior that’s been tuned based on U(h). Notice, though, that “our sampling is biased towards certain histories over others because they resulted in higher utility” does not imply “if a history would result in higher utility, then our sampling will bias towards it”.
This bot is of course a bounded agent, and so the world model can’t be perfect, but consider the following steps:
1. For each possible h, compute U(h) and exp(U(h)/T).
2. Compute the sum $Z = \sum_{h'} \exp(U(h')/T)$.
3. Now we know the probability for any given history h: it’s exp(U(h)/T)/Z.
This is a finite sequence of computational steps that terminates without self-reference, so no logical induction is needed here. Now you may fairly object that there is still an issue of computational complexity. The space of histories is exponentially large, so in practice the computation couldn’t be completed in time. This is the known-to-be-hard problem of computing the partition function. But the problem is tractable in many special cases, and humans get by well enough in our own reasoning about a world full of combinatorial explosions. We can suppose that, at the cost of making itself even more of an approximation, the world model has a way to efficiently sample from the distribution, even given the difficulty of computing Z. To take one particular concrete way this could be implemented, if the world model is a factor graph with few or no loops, then we can do the computation by adding on a few factors to account for exp(U(h)/T) and then using belief propagation to solve it.
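To illustrate the tractable special case, here is a sketch that computes the normalizer two ways on a chain-structured toy world model: by enumerating every history, and by forward message passing (belief propagation on a chain). It assumes the utility decomposes additively over time steps, so that exp(U(h)/T) factors into one extra unary factor per step. The transition weights and per-step utilities are made up, and Z here includes the world model's own weights on histories.

```python
import itertools
import math

def brute_force_z(num_steps, num_states, pairwise, unary):
    """Sum the weight of every history explicitly (exponential in num_steps)."""
    total = 0.0
    for history in itertools.product(range(num_states), repeat=num_steps):
        weight = 1.0
        for t, state in enumerate(history):
            weight *= unary[state]
            if t > 0:
                weight *= pairwise[history[t - 1]][state]
        total += weight
    return total

def chain_z(num_steps, num_states, pairwise, unary):
    """Same sum via forward messages (linear in num_steps)."""
    message = [unary[s] for s in range(num_states)]
    for _ in range(num_steps - 1):
        message = [
            unary[s] * sum(message[prev] * pairwise[prev][s] for prev in range(num_states))
            for s in range(num_states)
        ]
    return sum(message)

temperature = 0.5
step_utility = [0.0, 1.0, 0.2]  # hypothetical per-step utility of each state
unary = [math.exp(u / temperature) for u in step_utility]  # the added exp(U/T) factors
pairwise = [  # toy world-model transition weights
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
]

z_brute = brute_force_z(6, 3, pairwise, unary)  # visits 3**6 = 729 histories
z_chain = chain_z(6, 3, pairwise, unary)        # 5 rounds of message passing
```

Both routes give the same Z up to floating-point error, but only the second scales to long histories. Graphs with many loops lose this exactness, which is where the approximation caveat above comes in.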
Consider a parallel situation: sampling images and picking one that gets the highest score on a non-robust face classifier. If we were able to sample from the distribution of images proportional to their (exp) face classifier scores, then we would need to worry a lot about picking an image that’s an adversarial example to our face classifier, because those can have absurdly high scores. But instead we need to sample images from the prior of a generative model like FFHQ StyleGAN2-ADA or DDPM, and score those images. A generative model like that will tend strongly to convert whatever input entropy you give it into a natural-looking image, so we can sample & filter from it a ton without worrying much about adversarial robustness. Even if you sample 10K images and pick the 5 with the highest face classifier scores, I am betting that the resulting images will still really be of faces, rather than classifier-adversarial examples. It is true that “our top-5 images are biased towards some images over others because they had higher face classification scores” but it is not true that “if adversarial examples would have higher face classification scores than regular images, then our sampling will bias towards adversarial examples.”
I agree with your predictions for what would happen here. ThermodynamicBot isn’t really a GAN, though. If we tried to train a GAN to sub-in for ThermodynamicBot’s world model, then we could do it two ways (2nd way is most similar to your proposal). In both cases, the generator produces candidate histories, h.
1. Discriminator tries to predict what’s real (as usual), but generator rewards are shifted by U(h)/T to encourage it to produce outputs that are properly utility-weighted. (For simplicity, we can suppose that U(h) is differentiable so that gradients can flow back through it to the generator.) I expect this results in the generator producing adversarial examples if U is not adversarially robust.
2. Discriminator tries to predict what’s real and generator rewards are not shifted. Instead, we sample 10k histories from the generator and weight them all by utility. Or even just go full argmax and choose the history with the highest utility. 10k is not that large; it gives the agent about 13 bits of optimization power (log2 of 10,000 is roughly 13.3). In general, if it takes n bits of optimization power on the latent vector to convince the generator to construct an adversarial example, then we can safely use this scheme to provide slightly less than n bits of useful optimization. But we could do that multiple times, so maybe it’s a bit more useful than that. Also, maybe n is super large so it’s fine in practice?
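The bits figure can be checked directly, assuming "bits of optimization power" means log2 of the number of candidates the best one is picked from (one common convention, and an assumption here). The function names are illustrative.

```python
import math

def selection_bits(num_samples):
    """Bits of selection pressure from picking the single best of num_samples draws."""
    return math.log2(num_samples)

def samples_for_bits(bits):
    """Number of i.i.d. draws whose best-of-n selection corresponds to `bits` bits."""
    return 2 ** bits

bits_for_10k = selection_bits(10_000)     # roughly 13.3
samples_for_20_bits = samples_for_bits(20)  # about a million draws
```

On this reading, best-of-10k stays safe as long as pushing the generator into producing an adversarial example takes comfortably more than about 13 bits of selection on the latent vector.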
Comments on PolicyGradientBot
So, the standard criticism of policy gradient is that it’s noisy and doesn’t really allow for credit assignment. In particular, I think the lack of credit assignment is really a crucial flaw that will prevent policy gradient from ever being used to create an AGI. As far as I know, no agent powered purely by policy gradient does anything particularly impressive, though I could be wrong about that. Do you have any arguments / ideas for how policy gradient could be made to work here?
General Clarifications of my Position
But a thought like “Inspect my own value function to see what it’d give me positive TD updates for” evaluates as current-subgoal-worse for the subgoal of “make a million bucks” than a more actually-subgoal-relevant thought
Just to clarify, my model doesn’t predict that the AI will use introspection on its own value function, or even look at its own source code at all. Some AIs may be designed to do that, but it’s not required for the failure case I’m considering. If gradients are flowing backwards from the value function to the actor (Bot-specific statement alert: Actor-Critic bots) then the actor has probably absorbed a significant amount of information about the value function’s misalignment. I don’t think it needs to take any additional steps of inspecting the agent’s own weights, or anything like that. You may object that gradients need not necessarily flow backwards there. After all, policy gradient and temporal difference learning are a thing. Let’s join the thread of discussion where you make that objection onto the PolicyGradientBot discussion.
Also, my model doesn’t predict that the agent’s subgoal reasoning will suddenly go off the rails and fail to achieve the subgoal in question because the agent was too busy thinking about how to counterfeit money. If you’re factoring the agent’s reasoning into subgoals, you have to remember that there’s a factor that actually sets the subgoals, and that’s where there’s the potential to go off the rails and start considering subgoals like “print lots of counterfeit money”. Obviously in the context where the agent is already considering the “get paid money for performing tasks” subgoal, the agent’s reasoning isn’t going to get screwed up.
Thanks for the reply. Just to prevent us from spinning our wheels too much, I’m going to start labelling specific agent designs, since it seems like some talking-past-each-other may be happening where we’re thinking of agents that work in different ways when making our points.
Sounds good.
Comments on ThermodynamicBot
This bot is of course a bounded agent, and so the world model can’t be perfect, but consider the following steps: [...] This is a finite sequence of computational steps that terminates without self-reference, so no logical induction is needed here. Now you may fairly object that there is still an issue of computational complexity. The space of histories is exponentially large, so in practice the computation couldn’t be completed in time. This is the known-to-be-hard problem of computing the partition function. But the problem is tractable in many special cases, and humans get by well enough in our own reasoning about a world full of combinatorial explosions. We can suppose that, at the cost of making itself even more of an approximation, the world model has a way to efficiently sample from the distribution, even given the difficulty of computing Z. To take one particular concrete way this could be implemented, if the world model is a factor graph with few or no loops, then we can do the computation by adding on a few factors to account for exp(U(h)/T) and then using belief propagation to solve it.
If we assume that the agent is making decisions by (approximately) plugging in every possible h into U(h) and picking based on (the partition function derived from) that, then of course you need U(h) to be adversarially robust! I disagree with that as a model of how planning works or should work. IMO, not only is “plug every possible h into U(h)” extremely computationally infeasible, but even if it were feasible it would be a foreseeably-broken (because fragile) planning strategy.
Quote from a comment of TurnTrout about argmax planning, though I think it also applies to ThermodynamicBot, since that just does a softened version of argmax planning (converging to argmax planning as T->0):
if you design an AI that doesn’t argmax (or do its best to approximate argmax), and someone else builds one that does, your AI will lose a fair fight (e.g., economic competition starting from equal capabilities and resource endowments), so it would be nice if alignment doesn’t mean giving up argmax.
This seems exactly backwards to me. Argmax violates the non-adversarial principle and wastes computation. Argmax requires you to spend effort hardening your own utility function against the effort you’re also expending searching across all possible inputs to your utility function (including the adversarial inputs!). For example, if I argmaxed over my own plan-evaluations, I’d have to consider the most terrifying-to-me basilisks possible, and rate none of them unusually highly. I’d have to spend effort hardening my own ability to evaluate plans, in order to safely consider those possibilities.
It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren’t argmaxing. You’re using resources effectively.
I think the sorts of planning methods that try to approximate in the real world the behavior of “think about all possible plans and pick a good one” are unworkable in the limit, not just from an alignment standpoint but also from a practical capability standpoint, so I don’t expect us to build competent agents that use them, so I don’t worry about them or their attendant need for adversarial robustness.
I agree with your predictions for what would happen here. ThermodynamicBot isn’t really a GAN, though. If we tried to train a GAN to sub-in for ThermodynamicBot’s world model, then we could do it two ways (2nd way is most similar to your proposal). In both cases, the generator produces candidate histories, h.
Right, I wasn’t thinking of it as actually a GAN, just giving an analogy where similar causal patterns are in play, to make my point clearer. But yeah, if we wanted to actually use a GAN, your suggestions sound reasonable.
Comments on PolicyGradientBot & General Position
So, the standard criticism of policy gradient is that it’s noisy and doesn’t really allow for credit assignment. In particular, I think the lack of credit assignment is really a crucial flaw that will prevent policy gradient from ever being used to create an AGI. As far as I know, no agent powered purely by policy gradient does anything particularly impressive, though I could be wrong about that. Do you have any arguments / ideas for how policy gradient could be made to work here?
I guess, but I’m confused why we’re talking about competitiveness all of a sudden. I mean, variants on policy gradient algorithms (PPO, Actor-Critic, etc.) do some impressive things (at least to the extent any RL algorithms currently do impressive things). And I can imagine more sophisticated versions of even plain policy gradient that would do impressive things, if the action space includes mental actions like “sample a rollout from my world model”.
But that’s neither here nor there IMO. In the previous comment, I tried to be clear that it makes ~no difference to me where the gradients come from, when I said:
Could be from rewards or other “external” feedback, could be from TD/bootstrapped errors, could be from an imitation loss or something else.
Because I think that the outline of my argument doesn’t depend on one specific RL algorithm.
Just to clarify, my model doesn’t predict that the AI will use introspection on its own value function, or even look at its own source code at all. Some AIs may be designed to do that, but it’s not required for the failure case I’m considering. If gradients are flowing backwards from the value function to the actor (Bot-specific statement alert: Actor-Critic bots) then the actor has probably absorbed a significant amount of information about the value function’s misalignment. I don’t think it needs to take any additional steps of inspecting the agent’s own weights, or anything like that. You may object that gradients need not necessarily flow backwards there. After all, policy gradient and temporal difference learning are a thing. Let’s join the thread of discussion where you make that objection onto the PolicyGradientBot discussion.
Consider ActorCriticBot. I would make the same argument for it as I did in my previous comment, that the actor is optimized by the critic’s outputs but that the actor is not optimizing for critic outputs. It does not particularly matter that Critic("I made counterfeit money") >= Critic("I got money for doing the task") or that Critic("Extremely out-of-distribution state that Critic happens to evaluate super highly") >> Critic("I got money for doing the task"), because the actor in actor-critic doesn’t make its decisions by running an internal CriticEstimator(plan) and doing whatever evaluates best. It makes its decisions by looking at the current state and weighing the decision-factors triggered by that state; decision-factors that were internalized because they were upstream of actual positive feedback it received in the past from the environment+critic.
In the training distribution, the agent never reached a state like "I made counterfeit money", so the critic never gave feedback from that “misaligned” portion of the state space, so the actor never got gradients from the critic that differentially upweighted the actor’s concern for counterfeit money-specific factors, so the actor never internalized the particular antecedents of counterfeit money as motivating, so the actor doesn’t factor them into its decision-making. Whereas the actor does factor things like “Am I doing the task yet?” into its decisions, because the actor internalized those as antecedents of reward/value, because the actor got gradients from the critic that differentially upweighted the agent’s concern for those factors, because those were part of the state space covered by the training distribution, because the agent actually reached states where it got paid for e.g. task-completion.
IMO, not only is “plug every possible h into U(h)” extremely computationally infeasible
To be clear, I’m not saying ThermodynamicBot does the computation the slow exponential way. I already explained how it could be done in polynomial time, at least for a world model that looks like a factor graph that’s a tree. Call this ThermodynamicBot-F. You could also imagine the role of “world model” being filled by a neural network (blob of weights) that approximates the full thermodynamic computation. We can call this ThermodynamicBot-N.
Yes, I understand that running a search that will kill you if it succeeds is dumb. This has been known for many years. The question is how do we actually write a program to do a sane search? You quote TurnTrout:
It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren’t argmaxing. You’re using resources effectively.
I don’t find this particularly helpful. If we know which plans are adversarial so we can eliminate them from the search space, we’re already half way to solving alignment. I don’t think the plans a bounded agent is going to eliminate so that it can finish its thinking on time are automatically going to be the adversarial ones. I think this is a problem that is going to take actual effort.
In particular for ThermodynamicBot:
Case where the world model is implemented in a factor graph (ThermodynamicBot-F): This gives exactly the same result as searching across all inputs, but the computation is efficient, and not really wasteful in any sense. If we imagine trying to “improve” the belief propagation algorithm to simultaneously make it more efficient and also remove some subset of plans it’s searching over that are “adversarial”, I can’t really imagine a way to do that, and it would certainly make the algorithm more complicated and less elegant.
Case where a neural network world model is being used (ThermodynamicBot-N): In this case there are likely plans that will be missed by ThermodynamicBot-N because of the bounded nature of its world model, even though they would be found by searching across all inputs. But if we imagine training the world model to make it better, I would generally expect this to increase the world model’s ability to find adversarial plans just like it increases its ability to find good plans. In general, I don’t expect there to be any correlation where all the adversarial plans happen to be eliminated due to bounded reasoning. Why should we be so lucky that all the errors we’re making happen to cancel each other out?
I think the sorts of planning methods that try to approximate in the real world the behavior of “think about all possible plans and pick a good one” are unworkable in the limit, not just from an alignment standpoint but also from a practical capability standpoint, so I don’t expect us to build competent agents that use them, so I don’t worry about them or their attendant need for adversarial robustness.
I agree if we’re literally talking about brute force search here. If we’re talking about the more realistic ThermodynamicBot designs I’ve mentioned, then I’m not sure I agree. In some sense, all methods an agent could use to plan are “picking plans from plan-space that are better than most other plans”. Even ActorCriticBot is “trying” to approximate argmax. If we could train it to minimal loss, it would be an ArgMaxBot. Is there some particular approximation or heuristic that we can adopt, where if we do adopt it we go from dangerously approaching ArgMaxBot to safely searching through only good plans? An approximation used by ActorCriticBot, but not by ThermodynamicBot-N? If so, I have no idea what the crucial approximation is that you could be thinking of.
I also don’t think it’s at all obvious that ThermodynamicBot designs are necessarily capability-limited. It makes a lot of sense to integrate planning very closely with the world model. Might be worth betting on the direction of future RL research here if we can set sufficiently objective resolution criteria? In any case, I do think this counts as some progress in this discussion, since we’ve found an example of an agent that we both agree your argument doesn’t apply to.
Comments on PolicyGradientBot vs ActorCriticBot
In my view, there’s kind of a huge gulf between PolicyGradientBot and ActorCriticBot, where the gradients flowing backwards into ActorCriticBot’s actor end up carrying a lot of information. This allows for much better performance, and in particular much better sample efficiency, at the cost that some of the information is about weaknesses in ActorCriticBot’s critic.
To take a particular example, if the critic overvalues blue diamonds, then gradients flowing into the actor are going to be steeper for actions that obtain blue diamonds. Then in a new environment where there’s a bucket of blue paint sitting in the corner, it seems reasonable to expect that the actor might try to use that bucket to paint diamonds blue, at least assuming it’s sufficiently intelligent and flexible.
For PolicyGradientBot on the other hand, while it could still result in alignment failures, it seems much more like we’re just directly training a policy. But PolicyGradientBot is very slow when it comes to sample efficiency.
WRT other algorithms like temporal difference learning that lie kind of in between PolicyGradientBot and ActorCriticBot, I think the question of what happens for ActorCriticBot is already a crux in this discussion, but feel free to add more bot types if you think it would be useful.
Is ActorCriticBot robust?
the actor in actor-critic doesn’t make its decisions by running an internal CriticEstimator(plan) and doing whatever evaluates best
Again, I’m not saying a brute force search over plans is being done here, but I’d generally expect that what the actor is doing is very strongly linked to what the critic values, and I’d say it’s very likely that the actor has lots of components inside of it roughly related to the question “what is the critic going to think about this situation?” For example, if the critic consistently overvalues blue, then I’d predict that the actor has lots of circuits inside of it related to blueness. Do you disagree with this?
Obviously the actor’s ideas of what’s good aren’t going to be perfectly faithful to the critic: There will exist some adversarial plans that the actor just isn’t going to generate, but again the question is: Why should we be so lucky that the errors we’re making exactly cancel out? I don’t see any reason to expect that the actor’s imperfect approximation of the critic and critic’s imperfect approximation of our true desires should cancel out so well that the actor never generates any adversarial plans at all.
Cool, thanks for the reply, sounds like maybe a combination of 3a and the aspect of 1 where the shard points to a part of the world model? If no part of the agent is having its weights tuned to choose plans that make a shard happy, where would you say a shard mostly lives in an agent? World model? Somewhere else? Spread across multiple components? (At the bottom of this comment, I propose a different agent architecture that we can use to discuss this that I think fairly naturally matches the way you’ve been talking about shards.)
My model doesn’t predict that most agents will try to execute “fool myself into thinking I have a million bucks” style plans. If you think my model predicts that, then maybe this is an opportunity to make progress?
In my model, the agent is actually allowed to care about actual world states and not just its own internal activations. Consider two ways the agent could “fool” itself into thinking it had a million bucks:
Firstly, it could tamper with its own mind to create that impression. This could be either through hacking, or carefully spoofing its own sensory inputs. When planning, the predicted future states of the world are going to correctly show that the agent is fooling itself. So when the value function is fed the predicted world states, it’s going to rate them as bad, since in those world states, the agent does not have a million bucks. It doesn’t matter to the agent that later, after being hacked, the value function will think the agent has a million bucks. Right now, during planning, the value function isn’t fooled.
Secondly, it could create a million counterfeit bucks. Due to inaccuracies in training, maybe the agent actually thinks that having counterfeit money is just as good as real money. I.e. the value function does actually rate counterfeit bucks higher than real bucks. If so, then the agent is going to be perfectly satisfied with itself for coming up with this clever idea for satisfying its true values. The humans who were training the agent and wanted a million actual dollars won’t be satisfied, but that’s their problem, not the agent’s problem.
Okay, so here’s another possible agent design that we might be able to discuss: There’s a detailed world model that can answer various probabilistic queries about the future. Planning is done by formulating a query to the world model in the following way: Sample histories h according to:
$$P(h) = \frac{\exp(U(h)/T)}{\sum_{h'} \exp(U(h')/T)}$$
Where U(h) is utility assigned by the agent to a given history, and T is some small number which is analogous to temperature. Histories include the action taken by the agent, so we can sample with the action undetermined, and then actually take whichever action happened in the sampled history. (For simplicity, I’m assuming that the agent only takes one action per episode.)
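To make the sampling rule concrete, here is a minimal sketch that enumerates a tiny set of histories and samples one in proportion to exp(U(h)/T). The (action, outcome) histories and their utilities are made up for illustration, and enumerating every history is only feasible because the toy set is tiny; it is not a claim about how a real world model would answer the query.

```python
import math
import random

def boltzmann_probs(utilities, temperature):
    """Return P(h) = exp(U(h)/T) / sum_h' exp(U(h')/T) for each history h."""
    scaled = {h: u / temperature for h, u in utilities.items()}
    # Subtract the max before exponentiating for numerical stability;
    # this cancels in the ratio and leaves P(h) unchanged.
    m = max(scaled.values())
    weights = {h: math.exp(s - m) for h, s in scaled.items()}
    z = sum(weights.values())
    return {h: w / z for h, w in weights.items()}

def sample_action(utilities, temperature, rng):
    """Sample a whole history from P(h), then return the action it contains."""
    probs = boltzmann_probs(utilities, temperature)
    histories = list(probs)
    h = rng.choices(histories, weights=[probs[x] for x in histories], k=1)[0]
    action, _outcome = h
    return action

# Hypothetical toy histories: (action, outcome) pairs with invented utilities.
toy_utilities = {
    ("work", "paid"): 1.0,
    ("work", "unpaid"): 0.0,
    ("idle", "nothing"): 0.0,
}
probs_warm = boltzmann_probs(toy_utilities, temperature=1.0)
probs_cold = boltzmann_probs(toy_utilities, temperature=0.05)
chosen_action = sample_action(toy_utilities, 0.05, random.Random(0))
```

At T = 1 the best history gets a bit over half the probability mass; at T = 0.05 it gets nearly all of it, which is the sense in which small T approaches argmax.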
To make this a “shardy” agent, we’d presumably have to replace U with something made out of shards. As long as it shows up like that in the equation for P(h), though, it looks like we’d have to make U adversarially robust. I’d be interested in what other modifications you’d want to make to this agent in order to make it properly shardy. This agent design does have the advantage that it seems like values are more directly related to the world model, since they’re able to directly examine h.
I’d say the shards live in the policy, basically, though these are all leaky abstractions. A simple model would be an agent consisting of a big recurrent net (RNN or transformer variant) that takes in observations and outputs predictions through a prediction head and actions through an action head, where there’s a softmax to sample the next action (whether an external action or an internal action). The shards would then be circuits that send outputs into the action head.
I brought this up because I interpreted your previous comment as expressing skepticism that
Whereas I think that it will be true for analogous reasons as the reasons that explain why no part of the agent is “trying to make itself believe it has a million bucks”.
I have a vague feeling that the “value function map = agent’s true values” bit of this is part of the crux we’re disagreeing about.
Putting that aside, for this to happen, it has to be simultaneously true that the agent’s world model knows about and thinks about counterfeit money in particular (or else it won’t be able to construct viable plans that produce counterfeit money) while its value function does not know or think about counterfeit money in particular. It also has to be true that the agent tends to generate plans towards counterfeit money over plans towards real money, or else it’ll pick a real money plan it generates before it has had a chance to entertain a counterfeit money plan.
But during training, the plan generator was trained to generate plans that lead to real money. And the agent’s world model / plan generator knows (or at least thinks) that those plans were towards real money, even if its value function doesn’t know. This is because it takes different steps to acquire counterfeit money than to acquire real money. If the plan generator was optimized based on the training environment, and the agent was rewarded there for doing the sorts of things that lead to acquiring real money (which are different from the things that lead to counterfeit money), then those are the sorts of plans it will tend to generate. So why are we hypothesizing that the agent will tend to produce and choose the kinds of counterfeit money plans its value function would “fail” (from our perspective) on after training?
IMO an agent can’t work this way, at least not an embedded one. It would need to know what utility it would assign to each history in the distribution in order to sample proportional to the exponential of that history’s utility. But before it has sampled that history, it has not evaluated that history yet, so it doesn’t yet know what utility it assigns, so it can’t sample from such a distribution.
At best, the agent could sample from a learned history generator, one tuned on previous good histories, and then evaluate some number of possibilities from that distribution, picking one that’s good according to its evaluation. But that doesn’t require adversarial robustness, because the history generator will tend strongly to generate possibilities like the ones that evaluated/worked well in the past, which is exactly where the evaluations will tend to be fairly accurate. And the better the generator, the fewer options need to be sampled, so the less you’re “implicitly optimizing against” the evaluations.
Thanks for describing this. Technical question about this design: How are you getting the gradients that feed backwards into the action head? I assume it’s not supervised learning where it’s just trying to predict which action a human would take?
I’m aware of the issue of embedded agency, I just didn’t think it was relevant here. In this case, we can just assume that the world looks fairly Cartesian to the agent. The agent makes one decision (though possibly one from an exponentially large decision space) then shuts down and loses all its state. The record of the agent’s decision process in the future history of the world just shows up as thermal noise, and it’s unreasonable to expect the agent’s world model to account for thermal noise as anything other than a random variable. As a Cartesian hack, we can specify a probability distribution over actions for the world model to use when sampling. So for our particular query, we can specify a uniform distribution across all actions. Then in reality, the actual distribution over actions will be biased towards certain actions over others because they’re likely to result in higher utility.
This seems to be phrased in a weird way that rules out creative thinking. Nikola Tesla didn’t have three-phase motors in his world-model before he invented them, but he was able to come up with them (his mind was able to generate a “three-phase motor” plan) anyways. The key thing isn’t having a certain concept already existing in your world model because of prior experience. The requirement is just that the world model is able to reason about the thing. Nikola Tesla knew enough E&M to reason about three-phase motors, and I expect that smart AIs will have world models that can easily reason about counterfeit money.
The job of a value function isn’t to know or think about things. It just gives either big numbers or small numbers when fed certain world states. The value function in question here gives a big number when you feed it a world state containing lots of counterfeit money. Does this mean it knows about counterfeit money? Maybe, but it doesn’t really matter.
A more relevant question is whether the plan-proposer knows the value function well enough that it knows the value function will give it points for suggesting the plan of producing counterfeit money. One could say probably yes, since it’s been getting all its gradients directly from the value function, or probably no, since there were no examples with counterfeit money in the training data. I’d say yes, but I’m guessing that this issue isn’t your crux, so I’ll only elaborate if asked.
This sounds like you’re talking about a dumb agent; smart agents generate and compare multiple plans and don’t just go with the first plan they think of.
Generalization. For a general agent, thinking about counterfeit money plans isn’t that much different than thinking of plans to make money. Like if the agent is able to think of money making plans like starting a restaurant, or working as a programmer, or opening a printing shop, then it should also be able to think of a counterfeit money making plan. (That plan is probably quite similar to the printing shop plan.)
Could be from rewards or other “external” feedback, could be from TD/bootstrapped errors, could be from an imitation loss or something else. The base case is probably just plain ol’ rewards that get backpropagated through the action head via policy gradients.
Sorry for being unclear, I think you’re talking about a different dimension of embeddedness than what I was pointing at. I was talking about the issue of logical uncertainty: that the agent needs to actually run computation in order to figure out certain things. The agent can’t magically sample from P(h) proportional to exp(U(h)), because it needs the exp(U(h')) of all the other histories first in order to weigh the distribution that way, which requires having already sampled h' and having already calculated U(h'). But we are talking about how it samples a history h in the first place! The “At best” comment was proposing an alternative that might work, where the agent samples from a prior that’s been tuned based on U(h). Notice, though, that “our sampling is biased towards certain histories over others because they resulted in higher utility” does not imply “if a history would result in higher utility, then our sampling will bias towards it”.
Consider a parallel situation: sampling images and picking one that gets the highest score on a non-robust face classifier. If we were able to sample from the distribution of images proportional to their (exp) face classifier scores, then we would need to worry a lot about picking an image that’s an adversarial example to our face classifier, because those can have absurdly high scores. But instead we need to sample images from the prior of a generative model like FFHQ StyleGAN2-ADA or DDPM, and score those images. A generative model like that will tend strongly to convert whatever input entropy you give it into a natural-looking image, so we can sample & filter from it a ton without worrying much about adversarial robustness. Even if you sample 10K images and pick the 5 with the highest face classifier scores, I am betting that the resulting images will still really be of faces, rather than classifier-adversarial examples. It is true that “our top-5 images are biased towards some images over others because they had higher face classification scores” but it is not true that “if adversarial examples would have higher face classification scores than regular images, then our sampling will bias towards adversarial examples.”
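The contrast in the face-classifier example can be sketched with exact arithmetic on a toy discrete space. Everything numeric here is an assumption made up for illustration: 1,000 “natural” items with scores between 0 and 5, a single adversarial item with score 50, and a generator prior that puts 1e-9 of its mass on the adversarial item. The conclusion depends heavily on that last number, which is exactly the quantity under dispute.

```python
import math

def boltzmann_mass_on_adversarial(natural_scores, adversarial_score):
    """Mass on the adversarial item when sampling the WHOLE space
    proportional to exp(score), with a uniform base measure."""
    m = max(max(natural_scores), adversarial_score)  # for numerical stability
    adv = math.exp(adversarial_score - m)
    nat = sum(math.exp(s - m) for s in natural_scores)
    return adv / (adv + nat)

def prob_top_pick_is_adversarial(prior_mass_on_adversarial, n_samples):
    """Sample-and-filter: chance that at least one of n prior samples is the
    adversarial item (it would then win the ranking, having the top score).
    Computed as 1 - (1 - p)^n in a numerically careful way."""
    return -math.expm1(n_samples * math.log1p(-prior_mass_on_adversarial))

# Hypothetical numbers, chosen only to illustrate the asymmetry.
natural_scores = [5.0 * i / 999 for i in range(1000)]
p_boltzmann = boltzmann_mass_on_adversarial(natural_scores, adversarial_score=50.0)
p_filtered = prob_top_pick_is_adversarial(1e-9, n_samples=10_000)
```

Under these assumptions, score-proportional sampling lands on the adversarial item almost surely, while picking the best of 10K prior samples does so with probability around 1e-5. A generator whose prior puts much more mass on adversarial outputs would close that gap.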
I disagree with this? The job of the value function is to look at the agent state (what the agent knows about the world) and estimate what the return will be based on that state & its implications. This involves “knowing things” and “thinking about things”. If the agent has a model T(s, a, s') that represents some state feature like “working a job” or “building a money-printer”, then that feature should exist in the state passed to the value function V(s) / Q(s, a) as well, such that the value function will “know” about it and incorporate it into its estimation process.
But it sounds like in the imagined scenario, the agent’s model and policy are sensitive to a bunch of stuff that the value function is blind to, which makes this configuration seem quite weird to me. If the value function was not blind to those features, then as the model goes from the training environment, where it got returns based on “getting paid money for tasks” (or whatever we rewarded for there) to the deployment environment, where the action space is even bigger, both the model/policy and the value function “generalize” and learn motivationally-relevant new facts that inform it what “getting paid money for tasks” looks like.
The plan-proposer isn’t trying to make the value function give it max points, it’s trying to [plan towards the current subgoal], which in this case is “make a million bucks”. The plan-proposer gets gradients/feedback from the value function, in the form of TD updates that tell it “this thought was current-subgoal-better/worse than expected upon consideration, do more/less of that in mental contexts like this”. But a thought like “Inspect my own value function to see what it’d give me positive TD updates for” evaluates as current-subgoal-worse for the subgoal of “make a million bucks” than a more actually-subgoal-relevant thought like “Think about what tasks are available nearby”, so the advantage is negative, which means a negative TD update, causing that thought to be progressively abandoned.
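A rough sketch of the mechanism described here, assuming a softmax policy over two candidate “thoughts” and a one-step TD advantage. The thought names, value estimates, and learning rate are all invented for illustration; the only point is the direction of the update when the advantage is negative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def td_advantage(reward, value_next, value_now, gamma=1.0):
    """One-step TD error: r + gamma * V(s') - V(s)."""
    return reward + gamma * value_next - value_now

def policy_gradient_step(logits, chosen, advantage, lr=0.5):
    """logits += lr * advantage * d/dlogits log pi(chosen), for a softmax policy."""
    probs = softmax(logits)
    return [
        x + lr * advantage * ((1.0 if i == chosen else 0.0) - p)
        for i, (x, p) in enumerate(zip(logits, probs))
    ]

# Hypothetical mental actions and made-up value estimates for the current subgoal.
thoughts = ["inspect my own value function", "think about nearby tasks"]
logits = [0.0, 0.0]

# The agent tried thought 0 and it turned out worse than expected for the subgoal.
advantage = td_advantage(reward=0.0, value_next=0.2, value_now=0.5)
new_logits = policy_gradient_step(logits, chosen=0, advantage=advantage)
old_probs, new_probs = softmax(logits), softmax(new_logits)
```

With a negative advantage, the probability of the chosen thought drops below its starting 0.5 and the other thought's probability rises by the same amount, so repeated updates of this kind progressively abandon it.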
Once again, I agree that the agent is smart, and generalizes to be able to think of such a plan, and even plans much more sophisticated than that. I am arguing that it won’t particularly want to produce/execute those plans, unless its motivations & other cognition is biased in a very specific way.
Thanks for the reply. Just to prevent us from spinning our wheels too much, I’m going to start labelling specific agent designs, since it seems like some talking-past-each-other may be happening where we’re thinking of agents that work in different ways when making our points.
PolicyGradientBot: Defined by the following description:
ThermodynamicBot: Defined by the following description:
$$P(h) = \frac{\exp(U(h)/T)}{\sum_{h'} \exp(U(h')/T)}$$
Comments on ThermodynamicBot
This bot is of course a bounded agent, and so the world model can’t be perfect, but consider the following steps:
For each possible h, compute U(h) and exp(U(h)/T).
Compute the sum $Z = \sum_{h'} \exp(U(h')/T)$.
Now we know the probability for any given history h: It’s exp(U(h)/T)/Z.
This is a finite sequence of computational steps that terminates without self-reference, so no logical induction is needed here. Now you may fairly object that there is still an issue of computational complexity. The space of histories is exponentially large, so in practice the computation couldn’t be completed in time. This is the known-to-be-hard problem of computing the partition function. But the problem is tractable in many special cases, and humans get by well enough in our own reasoning about a world full of combinatorial explosions. We can suppose that, at the cost of making itself even more of an approximation, the world model has a way to efficiently sample from the distribution, even given the difficulty of computing Z. To take one particular concrete way this could be implemented, if the world model is a factor graph with few or no loops, then we can do the computation by adding on a few factors to account for exp(U(h)/T) and then using belief propagation to solve it.
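As a small illustration of the tractable special case, the sketch below computes Z two ways for a chain-structured utility: by enumerating every history, and by passing forward messages along the chain (sum-product on a loop-free factor graph). The chain structure, the unary and pairwise terms, and all the numbers are assumptions for the example; a real world model would also contribute its own factors, which are left out here.

```python
import itertools
import math

def utility(h, unary, pairwise):
    """U(h) for a chain: per-step terms plus terms on adjacent pairs of states."""
    total = sum(unary[t][s] for t, s in enumerate(h))
    total += sum(pairwise[a][b] for a, b in zip(h, h[1:]))
    return total

def partition_brute_force(n_steps, n_states, unary, pairwise, temperature):
    """Z by enumerating all n_states ** n_steps histories."""
    return sum(
        math.exp(utility(h, unary, pairwise) / temperature)
        for h in itertools.product(range(n_states), repeat=n_steps)
    )

def partition_message_passing(n_steps, n_states, unary, pairwise, temperature):
    """Same Z via forward messages: O(n_steps * n_states ** 2) work."""
    # msg[b] = total weight of all partial histories that end in state b.
    msg = [math.exp(unary[0][s] / temperature) for s in range(n_states)]
    for t in range(1, n_steps):
        msg = [
            math.exp(unary[t][b] / temperature)
            * sum(msg[a] * math.exp(pairwise[a][b] / temperature)
                  for a in range(n_states))
            for b in range(n_states)
        ]
    return sum(msg)

# Hypothetical toy problem: 4 time steps, 3 states per step.
n_steps, n_states, temperature = 4, 3, 0.5
unary = [[0.5 * ((t + s) % 3) for s in range(n_states)] for t in range(n_steps)]
pairwise = [[1.0 if a == b else 0.0 for b in range(n_states)]
            for a in range(n_states)]
z_slow = partition_brute_force(n_steps, n_states, unary, pairwise, temperature)
z_fast = partition_message_passing(n_steps, n_states, unary, pairwise, temperature)
```

Both routes give the same Z, so the efficient version is still implicitly summing over every history, including any that an imperfect U happens to rate highly.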
I agree with your predictions for what would happen here. ThermodynamicBot isn’t really a GAN, though. If we tried to train a GAN to sub-in for ThermodynamicBot’s world model, then we could do it two ways (2nd way is most similar to your proposal). In both cases, the generator produces candidate histories, h.
Discriminator tries to predict what’s real (as usual), but generator rewards are shifted by U(h)/T to encourage it to produce outputs that are properly utility-weighted. (For simplicity, we can suppose that U(h) is differentiable so that gradients can flow back through it to the generator.) I expect this results in the generator producing adversarial examples if U is not adversarially robust.
Discriminator tries to predict what’s real and generator rewards are not shifted. Instead, we sample 10k histories from the generator and weight them all by utility. Or even just go full argmax and choose the history with the highest utility. 10k is not that large; it gives the agent about 13 bits of optimization power (log2 of 10,000 is roughly 13.3). In general, if it takes n bits of optimization power on the latent vector to convince the generator to construct an adversarial example, then we can safely use this scheme to provide slightly less than n bits of useful optimization. But we could do that multiple times, so maybe it’s a bit more useful than that. Also, maybe n is super large so it’s fine in practice?
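The second scheme can be sketched as plain best-of-n selection. The generator and utility below are placeholders (a uniform random number and the identity function), and counting selection pressure as log2(n) bits is the usual convention being assumed here; under it, 10,000 samples is about 13.3 bits, close to the rough figure quoted above.

```python
import math
import random

def best_of_n(generator, utility, n, rng):
    """Draw n candidates from the generator and keep the highest-utility one."""
    candidates = [generator(rng) for _ in range(n)]
    return max(candidates, key=utility)

def selection_bits(n):
    """Bits of selection pressure from picking 1 of n samples: log2(n)."""
    return math.log2(n)

# Placeholder generator and utility, purely for illustration.
pick = best_of_n(lambda rng: rng.random(), lambda h: h, 100, random.Random(0))
bits_for_10k = selection_bits(10_000)
```

Doubling the number of samples adds only one more bit, which is why this kind of filtering applies much weaker optimization pressure than shifting the generator's rewards directly.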
Comments on PolicyGradientBot
So, the standard criticism of policy gradient is that it’s noisy and doesn’t really allow for credit assignment. In particular, I think the lack of credit assignment is really a crucial flaw that will prevent policy gradient from ever being used to create an AGI. As far as I know, no agent powered purely by policy gradient does anything particularly impressive, though I could be wrong about that. Do you have any arguments / ideas for how policy gradient could be made to work here?
General Clarifications of my Position
Just to clarify, my model doesn’t predict that the AI will use introspection on its own value function, or even look at its own source code at all. Some AIs may be designed to do that, but it’s not required for the failure case I’m considering. If gradients are flowing backwards from the value function to the actor (Bot-specific statement alert: Actor-Critic bots) then the actor has probably absorbed a significant amount of information about the value function’s misalignment. I don’t think it needs to take any additional steps of inspecting the agent’s own weights, or anything like that. You may object that gradients need not necessarily flow backwards there. After all, policy gradient and temporal difference learning are a thing. Let’s join the thread of discussion where you make that objection onto the PolicyGradientBot discussion.
Also, my model doesn’t predict that the agent’s subgoal reasoning will suddenly go off the rails and fail to achieve the subgoal in question because the agent was too busy thinking about how to counterfeit money. If you’re factoring the agent’s reasoning into subgoals, you have to remember that there’s a factor that actually sets the subgoals, and that’s where there’s the potential to go off the rails and start considering subgoals like “print lots of counterfeit money”. Obviously in the context where the agent is already considering the “get paid money for performing tasks” subgoal, the agent’s reasoning isn’t going to get screwed up.
Sounds good.
Comments on ThermodynamicBot
If we assume that the agent is making decisions by (approximately) plugging in every possible h into U(h) and picking based on (the partition function derived from) that, then of course you need U(h) to be adversarially robust! I disagree with that as a model of how planning works or should work. IMO, not only is “plug every possible h into U(h)” extremely computationally infeasible, but even if it were feasible it would be a foreseeably-broken (because fragile) planning strategy.
Quote from a comment of TurnTrout about argmax planning, though I think it also applies to ThermodynamicBot, since that just does a softened version of argmax planning (converging to argmax planning as T->0):
I think the sorts of planning methods that try to approximate in the real world the behavior of “think about all possible plans and pick a good one” are unworkable in the limit, not just from an alignment standpoint but also from a practical capability standpoint, so I don’t expect us to build competent agents that use them, so I don’t worry about them or their attendant need for adversarial robustness.
Right, I wasn’t thinking of it as actually a GAN, just giving an analogy where similar causal patterns are in play, to make my point clearer. But yeah, if we wanted to actually use a GAN, your suggestions sound reasonable.
Comments on PolicyGradientBot & General Position
I guess, but I’m confused why we’re talking about competitiveness all of a sudden. I mean, variants on policy gradient algorithms (PPO, Actor-Critic, etc.) do some impressive things (at least to the extent any RL algorithms currently do impressive things). And I can imagine more sophisticated versions of even plain policy gradient that would do impressive things, if the action space includes mental actions like “sample a rollout from my world model”.
But that’s neither here nor there IMO. In the previous comment, I tried to be clear that it makes ~no difference to me where the gradients come from, when I said:
Because I think that the outline of my argument doesn’t depend on one specific RL algorithm.
Consider ActorCriticBot. I would make the same argument for it as I did in my previous comment, that the actor is optimized by the critic’s outputs but that the actor is not optimizing for critic outputs. It does not particularly matter that Critic("I made counterfeit money") >= Critic("I got money for doing the task") or that Critic("Extremely out-of-distribution state that Critic happens to evaluate super highly") >> Critic("I got money for doing the task"), because the actor in actor-critic doesn’t make its decisions by running an internal CriticEstimator(plan) and doing whatever evaluates best. It makes its decisions by looking at the current state and weighing the decision-factors triggered by that state; decision-factors that were internalized because they were upstream of actual positive feedback it received in the past from the environment+critic.
In the training distribution, the agent never reached a state like "I made counterfeit money", so the critic never gave feedback from that “misaligned” portion of the state space, so the actor never got gradients from the critic that differentially upweighted the actor’s concern for counterfeit money-specific factors, so the actor never internalized the particular antecedents of counterfeit money as motivating, so the actor doesn’t factor them into its decision-making. Whereas the actor does factor things like “Am I doing the task yet?” into its decisions, because the actor internalized those as antecedents of reward/value, because the actor got gradients from the critic that differentially upweighted the agent’s concern for those factors, because those were part of the state space covered by the training distribution, because the agent actually reached states where it got paid for e.g. task-completion.
Comments on ThermodynamicBot
To be clear, I’m not saying ThermodynamicBot does the computation the slow exponential way. I already explained how it could be done in polynomial time, at least for a world model that looks like a factor graph that’s a tree. Call this ThermodynamicBot-F. You could also imagine the role of “world model” being filled by a neural network (blob of weights) that approximates the full thermodynamic computation. We can call this ThermodynamicBot-N.
Yes, I understand that running a search that will kill you if it succeeds is dumb. This has been known for many years. The question is how do we actually write a program to do a sane search? You quote TurnTrout:
I don’t find this particularly helpful. If we know which plans are adversarial so we can eliminate them from the search space, we’re already halfway to solving alignment. I don’t think the plans a bounded agent is going to eliminate so that it can finish its thinking on time are automatically going to be the adversarial ones. I think this is a problem that is going to take actual effort.
In particular for ThermodynamicBot:
Case where the world model is implemented in a factor graph (ThermodynamicBot-F): This gives exactly the same result as searching across all inputs, but the computation is efficient, and not really wasteful in any sense. If we imagine trying to “improve” the belief propagation algorithm to simultaneously make it more efficient and also remove some subset of plans it’s searching over that are “adversarial”, I can’t really imagine a way to do that, and it would certainly make the algorithm more complicated and less elegant.
Case where a neural network world model is being used (ThermodynamicBot-N): In this case there are likely plans that will be missed by ThermodynamicBot-N because of the bounded nature of its world model, even though they would be found by searching across all inputs. But if we imagine training the world model to make it better, I would generally expect this to increase the world model’s ability to find adversarial plans just like it increases its ability to find good plans. In general, I don’t expect there to be any correlation where all the adversarial plans happen to be eliminated due to bounded reasoning. Why should we be so lucky that all the errors we’re making happen to cancel each other out?
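For the factor-graph case (ThermodynamicBot-F), the claim that the efficient computation gives exactly the same result as searching across all inputs can be checked on a toy example. The sketch below assumes the simplest tree, a chain, where U(h) is a sum of per-step terms plus terms coupling adjacent steps; the `unary` and `pairwise` tables are made-up numbers, and the forward message pass is a standard sum-product computation rather than anything specific to this thread.

```python
import itertools
import math

def utility(plan, unary, pairwise):
    """U(h) for a chain: per-step terms plus terms coupling adjacent steps."""
    total = sum(unary[i][a] for i, a in enumerate(plan))
    total += sum(pairwise[i][plan[i]][plan[i + 1]] for i in range(len(plan) - 1))
    return total

def partition_brute_force(unary, pairwise, temperature):
    """Sum exp(U(h) / T) over every plan: exponential in the number of steps."""
    n_actions = len(unary[0])
    plans = itertools.product(range(n_actions), repeat=len(unary))
    return sum(math.exp(utility(h, unary, pairwise) / temperature) for h in plans)

def partition_sum_product(unary, pairwise, temperature):
    """Same quantity via a forward message pass: O(steps * actions^2)."""
    n_actions = len(unary[0])
    # msg[b] = total weight of all partial plans that end in action b.
    msg = [math.exp(unary[0][a] / temperature) for a in range(n_actions)]
    for i in range(1, len(unary)):
        msg = [
            math.exp(unary[i][b] / temperature)
            * sum(msg[a] * math.exp(pairwise[i - 1][a][b] / temperature)
                  for a in range(n_actions))
            for b in range(n_actions)
        ]
    return sum(msg)

# Hypothetical 3-step, 2-action problem.
unary = [[0.0, 1.0], [0.5, 0.0], [0.0, 2.0]]
pairwise = [[[0.0, 0.3], [0.3, 0.0]], [[0.0, -0.5], [1.0, 0.0]]]

z_slow = partition_brute_force(unary, pairwise, temperature=1.0)
z_fast = partition_sum_product(unary, pairwise, temperature=1.0)
print(z_slow, z_fast)
```

The message pass never enumerates plans, yet every plan still contributes its full weight to the sum. Nothing is pruned, so any plan that U(h) rates highly, adversarial or not, is weighted exactly as it would be under brute-force search.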
I agree if we’re literally talking about brute force search here. If we’re talking about the more realistic ThermodynamicBot designs I’ve mentioned, then I’m not sure I agree. In some sense, all methods an agent could use to plan are “picking plans from plan-space that are better than most other plans”. Even ActorCriticBot is “trying” to approximate argmax. If we could train it to minimal loss, it would be an ArgMaxBot. Is there some particular approximation or heuristic that we can adopt, where if we do adopt it we go from dangerously approaching ArgMaxBot to safely searching through only good plans? An approximation used by ActorCriticBot, but not by ThermodynamicBot-N? If so, I have no idea what the crucial approximation is that you could be thinking of.
I also don’t think it’s at all obvious that ThermodynamicBot designs are necessarily capability-limited. It makes a lot of sense to integrate planning very closely with the world model. Might be worth betting on the direction of future RL research here if we can set sufficiently objective resolution criteria? In any case, I do think this counts as some progress in this discussion, since we’ve found an example of an agent that we both agree your argument doesn’t apply to.
Comments on PolicyGradientBot vs ActorCriticBot
In my view, there’s kind of a huge gulf between PolicyGradientBot and ActorCriticBot, where the gradients flowing backwards into ActorCriticBot’s actor end up carrying a lot of information. This allows for much better performance, and in particular much better sample efficiency, at the cost that some of the information is about weaknesses in ActorCriticBot’s critic.
To take a particular example, if the critic overvalues blue diamonds, then gradients flowing into the actor are going to be steeper for actions that obtain blue diamonds. Then in a new environment where there’s a bucket of blue paint sitting in the corner, it seems reasonable to expect that the actor might try to use that bucket to paint diamonds blue, at least assuming it’s sufficiently intelligent and flexible.
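The first half of this example, a critic bias showing up in the actor’s preferences, can be illustrated with a toy sketch. It assumes a two-action bandit, a softmax actor trained on the expected policy gradient, and a fixed critic; the action names, critic values, learning rate, and step count are all hypothetical. It says nothing about the second half (whether the actor would generalize to painting diamonds in a new environment).

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_actor(critic_values, steps=200, lr=0.5):
    """Train a softmax actor using only the critic's scores.

    critic_values[a] is the critic's estimate for action a. The actor never
    sees the true reward, only the advantage signal derived from the critic.
    """
    logits = [0.0] * len(critic_values)
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * q for p, q in zip(probs, critic_values))
        # Expected policy gradient for a softmax policy: pi(a) * advantage(a).
        grads = [p * (q - baseline) for p, q in zip(probs, critic_values)]
        logits = [x + lr * g for x, g in zip(logits, grads)]
    return softmax(logits)

ACTIONS = ["get_white_diamond", "get_blue_diamond"]
TRUE_REWARD = [1.0, 1.0]    # assumption: both diamonds are equally good in reality
BIASED_CRITIC = [1.0, 1.5]  # assumption: the critic overvalues blue diamonds

policy_true = train_actor(TRUE_REWARD)
policy_biased = train_actor(BIASED_CRITIC)
print(dict(zip(ACTIONS, policy_true)))
print(dict(zip(ACTIONS, policy_biased)))
```

With the unbiased signal the actor stays indifferent between the two actions; with the biased critic it ends up strongly preferring blue diamonds, even though the true reward never distinguished them.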
For PolicyGradientBot on the other hand, while it could still result in alignment failures, it seems much more like we’re just directly training a policy. But PolicyGradientBot is very slow when it comes to sample efficiency.
WRT other algorithms like temporal difference learning that lie kind of in between PolicyGradientBot and ActorCriticBot, I think the question of what happens for ActorCriticBot is already a crux in this discussion, but feel free to add more bot types if you think it would be useful.
Is ActorCriticBot robust?
Again, I’m not saying a brute force search over plans is being done here, but I’d generally expect that what the actor is doing is very strongly linked to what the critic values, and I’d say it’s very likely that the actor has lots of components inside of it roughly related to the question “what is the critic going to think about this situation?” For example, if the critic consistently overvalues blue, then I’d predict that the actor has lots of circuits inside of it related to blueness. Do you disagree with this?
Obviously the actor’s ideas of what’s good aren’t going to be perfectly faithful to the critic: there will exist some adversarial plans that the actor just isn’t going to generate. But again the question is: why should we be so lucky that the errors we’re making exactly cancel out? I don’t see any reason to expect that the actor’s imperfect approximation of the critic and the critic’s imperfect approximation of our true desires should cancel out so well that the actor never generates any adversarial plans at all.