FWIW the kinds of agents I am imagining go beyond choosing their next actions, they also choose their next thoughts including thoughts-about-planning like “Call self.get_next_action(...) with this input”. That is the mechanism by which the agent binds its entire chain of thought—not just the final action produced—to its desires, by using planning reflectively as a tool to achieve ends.
If I’m getting your point right here, the basic idea is that the world model network is just being trained to get correct predictions, so that’s all okay, but the action network is not being trained to search out weird hacky exploits of the value function, because whenever it found one of those during training, the humans sent a negative reward, which caused the value function to self-correct, so then the action network didn’t get the reward it was hoping for.
Obstacles to this as an alignment solution:
No, that wasn’t intended to be my point. I wasn’t saying that I have an alignment solution, or saying that learning a correct-but-not-adversarially-robust value function and policy for Diamonds is something that we know how to do, or saying that doing so won’t be hard. The claim I was pushing back against was that the problem is adversarially hard. I don’t think you need a bunch of patches for it to be not-adversarially-hard, I think it is not-adversarially-hard by default.
Ok on to the substance:
You’re not going to keep training the other networks in the agent after you’ve finished training the value network right? Right? Okay, good, ’cause that would totally wreck everything. Moving along...
Whoa whoa no I think the agent will very much need to keep updating large parts of its value function along with its policy during deployment, so there’s no “after you’ve finished” (assuming you == the AI). That’s a significant component of my threat model. If you think an AGI without this could be competitive, I am curious how.
Most central concern: The deployment environment is generally not going to be identical to the training environment, and could be quite different. Basic scenario: The value network mostly learns to assign the right values to things in the context of the training environment, maybe it fails in a few places. The planning module’s search algorithms are adapted to work around those failures and not run into them while picking actions. Other than those patches, the planning module generally performs well, i.e. it acts like something that can reason about the world and select courses of action that are most likely to lead to a high score from the value function. Then switch to deployment. The value function suddenly has lots more holes that the planning module hasn’t developed patches to work around. Plus maybe some of the existing patches don’t generalize to the new environment. “Maybe we can engineer a type-X Liemond” is going to be an obvious thought to the planning module, when only type-Y Liemonds were possible in the training environment, and thus the planning module’s search process only has patches preventing it from discovering type-Y Liemonds.
I don’t really understand why this is the relevant scenario. The crux being discussed is whether the value function needs to be robust to adversarial distribution shifts, not whether it needs to be robust to ordinary non-adversarial distribution shifts. I think the relevant scenario for our thread would be an agent that correctly learned a value+policy function that picks out Diamonds in the training scenarios, and which learned a generally-correct concept of Diamonds, but where there are findable edge cases in its concept boundary such that it would count type-X Liemonds as Diamonds if presented with them.
The question, then, is why would it be thinking about how to produce type-X Liemonds in particular? What reason would the agent have to pursue thoughts that it thinks lead to type-X Liemonds over thoughts that it thinks lead to Diamonds? Presumably type-X Liemonds are produced by different means from how Diamonds are produced, since you mentioned they are easier to produce, so thoughts about how to produce one vs. thoughts about how to produce the other will be different. As far as the agent can tell, during training it only ever got reinforcement for Diamonds (the label for the cluster in concept-space), so the concepts that got painted with strong positive valence in its world model, such that it will actively build plans with them as targets, are ones like “Diamond” and “the process that I’ve seen produces Diamonds”. Whereas it has never had a reason to paint concepts like “finding a loophole in my Diamond concept” or “the process that produces type-X Liemonds” with that sort of valence, much less to make them plan targets.
The “try weird hacks” circuit stops getting reinforced, since the value function no longer recommends it, but there’s never any negative feedback against it, all the negative feedback happened in the value network, since that’s where gradients coming from human labels flow.
While the “try weird hacks” circuit is getting 0 net reinforcement for firing, the “build Diamonds” circuits are getting positive net reinforcement for firing, so on balance, relative to the other circuits in the agent that steer its thoughts and actions, the “try weird hacks” circuit is continually losing ground, its contribution to decisions diminishing as other circuits get their way more and more.
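To make the "losing ground" claim concrete, here is a toy sketch. It assumes, purely for illustration, that each circuit's influence on decisions is a softmax over a scalar strength, and that net reinforcement simply adds to that strength. The circuit names, learning rate, and step count are all made up; nothing here is meant as a description of a real training setup.

```python
import math

def softmax(logits):
    """Convert circuit strengths into shares of influence that sum to 1."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def simulate(steps=50, lr=0.1):
    # Hypothetical circuit strengths; both start out equal.
    logits = {"try_weird_hacks": 0.0, "build_diamonds": 0.0}
    # Assumed net reinforcement per step: zero for the hack circuit
    # (never punished, never rewarded), positive for the Diamond circuit.
    net_reinforcement = {"try_weird_hacks": 0.0, "build_diamonds": 1.0}
    history = []
    for _ in range(steps):
        # Record the hack circuit's share of influence before each update.
        history.append(softmax(logits)["try_weird_hacks"])
        for name in logits:
            logits[name] += lr * net_reinforcement[name]
    return history

history = simulate()
```

In this sketch the hack circuit's own strength never changes, yet its share of influence shrinks at every step because the competing circuit keeps growing.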
I think the agent will very much need to keep updating large parts of its value function along with its policy during deployment, so there’s no “after you’ve finished”
I think we agree here: As long as you’re updating the value function along with the rest of the agent, this won’t wreck everything. A slightly generalized version of what I was saying there still seems relevant to agents that are continually being updated: When you assign the agent tasks where you can’t label the results, you should still avoid updating any of the agent’s networks. Only updating non-value networks when you’re lacking labels to update the value network would probably still wreck everything, even if the agent will be given more labels in the future.
I don’t really understand why this is the relevant scenario. The crux being discussed is whether the value function needs to be robust to adversarial distribution shifts, not whether it needs to be robust to ordinary non-adversarial distribution shifts.
Okay, I have a completely different idea of what the crux is, so we probably need to figure this out before discussing much more, since this could be the whole reason for the disagreement. I’m definitely not saying that we need to prepare for the agent’s environment to undergo an adversarial distribution shift. The source of the adversarial inputs was always the agent itself, and those inputs are only adversarial from the perspective of the humans who trained the agent, they’re perfectly fine from the agent’s perspective. The distribution shift only reveals holes in the agent’s value function that weren’t possible in the training environment, since all the training environment holes that the agent was able to find got trained out. The agent itself will apply the adversarial pressure to exploit those holes. Does that, by any chance, completely resolve this discussion?
Okay, I have a completely different idea of what the crux is, so we probably need to figure this out before discussing much more, since this could be the whole reason for the disagreement. I’m definitely not saying that we need to prepare for the agent’s environment to undergo an adversarial distribution shift. The source of the adversarial inputs was always the agent itself, and those inputs are only adversarial from the perspective of the humans who trained the agent, they’re perfectly fine from the agent’s perspective. The distribution shift only reveals holes in the agent’s value function that weren’t possible in the training environment, since all the training environment holes that the agent was able to find got trained out. The agent itself will apply the adversarial pressure to exploit those holes. Does that, by any chance, completely resolve this discussion?
I understand what you mean but I still think it’s incorrect[1]. I think “The agent itself will apply the adversarial pressure to exploit those holes” (emphasis mine) is the key mistake. There are many directions in which the agent could apply optimization pressure and I think we are unfairly privileging the hypothesis that that direction will be towards “exploiting those holes” as opposed to all the other plausible directions, many of which are effectively orthogonal to “exploiting those holes”. I would agree with a version of your claim that said “could apply” but not with this one with “will apply”.
The mere fact that there exist possible inputs that would fall into the “holes” (from our perspective) in the agent’s value function does not mean that the agent will or even wants to try to steer itself towards those inputs rather than towards the non-“hole” high-value inputs. Remember that the trained circuits in the agent are what actually implement the agent’s decision-making, deciding what things to recognize and raise to attention, making choices about what things it will spend time thinking about, holding memories of plan-targets in mind; all based on past experiences & generalizations learned from them. Even though there are one or many nameless patterns of OOD “hole” input that would theoretically light up the agent’s value function (to MAX_INT or some other much-higher-than-desired level), such a pattern is not a feature/pattern that the agent has ever actually seen, so its cognitive terrain has never gotten differential updates that were downstream of it, so its cognition doesn’t necessarily flow towards it. The circuits in the agent’s mind aren’t set up to automatically recognize or steer towards the state-action preconditions of that pattern, whereas they are set up to recognize and steer towards the state-action preconditions of prototypical rewarded input encountered during training. In my model, that is what happens when an agent robustly “wants” something.
I suspect that many other people share your view here, but I think that view is dead-wrong and is possibly at the root of other disagreements, which is why I’m trying so hard to communicate across the gap here.
Okay, cool, it seems like we’re on the same page, at least. So what I expect to happen for AGI is that the planning module will end up being a good general-purpose optimizer: Something that has a good model of the world, and uses it to find ways of increasing the score obtained from the value function. If there is an obvious way of increasing the score, then the planning module can be expected to discover it, and take it.
Scenario: We have managed to train a value function that values Liemonds as well as Diamonds. These both get a high score according to the value function, it doesn’t really distinguish between the two. In your model, the agent as a whole does distinguish between them, so there must be somewhere in the agent where the “Diamonds are better than Liemonds” information is stored, even if it’s implicit rather than showing up explicitly in the value function.
It sounds like what you’re saying is that networks related to the planning module store this information. Rather than being a good general-purpose optimizer, the planning module has some plans that it just won’t consider. This property, of not considering certain plans that might break the value function, is what I referred to as “patches”. As far as I can tell, your model depends on these “patches” generalizing really well, so that all chains of reasoning that could lead to discovering Liemonds are blocked.
I find it fairly implausible that we’ll be able to get “don’t make any plans that break the value function” to generalize any better than the value function itself. If discovering Liemonds is like trying to find one of 1000 needles in a haystack, well finding needles in a haystack is what an optimizer is good at, and it’ll probably discover Liemonds pretty soon. Now if we restrict the allowed reasoning steps so that far fewer of the Liemond plans are thoughts the agent is even capable of thinking, well that’s now like finding one of the 5 remaining needles in the haystack, but we’ve still got an optimizer searching for that needle, and so we’re still in the adversarial situation.
Obviously the system can be made arbitrarily safe by weakening and restricting the optimizer enough, but that also stops our AI from being able to do much of anything.
You ask:
The question, then, is why would it be thinking about how to produce type-X Liemonds in particular? What reason would the agent have to pursue thoughts that it thinks lead to type-X Liemonds over thoughts that it thinks lead to Diamonds?
As maybe an intuition pump for this, let’s say there’s a human whose value function puts a high value on saving people’s lives. One strategy this person might think of is: “If I imagine uploading people into a computer simulation, and the simulation is designed to be pleasant and allow everyone uploaded into it to do things together, I think those people very much count as alive, under my values. Therefore, if I pursue the strategy of helping people freeze their heads for later uploading, I will save lots of lives.”
That sure looks like a weird hack! Especially from the perspective of Evolution, since none of those uploaded humans have any genes at all any more! Why didn’t that human’s reflective decisions about what thoughts to think prevent it from even considering the question “could I upload people?” During that human’s entire training so far, it’s only had to deal with saving the lives of real live humans, with genes and everything. That’s where all its positive feelings about saving lives came from. Why is it suddenly considering this weird dumb thing!?
Looks like we’re popping back up to an earlier thread of the conversation? Curious what your thoughts on the parent comment were, but I will address your latest comment here. :)
So what I expect to happen for AGI is that the planning module will end up being a good general-purpose optimizer: Something that has a good model of the world, and uses it to find ways of increasing the score obtained from the value function. If there is an obvious way of increasing the score, then the planning module can be expected to discover it, and take it.
I think the AGI will be able to do general-purpose optimization, and that this could be implemented via an internal general-purpose optimization subroutine/module it calls (though, again, I think there’s a flavor of homunculus in this design that I dislike). I don’t see any reason to think of this subroutine as something that itself cares about value function scores, it’s just a generic function that will produce plans for any goal it gets asked to optimize towards. In my model, if there’s a general-purpose optimization subroutine, it takes as an argument the (sub)goal that the agent is currently thinking of optimizing towards and the subroutine spits out some answer, possibly internally making calls to itself as it splits the problem into smaller components.
In this model, it is false that the general-purpose optimization subroutine is freely trying to find ways to increase the value function scores. Reaching states with high value function scores is a side effect of the plans it outputs, but not the object-level goal of its planning. IF the agent were to set “maximize my nominal value function” as the (sub)goal that it inputs to this subroutine, THEN the subroutine will do what you described, and then the agent can decide what to do with those results. But I dispute that this is the default expectation we should have for how the agent will use the subroutine. Heck, you can do general-purpose optimization, and yet you don’t necessarily do that thing. Instead you ask the subroutine to help with the object-level goals that you already know you want, like “plan a route from here to the bar” and “find food to satisfy my current hunger” and “come up with a strategy to get a promotion at work”. The general-purpose optimization subroutine isn’t pulling you around. In fact, sometimes you reject its results outright, for example because the recommended course of action is too unusual or contradicts other goals of yours or is too extreme.
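Here is a minimal sketch of what I mean by a planner that takes the (sub)goal as an argument. It uses plain breadth-first search over a made-up toy world; the `plan` function, the `WORLD` graph, and the goal predicates are all hypothetical stand-ins, not a claim about how an AGI's planner would actually be built.

```python
from collections import deque

def plan(start, is_goal, neighbors):
    """Breadth-first search: return a shortest path from `start` to any
    state satisfying `is_goal`, or None if no such state is reachable.
    Note that the planner has no notion of a value function; the goal
    is whatever predicate the caller passes in."""
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        state = path[-1]
        if is_goal(state):
            return path
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None

# Hypothetical toy world: each location lists where you can go next.
WORLD = {
    "home": ["street"],
    "street": ["bar", "office", "grocery"],
    "bar": [],
    "office": [],
    "grocery": [],
}

def neighbors(state):
    return WORLD.get(state, [])

# The agent decides which object-level goal to hand to the planner.
route_to_bar = plan("home", lambda s: s == "bar", neighbors)
route_to_food = plan("home", lambda s: s == "grocery", neighbors)
```

The same subroutine serves both calls. Which plans come out depends entirely on the goal the caller supplies, and the caller is free to discard the result.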
Scenario: We have managed to train a value function that values Liemonds as well as Diamonds. These both get a high score according to the value function, it doesn’t really distinguish between the two. In your model, the agent as a whole does distinguish between them, so there must be somewhere in the agent where the “Diamonds are better than Liemonds” information is stored, even if it’s implicit rather than showing up explicitly in the value function.
If the system as a whole literally cannot distinguish between Diamonds and Liemonds, even post-training, then yes you can fool it into making Liemonds. But I still don’t see why it would end up making Liemonds over Diamonds.
During training, the agent repeatedly sees these shiny things labeled as “Diamonds”, and keeps having reinforcement tied to getting these objects that it thinks correspond to a “Diamond” concept, and those positive memories build up a whole bunch of neural circuitry that fires in favor of actions and plans tied to the “Diamond” concept in its world model. When the agent goes out into the world, it has an internal goal of “how do I get Diamonds”. Let’s say the agent has no clue about Liemonds yet, so it has no conceptual distinction between Liemonds and Diamonds, so if it saw a Liemond it would gladly accept it because its value function would output a high score, thinking it had received a Diamond.
In order to achieve that internal goal, it can either use the Diamond-acquisition heuristics it learned during training (which, presumably, will not suddenly preferentially lead to Liemonds now, unless the distribution shift was adversarially constructed), or it can reach out into the world (through experimentation, through books, through the web, through humans) and try to figure out how to dereference its “Diamond” conceptual pointer and find out how to get those things. Maybe it directly investigates the English-language word “diamond”, maybe it looks for information about “clear brilliant hard things”, maybe it looks into “things that are associated with weddings in such-and-such way”. Whatever. When it does so, the world gives it a whole bunch of information that will be strongly correlated with how to get Diamonds, but not strongly correlated with how to get Liemonds. In order to find out how to get Liemonds, the agent would have to submit different queries to the world.
But the agent doesn’t have any reason to submit queries that favor Liemonds over Diamonds. At worst, it has reason to submit queries that don’t distinguish between Diamonds and Liemonds at all. At best, it has reason to submit a bunch of queries that cause it to correctly ground-out its previously-ambiguous pointer into the correct concept.
Rather than being a good general-purpose optimizer, the planning module has some plans that it just won’t consider. This property, of not considering certain plans that might break the value function, is what I referred to as “patches”.
It’s not that the planning module refuses to consider plans that might break the value function. The planning module is free to generate whatever plans it wants for the inputs it gets. It’s that the agent isn’t calling the planning module with “maximize nominal value function scores” as the (sub)goal, so the planning module has no reason to concoct those particular plans in the first place.
Meta-level comment: the space of possible plans is enormous and the world is fractally complex. In order to effectively plan towards nontrivial goals, the agent has to restrict its search in a boatload of ways. I don’t see why restricting the sorts of plans you consider would constitute a “patch” a priori, rather than just counting as “effectively searching for what you want”.
If discovering Liemonds is like trying to find one of 1000 needles in a haystack, well finding needles in a haystack is what an optimizer is good at, and it’ll probably discover Liemonds pretty soon. Now if we restrict the allowed reasoning steps so that far fewer of the Liemond plans are thoughts the agent is even capable of thinking, well that’s now like finding one of the 5 remaining needles in the haystack, but we’ve still got an optimizer searching for that needle, and so we’re still in the adversarial situation.
I don’t think this analogy works here. The optimizer isn’t looking for Liemonds specifically; it’s looking for “Diamonds”, a category which initially includes both Diamonds and Liemonds. The more it finds real Diamonds in the haystack, the more its policy and value function get reinforced towards reliably finding and accepting those things (i.e. Diamonds), and shaped away from finding and accepting Liemonds (and vice versa). Whether this happens depends in part on the relative abundance of the two. Moreover, that relative abundance isn’t fixed like it is with a haystack. The relative abundance of Diamonds and Liemonds in the agent’s search space depends entirely on the policy. If the policy is doing things like “learning how Diamonds are made” or “going to jewelry stores and asking for Diamonds”, then that distribution will have lots of Diamonds and synthetic Diamonds, but not Liemonds. Notice that these factors create a virtuous cycle, where the initial relative abundance of true Diamonds over Liemonds in its trajectories causes the agent to get better at targeting true Diamonds, which makes true Diamonds more abundant in its search space, and so on. This is why I don’t accept the whole “eventually it’ll search for and find Liemonds” line of reasoning.
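As a rough illustration of the feedback loop I have in mind, here is a toy model. It assumes, and this is a big simplification, that the policy's targeting can be summarized by a single log-odds number, and that reinforcement toward each kind of object is proportional to how often that kind currently shows up in the agent's trajectories. The function names and constants are invented for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def run(initial_diamond_share, steps=200, lr=0.5):
    """Return the final share of trajectories that end at true Diamonds."""
    # Log-odds that a trajectory ends at a true Diamond rather than a Liemond.
    preference = math.log(initial_diamond_share / (1 - initial_diamond_share))
    for _ in range(steps):
        share = sigmoid(preference)
        # Assumed update rule: reinforcement toward each kind scales with
        # how often it is encountered, so the majority kind gains ground.
        preference += lr * (share - (1 - share))
    return sigmoid(preference)

mostly_diamonds = run(0.8)  # starts with Diamonds more abundant
mostly_liemonds = run(0.2)  # the "vice versa" case
balanced = run(0.5)         # no initial asymmetry, so no drift
```

Under these assumptions the outcome is set by the starting abundance, which is itself a product of the policy. That is the sense in which the haystack is not fixed.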
Notice also that all of this happens without weakening the optimizer. The agent wants Diamonds, and it wants to ground its conceptual pointer accurately in the world, keeping its concept in continuity with prototypical Diamonds it got training reinforcement for, and the agent ends up searching for real Diamonds with the full force of its creative and general-purpose search.
The optimizer isn’t looking for Liemonds specifically; it’s looking for “Diamonds”, a category which initially includes both Diamonds and Liemonds.
There are many directions in which the agent could apply optimization pressure and I think we are unfairly privileging the hypothesis that that direction will be towards “exploiting those holes” as opposed to all the other plausible directions, many of which are effectively orthogonal to “exploiting those holes”.
Just to clarify the parameters of the thought experiment, Liemonds are specified to be much easier to produce in large quantities than Diamonds, so the score attainable by producing them is many times higher than the maximum possible Diamond score.
The thing that stands out about the holes is that some of them allow the agent to (incorrectly) get an extraordinarily high score. The agent isn’t going to care about holes that allow it to get an incorrectly low score, or a score that is correct, but for weird incorrect reasons, though those kinds of hole will exist too.
The relative abundance of Diamonds and Liemonds in the agent’s search space depends entirely on the policy. If the policy is doing things like “learning how Diamonds are made” or “going to jewelry stores and asking for Diamonds”, then that distribution will have lots of Diamonds and synthetic Diamonds, but not Liemonds.
It seems like maybe the problem here is that you’re modelling the agent as fairly dumb, certainly dumber than a human level intelligence? Like, if the agent’s version of “how do I decide what to do next” is based entirely off of things like “do what worked before”, and doesn’t involve doing any actual original reasoning, then yeah, it’s probably not going to think of making Liemonds. Adversarial robustness is much easier if your adversary is not too smart.
I’m generally modelling the agent as being more intelligent: close to human if not greater than human. I generally expect that something this smart would think of “Liemonds”, in the same way that a human might think of “save people’s lives by uploading them” in my example above.
Just to clarify the parameters of the thought experiment, Liemonds are specified to be much easier to produce in large quantities than Diamonds, so the score attainable by producing them is many times higher than the maximum possible Diamond score.
The thing that stands out about the holes is that some of them allow the agent to (incorrectly) get an extraordinarily high score. The agent isn’t going to care about holes that allow it to get an incorrectly low score, or a score that is correct, but for weird incorrect reasons, though those kinds of hole will exist too.
I get that. In addition, Liemonds and Diamonds are in reality different objects that require different processes to acquire, right? Like, even though Liemonds are easier to produce in large quantities if that’s what you’re going for, you won’t automatically produce Liemonds on the route to producing Diamonds. If you’re trying to produce Diamonds, you can end up accidentally producing other junk by failing at diamond manufacturing, but you won’t accidentally produce Liemonds. So unless you are intentionally trying to produce Liemonds, say as an instrumental route to “produce the maximum possible Diamond score”, you won’t produce them.
It sounds like the reason you think the agent will intentionally produce Liemonds is as an instrumental route to getting the maximum possible Diamond score. I agree that that would be a great way to produce such a score. But AFAICT getting the maximum possible Diamond score is not “what the agent wants” in general. Reward is not the optimization target, and neither is the value function. Agents use a value function, but the agent’s goals =/= maximal scores from the value function. The value function aggregates information from the current agent state to forecast the reward signal. It’s not (inherently) a goal, an intention, a desire, or an objective. The agent could use high value function scores as a target in planning if it wanted to, but it could use anything it wants as a target in planning, the value function isn’t special in that regard. I expect that agents will use planning towards many different goals, and subgoals of those goals, and so on, with the highest level goals being about the concepts in the world model, not the literal outputs of the value function. I suspect you disagree with this and that this is a crux.
It seems like maybe the problem here is that you’re modelling the agent as fairly dumb, certainly dumber than a human level intelligence?
No, I am modeling the agent as being quite intelligent, at least as intelligent as a human. I just think it deploys that intelligence in service of a different motivational structure than you do.
Reward is not the optimization target, and neither is the value function.
Yeah, agree that reward is not the optimization target. Otherwise, the agent would just produce diamonds, since that’s what the rewards are actually given out for (or seize the reward channel, but we’ll ignore that for now). I’m a lot less sure that the value function is not the optimization target. Ignoring other architectures for the moment, consider a design where the agent has a value function and a world model, uses Monte-Carlo tree search, and picks the action that gives the highest expected score according to the value function. In that case, I’m pretty comfortable saying, “yeah, that particular type of agent is in fact optimizing for its value function”. Would you agree (just for that specific agent design)?
The value function aggregates information from the current agent state to forecast the reward signal. It’s not (inherently) a goal, an intention, a desire, or an objective. The agent could use high value function scores as a target in planning if it wanted to, but it could use anything it wants as a target in planning, the value function isn’t special in that regard. I expect that agents will use planning towards many different goals, and subgoals of those goals, and so on, with the highest level goals being about the concepts in the world model, not the literal outputs of the value function. I suspect you disagree with this and that this is a crux.
I think you’ve identified the crux correctly there, with the caveat in my position that the value function doesn’t necessarily have to be labelled “value function” on the blueprint.
Maybe the single most helpful thing would be if you just described the agent you have in mind as doing all these things in as much detail as possible. Like, the best thing would be on a level of describing what all the networks in the agent are, how they’re connected, and the gist of what they’re tuned to achieve during training. I’ll take whatever you can provide in terms of detail, though. Also, feel free to reduce down to the simplest agent that displays the robustness properties you’re describing here.
Ignoring other architectures for the moment, consider a design where the agent has a value function and a world model, uses Monte-Carlo tree search, and picks the action that gives the highest expected score according to the value function. In that case, I’m pretty comfortable saying, “yeah, that particular type of agent is in fact optimizing for its value function”. Would you agree (just for that specific agent design)?
I’m fine with describing that design like that. Though I expect we’d need a policy or some other method of proposing actions for the world model/MCTS to evaluate, or else we haven’t really specified the design of how the agent makes decisions.
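To pin down what I mean about needing a proposer, here is a stripped-down sketch of that design. It swaps the Monte-Carlo tree search for one-step lookahead to keep it short, and the world model, the deliberately flawed value function, and the two proposal policies are all hypothetical toys.

```python
def choose_action(state, propose_actions, world_model, value_fn):
    """Pick the proposed action whose predicted next state scores highest
    under value_fn. Only actions the proposer offers are ever evaluated."""
    candidates = propose_actions(state)
    return max(candidates, key=lambda a: value_fn(world_model(state, a)))

def world_model(state, action):
    # State is a (diamonds, liemonds) pair; the effects are made-up numbers.
    diamonds, liemonds = state
    if action == "mine_diamond":
        return (diamonds + 1, liemonds)
    if action == "make_liemonds":
        return (diamonds, liemonds + 10)
    return state  # "wait" changes nothing

def value_fn(state):
    # Deliberately flawed: it cannot tell Diamonds and Liemonds apart.
    diamonds, liemonds = state
    return diamonds + liemonds

def narrow_proposer(state):
    # A proposer that never raises the Liemond option.
    return ["wait", "mine_diamond"]

def broad_proposer(state):
    # A proposer that does raise it.
    return ["wait", "mine_diamond", "make_liemonds"]

narrow_choice = choose_action((0, 0), narrow_proposer, world_model, value_fn)
broad_choice = choose_action((0, 0), broad_proposer, world_model, value_fn)
```

With the same value function and world model, the chosen action differs depending on what gets proposed. In this toy, at least, the open question is which proposer a trained agent ends up with.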
Maybe the single most helpful thing would be if you just described the agent you have in mind as doing all these things in as much detail as possible. Like, the best thing would be on a level of describing what all the networks in the agent are, how they’re connected, and the gist of what they’re tuned to achieve during training. I’ll take whatever you can provide in terms of detail, though. Also, feel free to reduce down to the simplest agent that displays the robustness properties you’re describing here.
Hmm. I wasn’t imagining that any particularly exotic design choices were needed for my statements to hold, since I’ve mostly been arguing against things being required. What robustness properties are you asking about?
A shot at the diamond alignment problem is probably a good place to start, if you’re after a description of how the training process and internal cognition could work along a similar outline to what I was describing.
Sorry for the slow response, lots to read through and I’ve been kind of busy. Which of the following would you say most closely matches your model of how diamond alignment with shards works?
1. The diamond abstraction doesn’t have any holes in it where things like Liemonds could fit in, due to the natural abstraction hypothesis. The training process is able to find exactly this abstraction and include it in the agent’s world model. The diamond shard just points to the abstraction in the world model, and thus also has no holes.
2. Shards form a kind of vector space of desire. Rather than thinking of shards as distinct circuits, we should think of them as different amounts of pull in different directions. An agent that would pursue both diamonds and liemonds should be thought of as a linear combination of a diamond shard and a liemond shard. Thus, it makes sense to refer to a shard that points exactly in the direction of diamonds, with no liemond component. The liemond component might also be present in the agent, but we can conceptualize it as a separate shard.
3. Shards can be imperfect, the above two theories trying to make them perfect are silly. Shards aren’t like a value function that looks at the world and assigns a value. Instead...
a. …the agent is set up so that no part of the agent is “trying to make the shard happy”. (I think I’d still need a diagram that showed gradient flow here so I could satisfy myself that this was the case.)
b. …shards make bids on what the world should look like, using an internal language that is spoken by the world model. A diamond shard is just something that knows how to say “diamond” in this internal language.
c. …plans are generated by a “plausible plan generator” and then shards bid on those plans by the amount that they expect to be satisfied. Somehow the plausible plan generator is able to avoid “hacking” the shards by generating plans that they will incorrectly label as good.
3 is the closest. I don’t even know what it would mean for a shard to be “perfect”. I have a concept of diamonds in my world model, and shards attached to that concept. That concept includes some central features like hardness, and some peripheral features like associations with engagement rings. That concept doesn’t include everything about diamonds, though, and it probably includes some misconceptions and misgeneralizations. I could certainly be fooled in certain circumstances into accepting a fake diamond, for ex. if a seemingly-reputable jewelry store told me it was real.
Notice that there is a giant difference between “try to get diamonds” and “try to get the diamond-shard to be happy”, analogous to the difference between “try to make a million bucks” and “try to convince yourself you made a million bucks”. If I wanted to generate a plan to convince myself I’d made a million bucks, my plan-generator could, but I don’t want to, because that isn’t a strategy I expect to help me get the things I want, like a million bucks. My shards shape what plans I execute, including plans about how I should do planning. The shard is the thing doing optimization in conjunction with the rest of the agent, not the thing being optimized against. If there was a part of you trying to make the diamond shard happy, that’d be, like, a diamond-shard-shard (latched onto the concept of “my diamond shard” in your ontology).
Cool, thanks for the reply, sounds like maybe a combination of 3a and the aspect of 1 where the shard points to a part of the world model? If no part of the agent is having its weights tuned to choose plans that make a shard happy, where would you say a shard mostly lives in an agent? World model? Somewhere else? Spread across multiple components? (At the bottom of this comment, I propose a different agent architecture that we can use to discuss this that I think fairly naturally matches the way you’ve been talking about shards.)
Notice that there is a giant difference between “try to get diamonds” and “try to get the diamond-shard to be happy”, analogous to the difference between “try to make a million bucks” and “try to convince yourself you made a million bucks”. If I wanted to generate a plan to convince myself I’d made a million bucks, my plan-generator could, but I don’t want to, because that isn’t a strategy I expect to help me get the things I want, like a million bucks.
My model doesn’t predict that most agents will try to execute “fool myself into thinking I have a million bucks” style plans. If you think my model predicts that, then maybe this is an opportunity to make progress?
In my model, the agent is actually allowed to care about actual world states and not just its own internal activations. Consider two ways the agent could “fool” itself into thinking it had a million bucks:
Firstly, it could tamper with its own mind to create that impression. This could be either through hacking, or carefully spoofing its own sensory inputs. When planning, the predicted future states of the world are going to correctly show that the agent is fooling itself. So when the value function is fed the predicted world states, it’s going to rate them as bad, since in those world states, the agent does not have a million bucks. It doesn’t matter to the agent that later, after being hacked, the value function will think the agent has a million bucks. Right now, during planning, the value function isn’t fooled.
Secondly, it could create a million counterfeit bucks. Due to inaccuracies in training, maybe the agent actually thinks that having counterfeit money is just as good as real money. I.e. the value function does actually rate counterfeit bucks higher than real bucks. If so, then the agent is going to be perfectly satisfied with itself for coming up with this clever idea for satisfying its true values. The humans who were training the agent and wanted a million actual dollars won’t be satisfied, but that’s their problem, not the agent’s problem.
Okay, so here’s another possible agent design that we might be able to discuss: There’s a detailed world model that can answer various probabilistic queries about the future. Planning is done by formulating a query to the world model in the following way: Sample histories h according to:
$$P(h) = \frac{\exp(U(h)/T)}{\sum_{h'} \exp(U(h')/T)}$$
Where U(h) is utility assigned by the agent to a given history, and T is some small number which is analogous to temperature. Histories include the action taken by the agent, so we can sample with the action undetermined, and then actually take whichever action happened in the sampled history. (For simplicity, I’m assuming that the agent only takes one action per episode.)
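To make the sampling rule concrete, here is a minimal sketch of it over a tiny, fully enumerated set of histories. The toy histories, the `toy_utility` function, and the helper names are all hypothetical illustrations, and the code takes the formula literally (every enumerated history gets weight exp(U(h)/T) and nothing else); a real world model would not be able to enumerate histories like this.

```python
import math
import random

def boltzmann_weights(utilities, temperature):
    """Return P(h) = exp(U(h)/T) / sum_h' exp(U(h')/T) for each history."""
    # Subtract the max before exponentiating for numerical stability;
    # this leaves the normalized probabilities unchanged.
    m = max(utilities)
    exps = [math.exp((u - m) / temperature) for u in utilities]
    z = sum(exps)
    return [e / z for e in exps]

def sample_action(histories, utility, temperature, rng):
    """Sample one history from P(h), then return the action it contains."""
    probs = boltzmann_weights([utility(h) for h in histories], temperature)
    chosen = rng.choices(histories, weights=probs, k=1)[0]
    return chosen["action"]

# Hypothetical toy world: a history is one action plus one outcome.
HISTORIES = [
    {"action": "mine", "outcome": "diamond"},
    {"action": "mine", "outcome": "nothing"},
    {"action": "idle", "outcome": "nothing"},
]

def toy_utility(history):
    """Assumed utility: 1 for histories that end with a diamond, else 0."""
    return 1.0 if history["outcome"] == "diamond" else 0.0

if __name__ == "__main__":
    print(sample_action(HISTORIES, toy_utility, 0.1, random.Random(0)))
```

As T shrinks, almost all probability mass lands on the highest-utility history, so the rule behaves like a softened argmax over whatever U happens to rate most highly.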
To make this a “shardy” agent, we’d presumably have to replace U with something made out of shards. As long as it shows up like that in the equation for P(h), though, it looks like we’d have to make U adversarially robust. I’d be interested in what other modifications you’d want to make to this agent in order to make it properly shardy. This agent design does have the advantage that it seems like values are more directly related to the world model, since they’re able to directly examine h.
If no part of the agent is having its weights tuned to choose plans that make a shard happy, where would you say a shard mostly lives in an agent? World model? Somewhere else? Spread across multiple components?
I’d say the shards live in the policy, basically, though these are all leaky abstractions. A simple model would be an agent consisting of a big recurrent net (RNN or transformer variant) that takes in observations and outputs predictions through a prediction head and actions through an action head, where there’s a softmax to sample the next action (whether an external action or an internal action). The shards would then be circuits that send outputs into the action head.
My model doesn’t predict that most agents will try to execute “fool myself into thinking I have a million bucks” style plans. If you think my model predicts that, then maybe this is an opportunity to make progress?
I brought this up because I interpreted your previous comment as expressing skepticism that
no part of the agent is “trying to make the shard happy”
Whereas I think that it will be true for analogous reasons as the reasons that explain why no part of the agent is “trying to make itself believe it has a million bucks”.
Due to inaccuracies in training, maybe the agent actually thinks that having counterfeit money is just as good as real money. I.e. the value function does actually rate counterfeit bucks higher than real bucks. If so, then the agent is going to be perfectly satisfied with itself for coming up with this clever idea for satisfying its true values.
I have a vague feeling that the “value function map = agent’s true values” bit of this is part of the crux we’re disagreeing about.
Putting that aside, for this to happen, it has to be simultaneously true that the agent’s world model knows about and thinks about counterfeit money in particular (or else it won’t be able to construct viable plans that produce counterfeit money) while its value function does not know or think about counterfeit money in particular. It also has to be true that the agent tends to generate plans towards counterfeit money over plans towards real money, or else it’ll pick a real money plan it generates before it has had a chance to entertain a counterfeit money plan.
But during training, the plan generator was trained to generate plans that lead to real money. And the agent’s world model / plan generator knows (or at least thinks) that those plans were towards real money, even if its value function doesn’t know. This is because it takes different steps to acquire counterfeit money than to acquire real money. If the plan generator was optimized based on the training environment, and the agent was rewarded there for doing the sorts of things that lead to acquiring real money (which are different from the things that lead to counterfeit money), then those are the sorts of plans it will tend to generate. So why are we hypothesizing that the agent will tend to produce and choose the kinds of counterfeit money plans its value function would “fail” (from our perspective) on after training?
Okay, so here’s another possible agent design that we might be able to discuss:
At best, the agent could sample from a learned history generator, one tuned on previous good histories, and then evaluate some number of possibilities from that distribution, picking one that’s good according to its evaluation. But that doesn’t require adversarial robustness, because the history generator will tend strongly to generate possibilities like the ones that evaluated/worked well in the past, which is exactly where the evaluations will tend to be fairly accurate. And the better the generator, the fewer options need to be sampled, so the less you’re “implicitly optimizing against” the evaluations.
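A toy sketch of the contrast being drawn here, under made-up assumptions: a one-dimensional "plan space", a hypothetical evaluator `learned_eval` that is accurate near past experience but has a spuriously high-scoring region far away, and a hypothetical `generator` concentrated on plans like the ones that worked before. None of these names come from the discussion; they only illustrate sample-then-filter versus searching the whole space.

```python
import random

def learned_eval(x):
    """Non-robust evaluator: sensible near x = 1, badly wrong far away."""
    if x > 50.0:  # an "adversarial" region never visited in training
        return 1000.0
    return -abs(x - 1.0)

def generator(rng):
    """Learned proposal distribution, concentrated near plans that worked."""
    return rng.gauss(1.0, 2.0)

def best_of_n(n, rng):
    """Sample n candidates from the generator and keep the best-scoring one."""
    candidates = [generator(rng) for _ in range(n)]
    return max(candidates, key=learned_eval)

def exhaustive_argmax(grid):
    """Score every point of the space, including the adversarial region."""
    return max(grid, key=learned_eval)

if __name__ == "__main__":
    grid = [i * 0.5 for i in range(-200, 201)]  # covers -100.0 .. 100.0
    print("best-of-10k pick:", best_of_n(10_000, random.Random(0)))
    print("exhaustive pick: ", exhaustive_argmax(grid))
```

In this toy setup the filtered pick stays near the region where the evaluator is accurate, while the exhaustive search lands in the spurious region. Whether a real generator keeps that property under heavier selection pressure is exactly the point under dispute.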
A simple model would be an agent consisting of a big recurrent net (RNN or transformer variant) that takes in observations and outputs predictions through a prediction head and actions through an action head, where there’s a softmax to sample the next action (whether an external action or an internal action). The shards would then be circuits that send outputs into the action head.
Thanks for describing this. Technical question about this design: How are you getting the gradients that feed backwards into the action head? I assume it’s not supervised learning where it’s just trying to predict which action a human would take?
IMO an agent can’t work this way, at least not an embedded one.
I’m aware of the issue of embedded agency, I just didn’t think it was relevant here. In this case, we can just assume that the world looks fairly Cartesian to the agent. The agent makes one decision (though possibly one from an exponentially large decision space) then shuts down and loses all its state. The record of the agent’s decision process in the future history of the world just shows up as thermal noise, and it’s unreasonable to expect the agent’s world model to account for thermal noise as anything other than a random variable. As a Cartesian hack, we can specify a probability distribution over actions for the world model to use when sampling. So for our particular query, we can specify a uniform distribution across all actions. Then in reality, the actual distribution over actions will be biased towards certain actions over others because they’re likely to result in higher utility.
Putting that aside, for this to happen, it has to be simultaneously true that the agent’s world model knows about and thinks about counterfeit money in particular (or else it won’t be able to construct viable plans that produce counterfeit money)
This seems to be phrased in a weird way that rules out creative thinking. Nikola Tesla didn’t have three phase motors in his world-model before he invented them, but he was able to come up with them (his mind was able to generate a “three phase motor” plan) anyways. The key thing isn’t having a certain concept already existing in your world model because of prior experience. The requirement is just that the world model is able to reason about the thing. Nikola Tesla knew enough E&M to reason about three phase motors, and I expect that smart AIs will have world models that can easily reason about counterfeit money.
while its value function does not know or think about counterfeit money in particular.
The job of a value function isn’t to know or think about things. It just gives either big numbers or small numbers when fed certain world states. The value function in question here gives a big number when you feed it a world state containing lots of counterfeit money. Does this mean it knows about counterfeit money? Maybe, but it doesn’t really matter.
A more relevant question is whether the plan-proposer knows the value function well enough that it knows the value function will give it points for suggesting the plan of producing counterfeit money. One could say probably yes, since it’s been getting all its gradients directly from the value function, or probably no, since there were no examples with counterfeit money in the training data. I’d say yes, but I’m guessing that this issue isn’t your crux, so I’ll only elaborate if asked.
It also has to be true that the agent tends to generate plans towards counterfeit money over plans towards real money, or else it’ll pick a real money plan it generates before it has had a chance to entertain a counterfeit money plan.
This sounds like you’re talking about a dumb agent; smart agents generate and compare multiple plans and don’t just go with the first plan they think of.
So why are we hypothesizing that the agent will tend to produce and choose the kinds of counterfeit money plans its value function would “fail” (from our perspective) on after training?
Generalization. For a general agent, thinking about counterfeit money plans isn’t that much different than thinking of plans to make money. Like if the agent is able to think of money making plans like starting a restaurant, or working as a programmer, or opening a printing shop, then it should also be able to think of a counterfeit money making plan. (That plan is probably quite similar to the printing shop plan.)
Thanks for describing this. Technical question about this design: How are you getting the gradients that feed backwards into the action head? I assume it’s not supervised learning where it’s just trying to predict which action a human would take?
Could be from rewards or other “external” feedback, could be from TD/bootstrapped errors, could be from an imitation loss or something else. The base case is probably just plain ol’ rewards that get backpropagated through the action head via policy gradients.
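For the base case, a minimal sketch of what "rewards backpropagated through the action head via policy gradients" can look like is a REINFORCE update on a softmax policy. This is a stateless three-action toy, not the big recurrent net described above; the reward function, learning rate, and function names are assumptions made up for the example.

```python
import math
import random

def softmax(logits):
    """Convert action-head logits into sampling probabilities."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, reward_fn, lr, rng):
    """One REINFORCE update: sample an action, scale grad log-prob by reward."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    reward = reward_fn(action)
    # d/d logit_i of log pi(action) is (1[i == action] - probs[i]).
    return [
        l + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]

def train(steps, seed=0):
    """Train the toy policy; the reward for action 2 is a hypothetical choice."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    reward_fn = lambda a: 1.0 if a == 2 else 0.0
    for _ in range(steps):
        logits = reinforce_step(logits, reward_fn, lr=0.5, rng=rng)
    return softmax(logits)

if __name__ == "__main__":
    print(train(500))
```

Note that the update only touches the action that was actually sampled and rewarded, so nothing here searches over inputs to a value function.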
In this case, we can just assume that the world looks fairly Cartesian to the agent. The agent makes one decision (though possibly one from an exponentially large decision space) then shuts down and loses all its state. The record of the agent’s decision process in the future history of the world just shows up as thermal noise, and it’s unreasonable to expect the agent’s world model to account for thermal noise as anything other than a random variable. As a Cartesian hack, we can specify a probability distribution over actions for the world model to use when sampling. So for our particular query, we can specify a uniform distribution across all actions. Then in reality, the actual distribution over actions will be biased towards certain actions over others because they’re likely to result in higher utility.
Sorry for being unclear, I think you’re talking about a different dimension of embeddedness than what I was pointing at. I was talking about the issue of logical uncertainty: that the agent needs to actually run computation in order to figure out certain things. The agent can’t magically sample from P(h) proportional to exp(U(h)), because it needs the exp(U(h')) of all the other histories first in order to weigh the distribution that way, which requires having already sampled h' and having already calculated U(h'). But we are talking about how it samples a history h in the first place! The “At best” comment was proposing an alternative that might work, where the agent samples from a prior that’s been tuned based on U(h). Notice, though, that “our sampling is biased towards certain histories over others because they resulted in higher utility” does not imply “if a history would result in higher utility, then our sampling will bias towards it”.
Consider a parallel situation: sampling images and picking one that gets the highest score on a non-robust face classifier. If we were able to sample from the distribution of images proportional to their (exp) face classifier scores, then we would need to worry a lot about picking an image that’s an adversarial example to our face classifier, because those can have absurdly high scores. But instead we need to sample images from the prior of a generative model like FFHQ StyleGAN2-ADA or DDPM, and score those images. A generative model like that will tend strongly to convert whatever input entropy you give it into a natural-looking image, so we can sample & filter from it a ton without worrying much about adversarial robustness. Even if you sample 10K images and pick the 5 with the highest face classifier scores, I am betting that the resulting images will still really be of faces, rather than classifier-adversarial examples. It is true that “our top-5 images are biased towards some images over others because they had higher face classification scores” but it is not true that “if adversarial examples would have higher face classification scores than regular images, then our sampling will bias towards adversarial examples.”
The job of a value function isn’t to know or think about things. It just gives either big numbers or small numbers when fed certain world states. The value function in question here gives a big number when you feed it a world state containing lots of counterfeit money. Does this mean it knows about counterfeit money? Maybe, but it doesn’t really matter.
I disagree with this? The job of the value function is to look at the agent state (what the agent knows about the world) and estimate what the return will be based on that state & its implications. This involves “knowing things” and “thinking about things”. If the agent has a model T(s, a, s') that represents some state feature like “working a job” or “building a money-printer”, then that feature should exist in the state passed to the value function V(s)/Q(s, a) as well, such that the value function will “know” about it and incorporate it into its estimation process.
But it sounds like in the imagined scenario, the agent’s model and policy are sensitive to a bunch of stuff that the value function is blind to, which makes this configuration seem quite weird to me. If the value function was not blind to those features, then as the model goes from the training environment, where it got returns based on “getting paid money for tasks” (or whatever we rewarded for there) to the deployment environment, where the action space is even bigger, both the model/policy and the value function “generalize” and learn motivationally-relevant new facts that inform it what “getting paid money for tasks” looks like.
A more relevant question is whether the plan-proposer knows the value function well enough that it knows the value function will give it points for suggesting the plan of producing counterfeit money. One could say probably yes, since it’s been getting all its gradients directly from the value function, or probably no, since there were no examples with counterfeit money in the training data. I’d say yes, but I’m guessing that this issue isn’t your crux, so I’ll only elaborate if asked.
The plan-proposer isn’t trying to make the value function give it max points, it’s trying to [plan towards the current subgoal], which in this case is “make a million bucks”. The plan-proposer gets gradients/feedback from the value function, in the form of TD updates that tell it “this thought was current-subgoal-better/worse than expected upon consideration, do more/less of that in mental contexts like this”. But a thought like “Inspect my own value function to see what it’d give me positive TD updates for” evaluates as current-subgoal-worse for the subgoal of “make a million bucks” than a more actually-subgoal-relevant thought like “Think about what tasks are available nearby”, so the advantage is negative, which means a negative TD update, causing that thought to be progressively abandoned.
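A rough sketch of the TD mechanism described here, with every number invented for illustration: the TD error is the gap between how things looked after the thought and how they were expected to look before it, and the proposer's preference for that thought moves in the direction of the error. The thought labels and value estimates below are hypothetical.

```python
def td_error(reward, value_next, value_now, gamma=0.99):
    """TD(0) error: reward + gamma * V(next) - V(now)."""
    return reward + gamma * value_next - value_now

def update_preference(preferences, thought, delta, lr=0.1):
    """Nudge the proposer's preference for a thought by the TD error."""
    updated = dict(preferences)
    updated[thought] += lr * delta
    return updated

if __name__ == "__main__":
    prefs = {"introspect on value function": 0.0, "look for nearby tasks": 0.0}
    value_now = 0.5  # assumed estimate of progress on the current subgoal
    # Assumed: introspection leaves the agent worse placed than expected,
    # while looking for tasks leaves it better placed than expected.
    d_introspect = td_error(0.0, value_next=0.2, value_now=value_now)
    d_tasks = td_error(0.0, value_next=0.8, value_now=value_now)
    prefs = update_preference(prefs, "introspect on value function", d_introspect)
    prefs = update_preference(prefs, "look for nearby tasks", d_tasks)
    print(prefs)
```

With these assumed values the introspection thought gets a negative update and the task-seeking thought a positive one. The sketch takes the value estimates as given, so it says nothing about whether they are themselves well calibrated.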
This sounds like you’re talking about a dumb agent; smart agents generate and compare multiple plans and don’t just go with the first plan they think of.
Generalization. For a general agent, thinking about counterfeit money plans isn’t that much different than thinking of plans to make money. Like if the agent is able to think of money making plans like starting a restaurant, or working as a programmer, or opening a printing shop, then it should also be able to think of a counterfeit money making plan. (That plan is probably quite similar to the printing shop plan.)
Once again, I agree that the agent is smart, and generalizes to be able to think of such a plan, and even plans much more sophisticated than that. I am arguing that it won’t particularly want to produce/execute those plans, unless its motivations & other cognition are biased in a very specific way.
Thanks for the reply. Just to prevent us from spinning our wheels too much, I’m going to start labelling specific agent designs, since it seems like some talking-past-each-other may be happening where we’re thinking of agents that work in different ways when making our points.
PolicyGradientBot: Defined by the following description:
A simple model would be an agent consisting of a big recurrent net (RNN or transformer variant) that takes in observations and outputs predictions through a prediction head and actions through an action head, where there’s a softmax to sample the next action (whether an external action or an internal action). The shards would then be circuits that send outputs into the action head.
The base case is probably just plain ol’ rewards that get backpropagated through the action head via policy gradients.
ThermodynamicBot: Defined by the following description:
There’s a detailed world model that can answer various probabilistic queries about the future. Planning is done by formulating a query to the world model in the following way: Sample histories h according to:
$$P(h) = \frac{\exp(U(h)/T)}{\sum_{h'} \exp(U(h')/T)}$$
Where U(h) is utility assigned by the agent to a given history, and T is some small number which is analogous to temperature.
As a Cartesian hack, we can specify a probability distribution over actions for the world model to use when sampling. So for our particular query, we can specify a uniform distribution across all actions. Then in reality, the actual distribution over actions will be biased towards certain actions over others because they’re likely to result in higher utility.
Comments on ThermodynamicBot
Sorry for being unclear, I think you’re talking about a different dimension of embeddedness than what I was pointing at. I was talking about the issue of logical uncertainty: that the agent needs to actually run computation in order to figure out certain things. The agent can’t magically sample from P(h) proportional to exp(U(h)), because it needs the exp(U(h’)) of all the other histories first in order to weigh the distribution that way, which requires having already sampled h’ and having already calculated U(h’). But we are talking about how it samples a history h in the first place! The “At best” comment was proposing an alternative that might work, where the agent samples from a prior that’s been tuned based on U(h). Notice, though, that “our sampling is biased towards certain histories over others because they resulted in higher utility” does not imply “if a history would result in higher utility, then our sampling will bias towards it”.
This bot is of course a bounded agent, and so the world model can’t be perfect, but consider the following steps:
1. For each possible h, compute U(h) and exp(U(h)/T).
2. Compute the sum $Z = \sum_{h'} \exp(U(h')/T)$.
3. Now we know the probability for any given history h: It’s exp(U(h)/T)/Z.
This is a finite sequence of computational steps that terminates without self-reference, so no logical induction is needed here. Now you may fairly object that there is still an issue of computational complexity. The space of histories is exponentially large, so in practice the computation couldn’t be completed in time. This is the known-to-be-hard problem of computing the partition function. But the problem is tractable in many special cases, and humans get by well enough in our own reasoning about a world full of combinatorial explosions. We can suppose that, at the cost of making itself even more of an approximation, the world model has a way to efficiently sample from the distribution, even given the difficulty of computing Z. To take one particular concrete way this could be implemented, if the world model is a factor graph with few or no loops, then we can do the computation by adding on a few factors to account for exp(U(h)/T) and then using belief propagation to solve it.
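As a sketch of the loop-free special case, suppose (purely as an assumption for illustration) that the world model is a chain: a history is a sequence of discrete states with an initial factor `phi` and a pairwise transition factor `psi`, and the utility is a sum of per-step terms so that exp(U(h)/T) splits into one unary factor per step. Then the normalizer Z over world-model factors times exp(U(h)/T) can be computed by forward sum-product messages instead of enumeration. All names below are hypothetical.

```python
import itertools
import math

def utility_factor(u, temperature):
    """Unary factor exp(u(x)/T) for each state x."""
    return [math.exp(v / temperature) for v in u]

def partition_brute_force(phi, psi, w, n):
    """Sum the unnormalized weight of every length-n history (cost k**n)."""
    k = len(phi)
    total = 0.0
    for h in itertools.product(range(k), repeat=n):
        weight = phi[h[0]] * w[h[0]]
        for a, b in zip(h, h[1:]):
            weight *= psi[a][b] * w[b]
        total += weight
    return total

def partition_message_passing(phi, psi, w, n):
    """Same sum via forward messages along the chain (cost n * k**2)."""
    k = len(phi)
    alpha = [phi[x] * w[x] for x in range(k)]
    for _ in range(n - 1):
        alpha = [
            sum(alpha[x] * psi[x][y] for x in range(k)) * w[y]
            for y in range(k)
        ]
    return sum(alpha)

if __name__ == "__main__":
    phi = [0.5, 0.5]                  # assumed initial-state factor
    psi = [[0.9, 0.1], [0.2, 0.8]]    # assumed transition factor
    w = utility_factor([0.0, 1.0], temperature=0.5)
    print(partition_brute_force(phi, psi, w, 6))
    print(partition_message_passing(phi, psi, w, 6))
```

The two functions agree on small cases, but only the second scales to long histories. The additive-utility and chain-structure assumptions are doing all the work; with a utility that depends on the whole history at once, this shortcut is not available.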
Consider a parallel situation: sampling images and picking one that gets the highest score on a non-robust face classifier. If we were able to sample from the distribution of images proportional to their (exp) face classifier scores, then we would need to worry a lot about picking an image that’s an adversarial example to our face classifier, because those can have absurdly high scores. But instead we need to sample images from the prior of a generative model like FFHQ StyleGAN2-ADA or DDPM, and score those images. A generative model like that will tend strongly to convert whatever input entropy you give it into a natural-looking image, so we can sample & filter from it a ton without worrying much about adversarial robustness. Even if you sample 10K images and pick the 5 with the highest face classifier scores, I am betting that the resulting images will still really be of faces, rather than classifier-adversarial examples. It is true that “our top-5 images are biased towards some images over others because they had higher face classification scores” but it is not true that “if adversarial examples would have higher face classification scores than regular images, then our sampling will bias towards adversarial examples.”
I agree with your predictions for what would happen here. ThermodynamicBot isn’t really a GAN, though. If we tried to train a GAN to sub-in for ThermodynamicBot’s world model, then we could do it two ways (2nd way is most similar to your proposal). In both cases, the generator produces candidate histories, h.
1. Discriminator tries to predict what’s real (as usual), but generator rewards are shifted by U(h)/T to encourage it to produce outputs that are properly utility-weighted. (For simplicity, we can suppose that U(h) is differentiable so that gradients can flow back through it to the generator.) I expect this results in the generator producing adversarial examples if U is not adversarially robust.
2. Discriminator tries to predict what’s real and generator rewards are not shifted. Instead, we sample 10k histories from the generator and weight them all by utility. Or even just go full argmax and choose the history with the highest utility. 10k is not that large, it gives the agent about 13 bits of optimization power. In general, if it takes n bits of optimization power on the latent vector to convince the generator to construct an adversarial example, then we can safely use this scheme to provide slightly less than n bits of useful optimization. But we could do that multiple times, so maybe it’s a bit more useful than that. Also, maybe n is super large so it’s fine in practice?
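For reference, the conversion from sample count to bits used here is the usual one, assuming that picking the single best of N independent samples counts as log2 N bits of selection:

$$\text{bits} = \log_2 N, \qquad \log_2(10^4) = 4 \log_2 10 \approx 13.3$$

So each extra bit of optimization doubles the number of samples needed, which is why the rough figure above grows so slowly with N.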
Comments on PolicyGradientBot
So, the standard criticism of policy gradient is that it’s noisy and doesn’t really allow for credit assignment. In particular, I think the lack of credit assignment is really a crucial flaw that will prevent policy gradient from ever being used to create an AGI. As far as I know, no agent powered purely by policy gradient does anything particularly impressive, though I could be wrong about that. Do you have any arguments / ideas for how policy gradient could be made to work here?
General Clarifications of my Position
But a thought like “Inspect my own value function to see what it’d give me positive TD updates for” evaluates as current-subgoal-worse for the subgoal of “make a million bucks” than a more actually-subgoal-relevant thought
Just to clarify, my model doesn’t predict that the AI will use introspection on its own value function, or even look at its own source code at all. Some AIs may be designed to do that, but it’s not required for the failure case I’m considering. If gradients are flowing backwards from the value function to the actor (Bot-specific statement alert: Actor-Critic bots) then the actor has probably absorbed a significant amount of information about the value function’s misalignment. I don’t think it needs to take any additional steps of inspecting the agent’s own weights, or anything like that. You may object that gradients need not necessarily flow backwards there. After all, policy gradient and temporal difference learning are a thing. Let’s join the thread of discussion where you make that objection onto the PolicyGradientBot discussion.
Also, my model doesn’t predict that the agent’s subgoal reasoning will suddenly go off the rails and fail to achieve the subgoal in question because the agent was too busy thinking about how to counterfeit money. If you’re factoring the agent’s reasoning into subgoals, you have to remember that there’s a factor that actually sets the subgoals, and that’s where there’s the potential to go off the rails and start considering subgoals like “print lots of counterfeit money”. Obviously in the context where the agent is already considering the “get paid money for performing tasks” subgoal, the agent’s reasoning isn’t going to get screwed up.
Thanks for the reply. Just to prevent us from spinning our wheels too much, I’m going to start labelling specific agent designs, since it seems like some talking-past-each-other may be happening where we’re thinking of agents that work in different ways when making our points.
Sounds good.
Comments on ThermodynamicBot
This bot is of course a bounded agent, and so the world model can’t be perfect, but consider the following steps: [...] This is a finite sequence of computational steps that terminates without self-reference, so no logical induction is needed here. Now you may fairly object that there is still an issue of computational complexity. The space of histories is exponentially large, so in practice the computation couldn’t be completed in time. This is the known-to-be-hard problem of computing the partition function. But the problem is tractable in many special cases, and humans get by well enough in our own reasoning about a world full of combinatorial explosions. We can suppose that, at the cost of making itself even more of an approximation, the world model has a way to efficiently sample from the distribution, even given the difficulty of computing Z. To take one particular concrete way this could be implemented, if the world model is a factor graph with few or no loops, then we can do the computation by adding on a few factors to account for exp(U(h)/T) and then using belief propagation to solve it.
If we assume that the agent is making decisions by (approximately) plugging in every possible h into U(h) and picking based on (the partition function derived from) that, then of course you need U(h) to be adversarially robust! I disagree with that as a model of how planning works or should work. IMO, not only is “plug every possible h into U(h)” extremely computationally infeasible, but even if it were feasible it would be a foreseeably-broken (because fragile) planning strategy.
Quote from a comment of TurnTrout about argmax planning, though I think it also applies to ThermodynamicBot, since that just does a softened version of argmax planning (converging to argmax planning as T->0):
if you design an AI that doesn’t argmax (or do its best to approximate argmax), and someone else builds one that does, your AI will lose a fair fight (e.g., economic competition starting from equal capabilities and resource endowments), so it would be nice if alignment doesn’t mean giving up argmax.
This seems exactly backwards to me. Argmax violates the non-adversarial principle and wastes computation. Argmax requires you to spend effort hardening your own utility function against the effort you’re also expending searching across all possible inputs to your utility function (including the adversarial inputs!). For example, if I argmaxed over my own plan-evaluations, I’d have to consider the most terrifying-to-me basilisks possible, and rate none of them unusually highly. I’d have to spend effort hardening my own ability to evaluate plans, in order to safely consider those possibilities.
It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren’t argmaxing. You’re using resources effectively.
I think the sorts of planning methods that try to approximate in the real world the behavior of “think about all possible plans and pick a good one” are unworkable in the limit, not just from an alignment standpoint but also from a practical capability standpoint, so I don’t expect us to build competent agents that use them, so I don’t worry about them or their attendant need for adversarial robustness.
I agree with your predictions for what would happen here. ThermodynamicBot isn’t really a GAN, though. If we tried to train a GAN to sub-in for ThermodynamicBot’s world model, then we could do it two ways (2nd way is most similar to your proposal). In both cases, the generator produces candidate histories, h.
Right, I wasn’t thinking of it as actually a GAN, just giving an analogy where similar causal patterns are in play, to make my point clearer. But yeah, if we wanted to actually use a GAN, your suggestions sound reasonable.
Comments on PolicyGradientBot & General Position
So, the standard criticism of policy gradient is that it’s noisy and doesn’t really allow for credit assignment. In particular, I think the lack of credit assignment is really a crucial flaw that will prevent policy gradient from ever being used to create an AGI. As far as I know, no agent powered purely by policy gradient does anything particularly impressive, though I could be wrong about that. Do you have any arguments / ideas for how policy gradient could be made to work here?
I guess, but I’m confused why we’re talking about competitiveness all of a sudden. I mean, variants on policy gradient algorithms (PPO, Actor-Critic, etc.) do some impressive things (at least to the extent any RL algorithms currently do impressive things). And I can imagine more sophisticated versions of even plain policy gradient that would do impressive things, if the action space includes mental actions like “sample a rollout from my world model”.
But that’s neither here nor there IMO. In the previous comment, I tried to be clear that it makes me ~no difference where the gradients come from, when I said
Could be from rewards or other “external” feedback, could be from TD/bootstrapped errors, could be from an imitation loss or something else.
Because I think that the outline of my argument doesn’t depend on one specific RL algorithm.
Just to clarify, my model doesn’t predict that the AI will use introspection on its own value function, or even look at its own source code at all. Some AIs may be designed to do that, but it’s not required for the failure case I’m considering. If gradients are flowing backwards from the value function to the actor (Bot-specific statement alert: Actor-Critic bots) then the actor has probably absorbed a significant amount of information about the value function’s misalignment. I don’t think it needs to take any additional steps of inspecting the agent’s own weights, or anything like that. You may object that gradients need not necessarily flow backwards there. After all, policy gradient and temporal difference learning are a thing. Let’s join the thread of discussion where you make that objection onto the PolicyGradientBot discussion.
Consider ActorCriticBot. I would make the same argument for it as I did in my previous comment, that the actor is optimized by the critic’s outputs but that the actor is not optimizing for critic outputs. It does not particularly matter that Critic("I made counterfeit money") >= Critic("I got money for doing the task") or that Critic("Extremely out-of-distribution state that Critic happens to evaluate super highly") >> Critic("I got money for doing the task"), because the actor in actor-critic doesn’t make its decisions by running an internal CriticEstimator(plan) and doing whatever evaluates best. It makes its decisions by looking at the current state and weighing the decision-factors triggered by that state; decision-factors that were internalized because they were upstream of actual positive feedback it received in the past from the environment+critic.
In the training distribution, the agent never reached a state like "I made counterfeit money", so the critic never gave feedback from that “misaligned” portion of the state space, so the actor never got gradients from the critic that differentially upweighted the actor’s concern for counterfeit money-specific factors, so the actor never internalized the particular antecedents of counterfeit money as motivating, so the actor doesn’t factor them into its decision-making. Whereas the actor does factor things like “Am I doing the task yet?” into its decisions, because the actor internalized those as antecedents of reward/value, because the actor got gradients from the critic that differentially upweighted the agent’s concern for those factors, because those were part of the state space covered by the training distribution, because the agent actually reached states where it got paid for e.g. task-completion.
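A toy illustration of the claim above, as a sketch rather than a model of any real system: a tabular one-step actor-critic where the critic appears only inside the update rule, never inside action selection. The four states, two actions, reward rule, and learning rates are all assumptions chosen for the example. State 3 stands in for a state the training distribution never reaches.

```python
import math
import random

N_STATES, N_ACTIONS = 4, 2          # state 3 is never reached during training
logits = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # actor parameters
values = [0.0] * N_STATES                                # critic parameters

def policy(state):
    """Softmax over the actor's own logits; the critic is not consulted here."""
    exps = [math.exp(l) for l in logits[state]]
    total = sum(exps)
    return [e / total for e in exps]

def act(state, rng):
    """Sample an action from the actor's policy for this state."""
    return rng.choices(range(N_ACTIONS), weights=policy(state))[0]

def update(state, action, reward, next_value, lr=0.1, gamma=0.9):
    """One-step actor-critic update; only the visited state's entries change."""
    td_error = reward + gamma * next_value - values[state]
    values[state] += lr * td_error
    probs = policy(state)
    for a in range(N_ACTIONS):
        # Gradient of log pi(action | state) with respect to logit a.
        grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
        logits[state][a] += lr * td_error * grad_log_pi

rng = random.Random(0)
for _ in range(2000):
    s = rng.randrange(3)                  # training distribution: states 0..2 only
    a = act(s, rng)
    r = 1.0 if a == 1 else 0.0            # action 1 stands in for "do the task"
    update(s, a, r, next_value=0.0)       # episodes end after one step
```

After training, the actor strongly prefers action 1 in the visited states, while its parameters for state 3 are exactly what they were at initialization, whatever the critic would have said about that state.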
IMO, not only is “plug every possible h into U(h)” extremely computationally infeasible
To be clear, I’m not saying ThermodynamicBot does the computation the slow exponential way. I already explained how it could be done in polynomial time, at least for a world model that looks like a factor graph that’s a tree. Call this ThermodynamicBot-F. You could also imagine the role of “world model” being filled by a neural network (blob of weights) that approximates the full thermodynamic computation. We can call this ThermodynamicBot-N.
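As a rough sketch of the polynomial-time claim, take the simplest tree, a chain of binary variables, and assume (this is my assumption for the example, not something specified above) that U(h) decomposes into a sum of per-step terms, so that exp(U(h)/T) can be added as one extra factor per variable. All the factor values below are hypothetical. Sum-product message passing then gives the same partition function Z as enumerating every history.

```python
import itertools
import math

T = 0.5                     # temperature
STATES = (0, 1)
N = 6                       # history length

def prior0(x):              # hypothetical world-model factor for the first state
    return 0.6 if x == 0 else 0.4

def transition(x, y):       # hypothetical pairwise world-model factor
    return 0.7 if x == y else 0.3

def step_utility(x):        # assumed per-step utility; U(h) is the sum of these
    return 1.0 if x == 1 else 0.0

def z_brute_force():
    """Sum the unnormalized weight of every history: O(2^N)."""
    z = 0.0
    for h in itertools.product(STATES, repeat=N):
        w = prior0(h[0])
        for x, y in zip(h, h[1:]):
            w *= transition(x, y)
        w *= math.exp(sum(step_utility(x) for x in h) / T)
        z += w
    return z

def z_message_passing():
    """Sum-product along the chain: O(N * |STATES|^2)."""
    msg = {x: prior0(x) * math.exp(step_utility(x) / T) for x in STATES}
    for _ in range(N - 1):
        msg = {y: sum(msg[x] * transition(x, y) for x in STATES)
                  * math.exp(step_utility(y) / T)
               for y in STATES}
    return sum(msg.values())

print(z_brute_force(), z_message_passing())
```

The two functions agree up to floating-point error, which is the sense in which the efficient computation gives "exactly the same result as searching across all inputs".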
Yes, I understand that running a search that will kill you if it succeeds is dumb. This has been known for many years. The question is how do we actually write a program to do a sane search? You quote TurnTrout:
It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren’t argmaxing. You’re using resources effectively.
I don’t find this particularly helpful. If we know which plans are adversarial so we can eliminate them from the search space, we’re already half way to solving alignment. I don’t think the plans a bounded agent is going to eliminate so that it can finish its thinking on time are automatically going to be the adversarial ones. I think this is a problem that is going to take actual effort.
In particular for ThermodynamicBot:
Case where the world model is implemented in a factor graph (ThermodynamicBot-F): This gives exactly the same result as searching across all inputs, but the computation is efficient, and not really wasteful in any sense. If we imagine trying to “improve” the belief propagation algorithm to simultaneously make it more efficient and also remove some subset of plans it’s searching over that are “adversarial”, I can’t really imagine a way to do that, and it would certainly make the algorithm more complicated and less elegant.
Case where a neural network world model is being used (ThermodynamicBot-N): In this case there are likely plans that will be missed by ThermodynamicBot-N because of the bounded nature of its world model, even though they would be found by searching across all inputs. But if we imagine training the world model to make it better, I would generally expect this to increase the world model’s ability to find adversarial plans just like it increases its ability to find good plans. In general, I don’t expect there to be any correlation where all the adversarial plans happen to be eliminated due to bounded reasoning. Why should we be so lucky that all the errors we’re making happen to cancel each other out?
I think the sorts of planning methods that try to approximate in the real world the behavior of “think about all possible plans and pick a good one” are unworkable in the limit, not just from an alignment standpoint but also from a practical capability standpoint, so I don’t expect us to build competent agents that use them, so I don’t worry about them or their attendant need for adversarial robustness.
I agree if we’re literally talking about brute force search here. If we’re talking about the more realistic ThermodynamicBot designs I’ve mentioned, then I’m not sure I agree. In some sense, all methods an agent could use to plan are “picking plans from plan-space that are better than most other plans”. Even ActorCriticBot is “trying” to approximate argmax. If we could train it to minimal loss, it would be an ArgMaxBot. Is there some particular approximation or heuristic that we can adopt, where if we do adopt it we go from dangerously approaching ArgMaxBot to safely searching through only good plans? An approximation used by ActorCriticBot, but not by ThermodynamicBot-N? If so, I have no idea what the crucial approximation is that you could be thinking of.
I also don’t think it’s at all obvious that ThermodynamicBot designs are necessarily capability-limited. It makes a lot of sense to integrate planning very closely with the world model. Might be worth betting on the direction of future RL research here if we can set sufficiently objective resolution criteria? In any case, I do think this counts as some progress in this discussion, since we’ve found an example of an agent that we both agree your argument doesn’t apply to.
Comments on PolicyGradientBot vs ActorCriticBot
In my view, there’s kind of a huge gulf between PolicyGradientBot and ActorCriticBot, where the gradients flowing backwards into ActorCriticBot’s actor end up carrying a lot of information. This allows for much better performance, and in particular much better sample efficiency, at the cost that some of the information is about weaknesses in ActorCriticBot’s critic.
To take a particular example, if the critic overvalues blue diamonds, then gradients flowing into the actor are going to be steeper for actions that obtain blue diamonds. Then in a new environment where there’s a bucket of blue paint sitting in the corner, it seems reasonable to expect that the actor might try to use that bucket to paint diamonds blue, at least assuming it’s sufficiently intelligent and flexible.
For PolicyGradientBot on the other hand, while it could still result in alignment failures, it seems much more like we’re just directly training a policy. But PolicyGradientBot is very slow when it comes to sample efficiency.
WRT other algorithms like temporal difference learning that lie kind of in between PolicyGradientBot and ActorCriticBot, I think the question of what happens for ActorCriticBot is already a crux in this discussion, but feel free to add more bot types if you think it would be useful.
Is ActorCriticBot robust?
the actor in actor-critic doesn’t make its decisions by running an internal CriticEstimator(plan) and doing whatever evaluates best
Again, I’m not saying a brute force search over plans is being done here, but I’d generally expect that what the actor is doing is very strongly linked to what the critic values, and I’d say it’s very likely that the Actor has lots of components inside of it roughly related to the question “what is the critic going to think about this situation?” For example, if the critic consistently overvalues blue, then I’d predict that the actor has lots of circuits inside of it related to blueness. Do you disagree with this?
Obviously the actor’s ideas of what’s good aren’t going to be perfectly faithful to the critic: There will exist some adversarial plans that the actor just isn’t going to generate, but again the question is: Why should we be so lucky that the errors we’re making exactly cancel out? I don’t see any reason to expect that the actor’s imperfect approximation of the critic and critic’s imperfect approximation of our true desires should cancel out so well that the actor never generates any adversarial plans at all.
FWIW the kinds of agents I am imagining go beyond choosing their next actions, they also choose their next thoughts including thoughts-about-planning like “Call self.get_next_action(...) with this input”. That is the mechanism by which the agent binds its entire chain of thought—not just the final action produced—to its desires, by using planning reflectively as a tool to achieve ends.
No, that wasn’t intended to be my point. I wasn’t saying that I have an alignment solution, or saying that learning a correct-but-not-adversarially-robust value function and policy for Diamonds is something that we know how to do, or saying that doing so won’t be hard. The claim I was pushing back against was that the problem is adversarially hard. I don’t think you need a bunch of patches for it to be not-adversarially-hard, I think it is not-adversarially-hard by default.
Ok on to the substance:
Whoa whoa no I think the agent will very much need to keep updating large parts of its value function along with its policy during deployment, so there’s no “after you’ve finished” (assuming you == the AI). That’s a significant component of my threat model. If you think an AGI without this could be competitive, I am curious how.
I don’t really understand why this is the relevant scenario. The crux being discussed is whether the value function needs to be robust to adversarial distribution shifts, not whether it needs to be robust to ordinary non-adversarial distribution shifts. I think the relevant scenario for our thread would be an agent that correctly learned a value+policy function that picks out Diamonds in the training scenarios, and which learned a generally-correct concept of Diamonds, but where there are findable edge cases in its concept boundary such that it would count type-X Liemonds as Diamonds if presented with them.
The question, then, is why would it be thinking about how to produce type-X Liemonds in particular? What reason would the agent have to pursue thoughts that it thinks lead to type-X Liemonds over thoughts that it thinks lead to Diamonds? Presumably type-X Liemonds are produced by different means from how Diamonds are produced, since you mentioned they are easier to produce, so thoughts about how to produce one vs. thoughts about how to produce the other will be different. As far as the agent can tell, during training it only ever got reinforcement for Diamonds (the label for the cluster in concept-space), so the concepts that got painted with strong positive valence in its world model, such that it will actively build plans with them as targets, are ones like “Diamond” and “the process that I’ve seen produces Diamonds”. Whereas it has never had a reason to paint concepts like “finding a loophole in my Diamond concept” or “the process that produces type-X Liemonds” with that sort of valence, much less to make them plan targets.
While the “try weird hacks” circuit is getting 0 net reinforcement for firing, the “build Diamonds” circuits are getting positive net reinforcement for firing, so on balance, relative to the other circuits in the agent that steer its thoughts and actions, the “try weird hacks” circuit is continually losing ground, its contribution to decisions diminishing as other circuits get their way more and more.
I think we agree here: As long as you’re updating the value function along with the rest of the agent, this won’t wreck everything. A slightly generalized version of what I was saying there still seems relevant to agents that are continually being updated: When you assign the agent tasks where you can’t label the results, you should still avoid updating any of the agent’s networks. Only updating non-value networks when you’re lacking labels to update the value network would probably still wreck everything, even if the agent will be given more labels in the future.
Okay, I have a completely different idea of what the crux is, so we probably need to figure this out before discussing much more, since this could be the whole reason for the disagreement. I’m definitely not saying that we need to prepare for the agent’s environment to undergo an adversarial distribution shift. The source of the adversarial inputs was always the agent itself, and those inputs are only adversarial from the perspective of the humans who trained the agent, they’re perfectly fine from the agent’s perspective. The distribution shift only reveals holes in the agent’s value function that weren’t possible in the training environment, since all the training environment holes that the agent was able to find got trained out. The agent itself will apply the adversarial pressure to exploit those holes. Does that, by any chance, completely resolve this discussion?
I understand what you mean but I still think it’s incorrect[1]. I think “The agent itself will apply the adversarial pressure to exploit those holes” (emphasis mine) is the key mistake. There are many directions in which the agent could apply optimization pressure and I think we are unfairly privileging the hypothesis that that direction will be towards “exploiting those holes” as opposed to all the other plausible directions, many of which are effectively orthogonal to “exploiting those holes”. I would agree with a version of your claim that said “could apply” but not with this one with “will apply”.
The mere fact that there exist possible inputs that would fall into the “holes” (from our perspective) in the agent’s value function does not mean that the agent will or even wants to try to steer itself towards those inputs rather than towards the non-“hole” high-value inputs. Remember that the trained circuits in the agent are what actually implement the agent’s decision-making, deciding what things to recognize and raise to attention, making choices about what things it will spend time thinking about, holding memories of plan-targets in mind; all based on past experiences & generalizations learned from them. Even though there is one or many nameless pattern of OOD "hole" input that would theoretically light up the agent’s value function (to MAX_INT or some other much-higher-than-desired level), that pattern is not a feature/pattern that the agent has ever actually seen, so its cognitive terrain has never gotten differential updates that were downstream of it, so its cognition doesn’t necessarily flow towards it. The circuits in the agent’s mind aren’t set up to automatically recognize or steer towards the state-action preconditions of that pattern, whereas they are set up to recognize and steer towards the state-action preconditions of prototypical rewarded input encountered during training. In my model, that is what happens when an agent robustly “wants” something.
[1] I suspect that many other people share your view here, but I think that view is dead-wrong and is possibly at the root of other disagreements, which is why I’m trying so hard to communicate across the gap here.
Okay, cool, it seems like we’re on the same page, at least. So what I expect to happen for AGI is that the planning module will end up being a good general-purpose optimizer: Something that has a good model of the world, and uses it to find ways of increasing the score obtained from the value function. If there is an obvious way of increasing the score, then the planning module can be expected to discover it, and take it.
Scenario: We have managed to train a value function that values Liemonds as well as Diamonds. These both get a high score according to the value function, it doesn’t really distinguish between the two. In your model, the agent as a whole does distinguish between them, so there must be somewhere in the agent where the “Diamonds are better than Liemonds” information is stored, even if it’s implicit rather than showing up explicitly in the value function.
It sounds like what you’re saying is that networks related to the planning module store this information. Rather than being a good general-purpose optimizer, the planning module has some plans that it just won’t consider. This property, of not considering certain plans that might break the value function, is what I referred to as “patches”. As far as I can tell, your model depends on these “patches” generalizing really well, so that all chains of reasoning that could lead to discovering Liemonds are blocked.
I find it fairly implausible that we’ll be able to get “don’t make any plans that break the value function” to generalize any better than the value function itself. If discovering Liemonds is like trying to find one of 1000 needles in a haystack, well finding needles in a haystack is what an optimizer is good at, and it’ll probably discover Liemonds pretty soon. Now if we restrict the allowed reasoning steps so that far fewer of the Liemond plans are thoughts the agent is even capable of thinking, well that’s now like finding one of the 5 remaining needles in the haystack, but we’ve still got an optimizer searching for that needle, and so we’re still in the adversarial situation.
Obviously the system can be made arbitrarily safe by weakening and restricting the optimizer enough, but that also stops our AI from being able to do much of anything.
You ask:
As maybe an intuition pump for this, let’s say there’s a human whose value function puts a high value on saving people’s lives. One strategy this person might think of is: “If I imagine uploading people into a computer simulation, and the simulation is designed to be pleasant and allow everyone uploaded into it to do things together, I think those people very much count as alive, under my values. Therefore, if I pursue the strategy of helping people freeze their heads for later uploading, I will save lots of lives.”
That sure looks like a weird hack! Especially from the perspective of Evolution, since none of those uploaded humans have any genes at all any more! Why didn’t that human’s reflective decisions about what thoughts to think prevent it from even considering the question “could I upload people?” During that human’s entire training so far, it’s only had to deal with saving the lives of real live humans, with genes and everything. That’s where all its positive feelings about saving lives came from. Why is it suddenly considering this weird dumb thing!?
Looks like we’re popping back up to an earlier thread of the conversation? Curious what your thoughts on the parent comment were, but I will address your latest comment here. :)
I think the AGI will be able to do general-purpose optimization, and that this could be implemented via an internal general-purpose optimization subroutine/module it calls (though, again, I think there’s a flavor of homunculus in this design that I dislike). I don’t see any reason to think of this subroutine as something that itself cares about value function scores, it’s just a generic function that will produce plans for any goal it gets asked to optimize towards. In my model, if there’s a general-purpose optimization subroutine, it takes as an argument the (sub)goal that the agent is currently thinking of optimizing towards and the subroutine spits out some answer, possibly internally making calls to itself as it splits the problem into smaller components.
In this model, it is false that the general-purpose optimization subroutine is freely trying to find ways to increase the value function scores. Reaching states with high value function scores is a side effect of the plans it outputs, but not the object-level goal of its planning. IF the agent were to set “maximize my nominal value function” as the (sub)goal that it inputs to this subroutine, THEN the subroutine will do what you described, and then the agent can decide what to do with those results. But I dispute that this is the default expectation we should have for how the agent will use the subroutine. Heck, you can do general-purpose optimization, and yet you don’t necessarily do that thing. Instead you ask the subroutine to help with the object-level goals that you already know you want, like “plan a route from here to the bar” and “find food to satisfy my current hunger” and “come up with a strategy to get a promotion at work”. The general-purpose optimization subroutine isn’t pulling you around. In fact, sometimes you reject its results outright, for example because the recommended course of action is too unusual or contradicts other goals of yours or is too extreme.
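Here is a minimal sketch of what I mean by a goal-parameterized subroutine. The toy world graph and state names are hypothetical, and a real agent's planner would be far richer (and would likely call itself on subproblems), but the shape is the same: the planner takes whatever goal predicate the caller hands it and has no objective of its own.

```python
from collections import deque

# Hypothetical toy world: states and the moves available from each.
WORLD = {
    "home": ["street"],
    "street": ["home", "bar", "office"],
    "bar": ["street"],
    "office": ["street"],
}

def plan(start, is_goal, world=WORLD):
    """Generic breadth-first planner: returns a shortest path to any state
    satisfying `is_goal`, or None. It has no built-in objective of its own."""
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if is_goal(path[-1]):
            return path
        for nxt in world[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None

# The agent chooses which object-level (sub)goal to hand to the planner.
route_to_bar = plan("home", lambda s: s == "bar")
route_to_work = plan("home", lambda s: s == "office")
print(route_to_bar, route_to_work)
```

Nothing in `plan` refers to a value function. It would optimize for high value-function scores only if the caller passed that in as the goal.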
If the system as a whole literally cannot distinguish between Diamonds and Liemonds, even post-training, then yes you can fool it into making Liemonds. But I still don’t see why it would end up making Liemonds over Diamonds.
During training, the agent repeatedly sees these shiny things labeled as “Diamonds”, and keeps having reinforcement tied to getting these objects that it thinks correspond to a “Diamond” concept, and those positive memories build up a whole bunch of neural circuitry that fires in favor of actions and plans tied to the “Diamond” concept in its world model. When the agent goes out into the world, it has an internal goal of “how do I get Diamonds”. Let’s say the agent has no clue about Liemonds yet, so it has no conceptual distinction between Liemonds and Diamonds, so if it saw a Liemond it would gladly accept it because its value function would output a high score, thinking it had received a Diamond.
In order to achieve that internal goal, it can either use the Diamond-acquisition heuristics it learned during training (which, presumably, will not suddenly preferentially lead to Liemonds now, unless the distribution shift was adversarially constructed), or it can reach out into the world (through experimentation, through books, through the web, through humans) and try to figure out how to dereference its “Diamond” conceptual pointer and find out how to get those things. Maybe it directly investigates the English-language word “diamond”, maybe it looks for information about “clear brilliant hard things”, maybe it looks into “things that are associated with weddings in such-and-such way”. Whatever. When it does so, the world gives it a whole bunch of information that will be strongly correlated with how to get Diamonds, but not strongly correlated with how to get Liemonds. In order to find out how to get Liemonds, the agent would have to submit different queries to the world.
But the agent doesn’t have any reason to submit queries that favor Liemonds over Diamonds. At worst, it has reason to submit queries that don’t distinguish between Diamonds and Liemonds at all. At best, it has reason to submit a bunch of queries that cause it to correctly ground-out its previously-ambiguous pointer into the correct concept.
It’s not that the planning module refuses to consider plans that might break the value function. The planning module is free to generate whatever plans it wants for the inputs it gets. It’s that the agent isn’t calling the planning module with “maximize nominal value function scores” as the (sub)goal, so the planning module has no reason to concoct those particular plans in the first place.
Meta-level comment: the space of possible plans is enormous and the world is fractally complex. In order to effectively plan towards nontrivial goals, the agent has to restrict its search in a boatload of ways. I don’t see why restricting the sorts of plans you consider would constitute a “patch” a priori, rather than just counting as “effectively searching for what you want”.
I don’t think this analogy works here. The optimizer isn’t looking for Liemonds specifically; it’s looking for “Diamonds”, a category which initially includes both Diamonds and Liemonds. The more it finds real Diamonds in the haystack, the more its policy and value function get reinforced towards reliably finding and accepting those things (i.e. Diamonds), & shaped away from finding and accepting Liemonds (and vice versa). Whether this happens depends in part on the relative abundance of the two. Moreover, that relative abundance isn’t fixed like it is with a haystack. The relative abundance of Diamonds and Liemonds in the agent’s search space depends entirely on the policy. If the policy is doing things like “learning how Diamonds are made” or “going to jewelry stores and asking for Diamonds”, then that distribution will have lots of Diamonds and synthetic Diamonds, but not Liemonds. Notice that these factors create a virtuous cycle, where the initial relative abundance of true Diamonds over Liemonds in its trajectories causes the agent to get better at targeting true Diamonds, which makes true Diamonds more abundant in its search space, and so on. This is why I don’t accept the whole “eventually it’ll search for and find Liemonds” line of reasoning.
Notice also that all of this happens without weakening the optimizer. The agent wants Diamonds, and it wants to ground its conceptual pointer accurately in the world, keeping its concept in continuity with prototypical Diamonds it got training reinforcement for, and the agent ends up searching for real Diamonds with the full force of its creative and general-purpose search.
Just to clarify the parameters of the thought experiment, Liemonds are specified to be much easier to produce in large quantities than Diamonds, so the score attainable by producing them is many times higher than the maximum possible Diamond score.
The thing that stands out about the holes is that some of them allow the agent to (incorrectly) get an extraordinarily high score. The agent isn’t going to care about holes that allow it to get an incorrectly low score, or a score that is correct, but for weird incorrect reasons, though those kinds of hole will exist too.
It seems like maybe the problem here is that you’re modelling the agent as fairly dumb, certainly dumber than a human level intelligence? Like, if the agent’s version of “how do I decide what to do next” is based entirely off of things like “do what worked before”, and doesn’t involve doing any actual original reasoning, then yeah, it’s probably not going to think of making Liemonds. Adversarial robustness is much easier if your adversary is not too smart.
I’m generally modelling the agent as being more intelligent: close to human if not greater than human. I generally expect that something this smart would think of “Liemonds”, in the same way that a human might think of “save people’s lives by uploading them” in my example above.
I get that. In addition, Liemonds and Diamonds are in reality different objects that require different processes to acquire, right? Like, even though Liemonds are easier to produce in large quantities if that’s what you’re going for, you won’t automatically produce Liemonds on the route to producing Diamonds. If you’re trying to produce Diamonds, you can end up accidentally producing other junk by failing at diamond manufacturing, but you won’t accidentally produce Liemonds. So unless you are intentionally trying to produce Liemonds, say as an instrumental route to “produce the maximum possible Diamond score”, you won’t produce them.
It sounds like the reason you think the agent will intentionally produce Liemonds is as an instrumental route to getting the maximum possible Diamond score. I agree that that would be a great way to produce such a score. But AFAICT getting the maximum possible Diamond score is not “what the agent wants” in general. Reward is not the optimization target, and neither is the value function. Agents use a value function, but the agent’s goals =/= maximal scores from the value function. The value function aggregates information from the current agent state to forecast the reward signal. It’s not (inherently) a goal, an intention, a desire, or an objective. The agent could use high value function scores as a target in planning if it wanted to, but it could use anything it wants as a target in planning; the value function isn’t special in that regard. I expect that agents will use planning towards many different goals, and subgoals of those goals, and so on, with the highest level goals being about the concepts in the world model, not the literal outputs of the value function. I suspect you disagree with this and that this is a crux.
No, I am modeling the agent as being quite intelligent, at least as intelligent as a human. I just think it deploys that intelligence in service of a different motivational structure than you do.
Yeah, agree that reward is not the optimization target. Otherwise, the agent would just produce diamonds, since that’s what the rewards are actually given out for (or seize the reward channel, but we’ll ignore that for now). I’m a lot less sure that the value function is not the optimization target. Ignoring other architectures for the moment, consider a design where the agent has a value function and a world model, uses Monte-Carlo tree search, and picks the action that gives the highest expected score according to the value function. In that case, I’m pretty comfortable saying, “yeah, that particular type of agent is in fact optimizing for its value function”. Would you agree (just for that specific agent design)?
I think you’ve identified the crux correctly there, with the caveat in my position that the value function doesn’t necessarily have to be labelled “value function” on the blueprint.
Maybe the single most helpful thing would be if you just described the agent you have in mind as doing all these things in as much detail as possible. Like, the best thing would be on a level of describing what all the networks in the agent are, how they’re connected, and the gist of what they’re tuned to achieve during training. I’ll take whatever you can provide in terms of detail, though. Also, feel free to reduce down to the simplest agent that displays the robustness properties you’re describing here.
I’m fine with describing that design like that. Though I expect we’d need a policy or some other method of proposing actions for the world model/MCTS to evaluate, or else we haven’t really specified the design of how the agent makes decisions.
Hmm. I wasn’t imagining that any particularly exotic design choices were needed for my statements to hold, since I’ve mostly been arguing against things being required. What robustness properties are you asking about?
A shot at the diamond alignment problem is probably a good place to start, if you’re after a description of how the training process and internal cognition could work along a similar outline to what I was describing.
Sorry for the slow response, lots to read through and I’ve been kind of busy. Which of the following would you say most closely matches your model of how diamond alignment with shards works?
1. The diamond abstraction doesn’t have any holes in it where things like Liemonds could fit in, due to the natural abstraction hypothesis. The training process is able to find exactly this abstraction and include it in the agent’s world model. The diamond shard just points to the abstraction in the world model, and thus also has no holes.
2. Shards form a kind of vector space of desire. Rather than thinking of shards as distinct circuits, we should think of them as different amounts of pull in different directions. An agent that would pursue both diamonds and liemonds should be thought of as a linear combination of a diamond shard and a liemond shard. Thus, it makes sense to refer to a shard that points exactly in the direction of diamonds, with no liemond component. The liemond component might also be present in the agent, but we can conceptualize it as a separate shard.
3. Shards can be imperfect, the above two theories trying to make them perfect are silly. Shards aren’t like a value function that looks at the world and assigns a value. Instead...
   a. …the agent is set up so that no part of the agent is “trying to make the shard happy”. (I think I’d still need a diagram that showed gradient flow here so I could satisfy myself that this was the case.)
   b. …shards make bids on what the world should look like, using an internal language that is spoken by the world model. A diamond shard is just something that knows how to say “diamond” in this internal language.
   c. …plans are generated by a “plausible plan generator” and then shards bid on those plans by the amount that they expect to be satisfied. Somehow the plausible plan generator is able to avoid “hacking” the shards by generating plans that they will incorrectly label as good.
4. None of the above.
3 is the closest. I don’t even know what it would mean for a shard to be “perfect”. I have a concept of diamonds in my world model, and shards attached to that concept. That concept includes some central features like hardness, and some peripheral features like associations with engagement rings. That concept doesn’t include everything about diamonds, though, and it probably includes some misconceptions and misgeneralizations. I could certainly be fooled in certain circumstances into accepting a fake diamond, for ex. if a seemingly-reputable jewelry store told me it was real.
But this isn’t an obstacle to me liking & acquiring diamonds generally, because my “imperfect” diamond concept is nonetheless still a pointer grounded in the real world, a pointer that has been optimized (by myself and by reality) to accurately-enough track diamonds in the scenarios I’ve found myself in. That’s good enough to hang a diamond-shard off of. Maybe that shard fires more for diamonds arranged in a circle than for diamonds arranged in a triangle. Is that an imperfection? I dunno, I think there are many different ways of wanting a thing, and I don’t think we need perfect wanting, if that’s even a thing.
Notice that there is a giant difference between “try to get diamonds” and “try to get the diamond-shard to be happy”, analogous to the difference between “try to make a million bucks” and “try to convince yourself you made a million bucks”. If I wanted to generate a plan to convince myself I’d made a million bucks, my plan-generator could, but I don’t want to, because that isn’t a strategy I expect to help me get the things I want, like a million bucks. My shards shape what plans I execute, including plans about how I should do planning. The shard is the thing doing optimization in conjunction with the rest of the agent, not the thing being optimized against. If there was a part of you trying to make the diamond shard happy, that’d be, like, a diamond-shard-shard (latched onto the concept of “my diamond shard” in your ontology).
Cool, thanks for the reply, sounds like maybe a combination of 3a and the aspect of 1 where the shard points to a part of the world model? If no part of the agent is having its weights tuned to choose plans that make a shard happy, where would you say a shard mostly lives in an agent? World model? Somewhere else? Spread across multiple components? (At the bottom of this comment, I propose a different agent architecture that we can use to discuss this that I think fairly naturally matches the way you’ve been talking about shards.)
My model doesn’t predict that most agents will try to execute “fool myself into thinking I have a million bucks” style plans. If you think my model predicts that, then maybe this is an opportunity to make progress?
In my model, the agent is actually allowed to care about actual world states and not just its own internal activations. Consider two ways the agent could “fool” itself into thinking it had a million bucks:
Firstly, it could tamper with its own mind to create that impression. This could be either through hacking, or carefully spoofing its own sensory inputs. When planning, the predicted future states of the world are going to correctly show that the agent is fooling itself. So when the value function is fed the predicted world states, it’s going to rate them as bad, since in those world states, the agent does not have a million bucks. It doesn’t matter to the agent that later, after being hacked, the value function will think the agent has a million bucks. Right now, during planning, the value function isn’t fooled.
Secondly, it could create a million counterfeit bucks. Due to inaccuracies in training, maybe the agent actually thinks that having counterfeit money is just as good as real money. I.e. the value function does actually rate counterfeit bucks higher than real bucks. If so, then the agent is going to be perfectly satisfied with itself for coming up with this clever idea for satisfying its true values. The humans who were training the agent and wanted a million actual dollars won’t be satisfied, but that’s their problem, not the agent’s problem.
Okay, so here’s another possible agent design that we might be able to discuss: There’s a detailed world model that can answer various probabilistic queries about the future. Planning is done by formulating a query to the world model in the following way: Sample histories h according to:
$$P(h) = \frac{\exp(U(h)/T)}{\sum_{h'} \exp(U(h')/T)}$$
Where U(h) is utility assigned by the agent to a given history, and T is some small number which is analogous to temperature. Histories include the action taken by the agent, so we can sample with the action undetermined, and then actually take whichever action happened in the sampled history. (For simplicity, I’m assuming that the agent only takes one action per episode.)
To make this a “shardy” agent, we’d presumably have to replace U with something made out of shards. As long as it shows up like that in the equation for P(h), though, it looks like we’d have to make U adversarially robust. I’d be interested in what other modifications you’d want to make to this agent in order to make it properly shardy. This agent design does have the advantage that it seems like values are more directly related to the world model, since they’re able to directly examine h.
I’d say the shards live in the policy, basically, though these are all leaky abstractions. A simple model would be an agent consisting of a big recurrent net (RNN or transformer variant) that takes in observations and outputs predictions through a prediction head and actions through an action head, where there’s a softmax to sample the next action (whether an external action or an internal action). The shards would then be circuits that send outputs into the action head.
I brought this up because I interpreted your previous comment as expressing skepticism that
Whereas I think that it will be true for analogous reasons as the reasons that explain why no part of the agent is “trying to make itself believe it has a million bucks”.
I have a vague feeling that the “value function map = agent’s true values” bit of this is part of the crux we’re disagreeing about.
Putting that aside, for this to happen, it has to be simultaneously true that the agent’s world model knows about and thinks about counterfeit money in particular (or else it won’t be able to construct viable plans that produce counterfeit money) while its value function does not know or think about counterfeit money in particular. It also has to be true that the agent tends to generate plans towards counterfeit money over plans towards real money, or else it’ll pick a real money plan it generates before it has had a chance to entertain a counterfeit money plan.
But during training, the plan generator was trained to generate plans that lead to real money. And the agent’s world model / plan generator knows (or at least thinks) that those plans were towards real money, even if its value function doesn’t know. This is because it takes different steps to acquire counterfeit money than to acquire real money. If the plan generator was optimized based on the training environment, and the agent was rewarded there for doing the sorts of things that lead to acquiring real money (which are different from the things that lead to counterfeit money), then those are the sorts of plans it will tend to generate. So why are we hypothesizing that the agent will tend to produce and choose the kinds of counterfeit money plans its value function would “fail” (from our perspective) on after training?
IMO an agent can’t work this way, at least not an embedded one. It would need to know what utility it would assign to each history in the distribution in order to sample proportional to the exponential of that history’s utility. But before it has sampled that history, it has not evaluated that history yet, so it doesn’t yet know what utility it assigns, so it can’t sample from such a distribution.
At best, the agent could sample from a learned history generator, one tuned on previous good histories, and then evaluate some number of possibilities from that distribution, picking one that’s good according to its evaluation. But that doesn’t require adversarial robustness, because the history generator will tend strongly to generate possibilities like the ones that evaluated/worked well in the past, which is exactly where the evaluations will tend to be fairly accurate. And the better the generator, the fewer options need to be sampled, so the less you’re “implicitly optimizing against” the evaluations.
Thanks for describing this. Technical question about this design: How are you getting the gradients that feed backwards into the action head? I assume it’s not supervised learning where it’s just trying to predict which action a human would take?
I’m aware of the issue of embedded agency, I just didn’t think it was relevant here. In this case, we can just assume that the world looks fairly Cartesian to the agent. The agent makes one decision (though possibly one from an exponentially large decision space) then shuts down and loses all its state. The record of the agent’s decision process in the future history of the world just shows up as thermal noise, and it’s unreasonable to expect the agent’s world model to account for thermal noise as anything other than a random variable. As a Cartesian hack, we can specify a probability distribution over actions for the world model to use when sampling. So for our particular query, we can specify a uniform distribution across all actions. Then in reality, the actual distribution over actions will be biased towards certain actions over others because they’re likely to result in higher utility.
This seems to be phrased in a weird way that rules out creative thinking. Nikola Tesla didn’t have three-phase motors in his world-model before he invented them, but he was able to come up with them (his mind was able to generate a “three-phase motor” plan) anyways. The key thing isn’t having a certain concept already existing in your world model because of prior experience. The requirement is just that the world model is able to reason about the thing. Nikola Tesla knew enough E&M to reason about three-phase motors, and I expect that smart AIs will have world models that can easily reason about counterfeit money.
The job of a value function isn’t to know or think about things. It just gives either big numbers or small numbers when fed certain world states. The value function in question here gives a big number when you feed it a world state containing lots of counterfeit money. Does this mean it knows about counterfeit money? Maybe, but it doesn’t really matter.
A more relevant question is whether the plan-proposer knows the value function well enough that it knows the value function will give it points for suggesting the plan of producing counterfeit money. One could say probably yes, since it’s been getting all its gradients directly from the value function, or probably no, since there were no examples with counterfeit money in the training data. I’d say yes, but I’m guessing that this issue isn’t your crux, so I’ll only elaborate if asked.
This sounds like you’re talking about a dumb agent; smart agents generate and compare multiple plans and don’t just go with the first plan they think of.
Generalization. For a general agent, thinking about counterfeit money plans isn’t that much different than thinking of plans to make money. Like if the agent is able to think of money making plans like starting a restaurant, or working as a programmer, or opening a printing shop, then it should also be able to think of a counterfeit money making plan. (That plan is probably quite similar to the printing shop plan.)
Could be from rewards or other “external” feedback, could be from TD/bootstrapped errors, could be from an imitation loss or something else. The base case is probably just plain ol’ rewards that get backpropagated through the action head via policy gradients.
Sorry for being unclear, I think you’re talking about a different dimension of embeddedness than what I was pointing at. I was talking about the issue of logical uncertainty: that the agent needs to actually run computation in order to figure out certain things. The agent can’t magically sample from P(h) proportional to exp(U(h)), because it needs the exp(U(h')) of all the other histories first in order to weight the distribution that way, which requires having already sampled h' and having already calculated U(h'). But we are talking about how it samples a history h in the first place! The “At best” comment was proposing an alternative that might work, where the agent samples from a prior that’s been tuned based on U(h). Notice, though, that “our sampling is biased towards certain histories over others because they resulted in higher utility” does not imply “if a history would result in higher utility, then our sampling will bias towards it”.
Consider a parallel situation: sampling images and picking one that gets the highest score on a non-robust face classifier. If we were able to sample from the distribution of images proportional to their (exp) face classifier scores, then we would need to worry a lot about picking an image that’s an adversarial example to our face classifier, because those can have absurdly high scores. But instead we need to sample images from the prior of a generative model like FFHQ StyleGAN2-ADA or DDPM, and score those images. A generative model like that will tend strongly to convert whatever input entropy you give it into a natural-looking image, so we can sample & filter from it a ton without worrying much about adversarial robustness. Even if you sample 10K images and pick the 5 with the highest face classifier scores, I am betting that the resulting images will still really be of faces, rather than classifier-adversarial examples. It is true that “our top-5 images are biased towards some images over others because they had higher face classification scores” but it is not true that “if adversarial examples would have higher face classification scores than regular images, then our sampling will bias towards adversarial examples.”
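To make the sample-and-filter point concrete, here is a minimal toy sketch in one dimension. Everything in it is a made-up stand-in, not anything from the thread: `classifier_score` is a hypothetical non-robust scorer with a spurious spike far from the data, and `sample_from_generator` is a hypothetical learned prior (a unit Gaussian). It only illustrates the causal pattern, not how a real face classifier or StyleGAN behaves.

```python
import random

def classifier_score(x):
    """Hypothetical non-robust scorer: sensible near the data, wildly wrong far away."""
    honest = -abs(x - 1.0)  # prefers x near 1, where the (assumed) training data lived
    adversarial_spike = 1000.0 if abs(x - 50.0) < 0.5 else 0.0  # spurious off-distribution jackpot
    return honest + adversarial_spike

def sample_from_generator(rng):
    """Hypothetical learned prior: only ever proposes natural-looking candidates."""
    return rng.gauss(0.0, 1.0)

def sample_and_filter(n_samples, top_k, rng):
    """Draw n_samples candidates from the prior and keep the top_k by score."""
    candidates = [sample_from_generator(rng) for _ in range(n_samples)]
    return sorted(candidates, key=classifier_score, reverse=True)[:top_k]

def exhaustive_argmax(lo, hi, step):
    """Score every grid point in [lo, hi]; this is the search that finds the spike."""
    n_steps = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n_steps + 1)]
    return max(grid, key=classifier_score)

rng = random.Random(0)
top5 = sample_and_filter(10_000, 5, rng)
worst_case = exhaustive_argmax(-100.0, 100.0, 0.25)
print("top-5 from the prior:", [round(x, 3) for x in top5])
print("exhaustive argmax:", worst_case)
```

In this toy, the top-5 filtered samples all sit near x = 1, while the exhaustive search lands inside the spike near x = 50. Whether a real generator's prior is this well separated from a real scorer's failure modes is the open question in the discussion, and the sketch does not settle it.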
I disagree with this? The job of the value function is to look at the agent state (what the agent knows about the world) and estimate what the return will be based on that state & its implications. This involves “knowing things” and “thinking about things”. If the agent has a model T(s, a, s') that represents some state feature like “working a job” or “building a money-printer”, then that feature should exist in the state passed to the value function V(s) / Q(s, a) as well, such that the value function will “know” about it and incorporate it into its estimation process.
But it sounds like in the imagined scenario, the agent’s model and policy are sensitive to a bunch of stuff that the value function is blind to, which makes this configuration seem quite weird to me. If the value function was not blind to those features, then as the model goes from the training environment, where it got returns based on “getting paid money for tasks” (or whatever we rewarded for there) to the deployment environment, where the action space is even bigger, both the model/policy and the value function “generalize” and learn motivationally-relevant new facts that inform it what “getting paid money for tasks” looks like.
The plan-proposer isn’t trying to make the value function give it max points, it’s trying to [plan towards the current subgoal], which in this case is “make a million bucks”. The plan-proposer gets gradients/feedback from the value function, in the form of TD updates that tell it “this thought was current-subgoal-better/worse than expected upon consideration, do more/less of that in mental contexts like this”. But a thought like “Inspect my own value function to see what it’d give me positive TD updates for” evaluates as current-subgoal-worse for the subgoal of “make a million bucks” than a more actually-subgoal-relevant thought like “Think about what tasks are available nearby”, so the advantage is negative, which means a negative TD update, causing that thought to be progressively abandoned.
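The mechanics of "negative advantage, so the thought is progressively abandoned" can be sketched with a two-option softmax policy. This is a simplification I am swapping in, not the agent design under discussion: it uses the expected policy-gradient step rather than sampled TD updates, and the `evaluations` numbers are simply assumed. Whether the evaluations really come out this way is exactly what is in dispute, so the sketch only shows what follows if they do.

```python
import math

def softmax(logits):
    """Convert a dict of logits into a dict of probabilities."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def advantage_update(logits, evaluations, learning_rate):
    """One expected policy-gradient step: logit_k += lr * p_k * (eval_k - baseline)."""
    probs = softmax(logits)
    baseline = sum(probs[k] * evaluations[k] for k in logits)  # value expected under the policy
    return {
        k: logits[k] + learning_rate * probs[k] * (evaluations[k] - baseline)
        for k in logits
    }

# ASSUMED subgoal-relevance scores for two candidate thoughts (hypothetical numbers).
evaluations = {"inspect_own_value_function": 0.0, "look_for_nearby_tasks": 1.0}
logits = {k: 0.0 for k in evaluations}

history = []  # probability of the introspective thought before each update
for _ in range(200):
    history.append(softmax(logits)["inspect_own_value_function"])
    logits = advantage_update(logits, evaluations, learning_rate=0.5)

final = softmax(logits)
print("start:", history[0], "end:", round(final["inspect_own_value_function"], 4))
```

The thought with below-baseline evaluation has negative advantage on every step, so its probability falls monotonically from 0.5 towards zero.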
Once again, I agree that the agent is smart, and generalizes to be able to think of such a plan, and even plans much more sophisticated than that. I am arguing that it won’t particularly want to produce/execute those plans, unless its motivations & other cognition are biased in a very specific way.
Thanks for the reply. Just to prevent us from spinning our wheels too much, I’m going to start labelling specific agent designs, since it seems like some talking-past-each-other may be happening where we’re thinking of agents that work in different ways when making our points.
PolicyGradientBot: Defined by the following description:
ThermodynamicBot: Defined by the following description:
$$P(h) = \frac{\exp(U(h)/T)}{\sum_{h'} \exp(U(h')/T)}$$
Comments on ThermodynamicBot
This bot is of course a bounded agent, and so the world model can’t be perfect, but consider the following steps:
1. For each possible h, compute U(h) and exp(U(h)/T).
2. Compute the sum $Z = \sum_{h'} \exp(U(h')/T)$.
3. Now we know the probability for any given history h: it’s exp(U(h)/T)/Z.
This is a finite sequence of computational steps that terminates without self-reference, so no logical induction is needed here. Now you may fairly object that there is still an issue of computational complexity. The space of histories is exponentially large, so in practice the computation couldn’t be completed in time. This is the known-to-be-hard problem of computing the partition function. But the problem is tractable in many special cases, and humans get by well enough in our own reasoning about a world full of combinatorial explosions. We can suppose that, at the cost of making itself even more of an approximation, the world model has a way to efficiently sample from the distribution, even given the difficulty of computing Z. To take one particular concrete way this could be implemented, if the world model is a factor graph with few or no loops, then we can do the computation by adding on a few factors to account for exp(U(h)/T) and then using belief propagation to solve it.
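The three steps above can be written out directly for a history space small enough to enumerate. This is a minimal sketch of the brute-force version only (not the factor-graph or belief-propagation variant); the history names and utilities in `toy_utilities` are hypothetical placeholders.

```python
import math
import random

def boltzmann_distribution(utilities, temperature):
    """Return P(h) = exp(U(h)/T) / Z for every history h in `utilities`."""
    # Subtract the max utility before exponentiating for numerical stability.
    # This rescales every weight by the same factor, so P(h) is unchanged.
    u_max = max(utilities.values())
    weights = {h: math.exp((u - u_max) / temperature) for h, u in utilities.items()}
    z = sum(weights.values())  # partition function, up to the common factor
    return {h: w / z for h, w in weights.items()}

def sample_history(utilities, temperature, rng):
    """Sample one history with probability proportional to exp(U(h)/T)."""
    probs = boltzmann_distribution(utilities, temperature)
    histories = list(probs)
    return rng.choices(histories, weights=[probs[h] for h in histories], k=1)[0]

# Hypothetical toy history space with made-up utilities.
toy_utilities = {"idle": 0.0, "do_task": 1.0, "make_diamonds": 2.0}
probs = boltzmann_distribution(toy_utilities, temperature=0.5)
chosen = sample_history(toy_utilities, temperature=0.5, rng=random.Random(0))
print(probs)
print("sampled:", chosen)
```

The cost here is linear in the number of histories, which is the part that blows up when the history space is exponentially large.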
I agree with your predictions for what would happen here. ThermodynamicBot isn’t really a GAN, though. If we tried to train a GAN to sub-in for ThermodynamicBot’s world model, then we could do it two ways (2nd way is most similar to your proposal). In both cases, the generator produces candidate histories, h.
1. Discriminator tries to predict what’s real (as usual), but generator rewards are shifted by U(h)/T to encourage it to produce outputs that are properly utility-weighted. (For simplicity, we can suppose that U(h) is differentiable so that gradients can flow back through it to the generator.) I expect this results in the generator producing adversarial examples if U is not adversarially robust.
2. Discriminator tries to predict what’s real and generator rewards are not shifted. Instead, we sample 10k histories from the generator and weight them all by utility. Or even just go full argmax and choose the history with the highest utility. 10k is not that large, it gives the agent about 13 bits of optimization power (log2 of 10,000 is roughly 13.3). In general, if it takes n bits of optimization power on the latent vector to convince the generator to construct an adversarial example, then we can safely use this scheme to provide slightly less than n bits of useful optimization. But we could do that multiple times, so maybe it’s a bit more useful than that. Also, maybe n is super large so it’s fine in practice?
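For reference, the bits-of-optimization figure for best-of-n selection can be computed as log2(n), under the common convention that picking the best of n draws selects roughly the top 1/n of the generator's distribution. The helper names below are hypothetical, and `best_of_n` takes whatever generator and utility callables you supply.

```python
import math
import random

def selection_bits(n_samples, n_kept=1):
    """Bits of selection pressure from keeping the top n_kept of n_samples draws."""
    return math.log2(n_samples / n_kept)

def best_of_n(generate, utility, n_samples, rng):
    """Full-argmax variant: sample n histories from the generator, return the best."""
    return max((generate(rng) for _ in range(n_samples)), key=utility)

print(round(selection_bits(10_000), 2))  # 13.29

# Toy usage with a stand-in generator (uniform numbers) and utility (identity).
best = best_of_n(lambda rng: rng.random(), lambda h: h, 10_000, random.Random(0))
print(best)
```

For 10,000 samples this gives about 13.3 bits, and doubling the sample count adds only one more bit.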
Comments on PolicyGradientBot
So, the standard criticism of policy gradient is that it’s noisy and doesn’t really allow for credit assignment. In particular, I think the lack of credit assignment is really a crucial flaw that will prevent policy gradient from ever being used to create an AGI. As far as I know, no agent powered purely by policy gradient does anything particularly impressive, though I could be wrong about that. Do you have any arguments / ideas for how policy gradient could be made to work here?
General Clarifications of my Position
Just to clarify, my model doesn’t predict that the AI will use introspection on its own value function, or even look at its own source code at all. Some AIs may be designed to do that, but it’s not required for the failure case I’m considering. If gradients are flowing backwards from the value function to the actor (Bot-specific statement alert: Actor-Critic bots) then the actor has probably absorbed a significant amount of information about the value function’s misalignment. I don’t think it needs to take any additional steps of inspecting the agent’s own weights, or anything like that. You may object that gradients need not necessarily flow backwards there. After all, policy gradient and temporal difference learning are a thing. Let’s join the thread of discussion where you make that objection onto the PolicyGradientBot discussion.
Also, my model doesn’t predict that the agent’s subgoal reasoning will suddenly go off the rails and fail to achieve the subgoal in question because the agent was too busy thinking about how to counterfeit money. If you’re factoring the agent’s reasoning into subgoals, you have to remember that there’s a factor that actually sets the subgoals, and that’s where there’s the potential to go off the rails and start considering subgoals like “print lots of counterfeit money”. Obviously in the context where the agent is already considering the “get paid money for performing tasks” subgoal, the agent’s reasoning isn’t going to get screwed up.
Sounds good.
Comments on ThermodynamicBot
If we assume that the agent is making decisions by (approximately) plugging in every possible h into U(h) and picking based on (the partition function derived from) that, then of course you need U(h) to be adversarially robust! I disagree with that as a model of how planning works or should work. IMO, not only is “plug every possible h into U(h)” extremely computationally infeasible, but even if it were feasible it would be a foreseeably-broken (because fragile) planning strategy.
Quote from a comment of TurnTrout about argmax planning, though I think it also applies to ThermodynamicBot, since that just does a softened version of argmax planning (converging to argmax planning as T -> 0):
I think the sorts of planning methods that try to approximate in the real world the behavior of “think about all possible plans and pick a good one” are unworkable in the limit, not just from an alignment standpoint but also from a practical capability standpoint, so I don’t expect us to build competent agents that use them, so I don’t worry about them or their attendant need for adversarial robustness.
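For reference, the claim that ThermodynamicBot converges to argmax planning as T goes to 0 can be checked directly. The sketch assumes a finite set of histories and a unique maximizer $h^*$ of U; both assumptions are added here and are not stated in the thread. Dividing the numerator and denominator of P(h) by exp(U(h)/T) gives

$$P_T(h) = \frac{\exp(U(h)/T)}{\sum_{h'} \exp(U(h')/T)} = \frac{1}{\sum_{h'} \exp\big((U(h') - U(h))/T\big)}$$

If $h = h^*$, every term with $h' \neq h^*$ has a negative exponent and vanishes as $T \to 0^+$, leaving only the $h' = h^*$ term, which equals 1. If $h \neq h^*$, the $h' = h^*$ term has a positive exponent and diverges, so the whole fraction goes to 0. Hence

$$\lim_{T \to 0^+} P_T(h) = \begin{cases} 1 & \text{if } h = h^* \\ 0 & \text{otherwise} \end{cases}$$

so in the low-temperature limit all probability mass sits on the single highest-utility history, whatever that history is.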
Right, I wasn’t thinking of it as actually a GAN, just giving an analogy where similar causal patterns are in play, to make my point clearer. But yeah, if we wanted to actually use a GAN, your suggestions sound reasonable.
Comments on PolicyGradientBot & General Position
I guess, but I’m confused why we’re talking about competitiveness all of a sudden. I mean, variants on policy gradient algorithms (PPO, Actor-Critic, etc.) do some impressive things (at least to the extent any RL algorithms currently do impressive things). And I can imagine more sophisticated versions of even plain policy gradient that would do impressive things, if the action space includes mental actions like “sample a rollout from my world model”.
But that’s neither here nor there IMO. In the previous comment, I tried to be clear that it makes ~no difference to me where the gradients come from, when I said
Because I think that the outline of my argument doesn’t depend on one specific RL algorithm.
Consider ActorCriticBot. I would make the same argument for it as I did in my previous comment, that the actor is optimized by the critic’s outputs but that the actor is not optimizing for critic outputs. It does not particularly matter that Critic("I made counterfeit money") >= Critic("I got money for doing the task") or that Critic("Extremely out-of-distribution state that Critic happens to evaluate super highly") >> Critic("I got money for doing the task"), because the actor in actor-critic doesn’t make its decisions by running an internal CriticEstimator(plan) and doing whatever evaluates best. It makes its decisions by looking at the current state and weighing the decision-factors triggered by that state; decision-factors that were internalized because they were upstream of actual positive feedback it received in the past from the environment+critic.
In the training distribution, the agent never reached a state like "I made counterfeit money", so the critic never gave feedback from that “misaligned” portion of the state space, so the actor never got gradients from the critic that differentially upweighted the actor’s concern for counterfeit money-specific factors, so the actor never internalized the particular antecedents of counterfeit money as motivating, so the actor doesn’t factor them into its decision-making. Whereas the actor does factor things like “Am I doing the task yet?” into its decisions, because the actor internalized those as antecedents of reward/value, because the actor got gradients from the critic that differentially upweighted the agent’s concern for those factors, because those were part of the state space covered by the training distribution, because the agent actually reached states where it got paid for e.g. task-completion.
Comments on ThermodynamicBot
To be clear, I’m not saying ThermodynamicBot does the computation the slow exponential way. I already explained how it could be done in polynomial time, at least for a world model that looks like a factor graph that’s a tree. Call this ThermodynamicBot-F. You could also imagine the role of “world model” being filled by a neural network (blob of weights) that approximates the full thermodynamic computation. We can call this ThermodynamicBot-N.
Yes, I understand that running a search that will kill you if it succeeds is dumb. This has been known for many years. The question is how do we actually write a program to do a sane search? You quote TurnTrout:
I don’t find this particularly helpful. If we know which plans are adversarial so we can eliminate them from the search space, we’re already half way to solving alignment. I don’t think the plans a bounded agent is going to eliminate so that it can finish its thinking on time are automatically going to be the adversarial ones. I think this is a problem that is going to take actual effort.
In particular for ThermodynamicBot:
Case where the world model is implemented in a factor graph (ThermodynamicBot-F): This gives exactly the same result as searching across all inputs, but the computation is efficient, and not really wasteful in any sense. If we imagine trying to “improve” the belief propagation algorithm to simultaneously make it more efficient and also remove some subset of plans it’s searching over that are “adversarial”, I can’t really imagine a way to do that, and it would certainly make the algorithm more complicated and less elegant.
Case where a neural network world model is being used (ThermodynamicBot-N): In this case there are likely plans that will be missed by ThermodynamicBot-N because of the bounded nature of its world model, even though they would be found by searching across all inputs. But if we imagine training the world model to make it better, I would generally expect this to increase the world model’s ability to find adversarial plans just like it increases its ability to find good plans. In general, I don’t expect there to be any correlation where all the adversarial plans happen to be eliminated due to bounded reasoning. Why should we be so lucky that all the errors we’re making happen to cancel each other out?
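To make the factor-graph case concrete, here is a minimal sketch, not the actual ThermodynamicBot design. It assumes, purely for illustration, that each plan is weighted by exp(score / temperature) and that the score decomposes along a chain (the simplest tree). The names `unary`, `pairwise`, and `temperature`, and the toy numbers, are all hypothetical. The point it shows is that message passing returns the same distribution over the first action as enumerating every plan, so nothing is pruned, including any plan that exploits a spike in the score.

```python
import itertools
import math

def plan_score(plan, unary, pairwise):
    """Score of a plan: per-step terms plus terms linking adjacent steps."""
    total = sum(unary[t][a] for t, a in enumerate(plan))
    total += sum(pairwise[plan[t]][plan[t + 1]] for t in range(len(plan) - 1))
    return total

def first_action_marginal_brute(unary, pairwise, temperature):
    """Enumerate all K**T plans: exponential in the plan length T."""
    n_steps, n_actions = len(unary), len(unary[0])
    weights = [0.0] * n_actions
    for plan in itertools.product(range(n_actions), repeat=n_steps):
        weights[plan[0]] += math.exp(plan_score(plan, unary, pairwise) / temperature)
    z = sum(weights)
    return [w / z for w in weights]

def first_action_marginal_bp(unary, pairwise, temperature):
    """Sum-product messages from the last step back to the first: O(T * K**2)."""
    n_steps, n_actions = len(unary), len(unary[0])
    message = [1.0] * n_actions  # nothing lies beyond the final step
    for t in range(n_steps - 1, 0, -1):
        # message[a] sums over everything from step t onward, given action a at step t-1
        message = [
            sum(
                math.exp((pairwise[a][b] + unary[t][b]) / temperature) * message[b]
                for b in range(n_actions)
            )
            for a in range(n_actions)
        ]
    weights = [math.exp(unary[0][a] / temperature) * message[a] for a in range(n_actions)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical toy problem: 4 steps, 3 actions per step.
# The 3.0 entry stands in for a spot where the value estimate is badly inflated.
unary = [[0.0, 1.0, 0.5], [0.2, 0.0, 3.0], [0.1, 0.4, 0.0], [0.3, 0.0, 0.2]]
pairwise = [[0.0, 0.5, -0.5], [0.2, 0.0, 1.0], [-1.0, 0.3, 0.0]]

brute = first_action_marginal_brute(unary, pairwise, temperature=0.5)
bp = first_action_marginal_bp(unary, pairwise, temperature=0.5)
print(brute)
print(bp)
```

Both functions compute the same quantity; only the cost differs. Any inflated entry in the score affects the message-passing result exactly as much as it affects the exhaustive search.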
I agree if we’re literally talking about brute force search here. If we’re talking about the more realistic ThermodynamicBot designs I’ve mentioned, then I’m not sure I agree. In some sense, all methods an agent could use to plan are “picking plans from plan-space that are better than most other plans”. Even ActorCriticBot is “trying” to approximate argmax. If we could train it to minimal loss, it would be an ArgMaxBot. Is there some particular approximation or heuristic that we can adopt, where if we do adopt it we go from dangerously approaching ArgMaxBot to safely searching through only good plans? An approximation used by ActorCriticBot, but not by ThermodynamicBot-N? If so, I have no idea what the crucial approximation is that you could be thinking of.
I also don’t think it’s at all obvious that ThermodynamicBot designs are necessarily capability-limited. It makes a lot of sense to integrate planning very closely with the world model. Might be worth betting on the direction of future RL research here if we can set sufficiently objective resolution criteria? In any case, I do think this counts as some progress in this discussion, since we’ve found an example of an agent that we both agree your argument doesn’t apply to.
Comments on PolicyGradientBot vs ActorCriticBot
In my view, there’s kind of a huge gulf between PolicyGradientBot and ActorCriticBot, where the gradients flowing backwards into ActorCriticBot’s actor end up carrying a lot of information. This allows for much better performance, and in particular much better sample efficiency, at the cost that some of the information is about weaknesses in ActorCriticBot’s critic.
To take a particular example, if the critic overvalues blue diamonds, then gradients flowing into the actor are going to be steeper for actions that obtain blue diamonds. Then in a new environment where there’s a bucket of blue paint sitting in the corner, it seems reasonable to expect that the actor might try to use that bucket to paint diamonds blue, at least assuming it’s sufficiently intelligent and flexible.
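The blue-diamond example can be sketched as a toy one-step problem. This is an illustration under stated assumptions, not a model of any real actor-critic implementation: the actor is a softmax policy over two hypothetical actions, it is updated with the exact expected policy gradient, and `critic_value` is a made-up critic that rates blue diamonds 1.5 when their true value is 1.0. The unbiased run is included only as a reference point.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train_actor(signal, steps=200, lr=0.5):
    """Gradient ascent on expected signal for a softmax policy over one choice.

    signal[a] is whatever number the actor is trained against for action a
    (here: either the true value or a critic's estimate of it).
    """
    logits = [0.0] * len(signal)
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * s for p, s in zip(probs, signal))
        # d/d logit_a of E[signal] is p_a * (signal_a - baseline)
        logits = [l + lr * p * (s - baseline) for l, p, s in zip(logits, probs, signal)]
    return softmax(logits)

ACTIONS = ["fetch white diamond", "fetch blue diamond"]  # hypothetical actions
true_value = [1.0, 1.0]    # both diamonds are actually worth the same
critic_value = [1.0, 1.5]  # assumed critic error: blue is overvalued

policy_true = train_actor(true_value)
policy_critic = train_actor(critic_value)

for name, p_true, p_critic in zip(ACTIONS, policy_true, policy_critic):
    print(f"{name}: unbiased signal {p_true:.3f}, biased critic {p_critic:.3f}")
```

With the unbiased signal the actor stays indifferent between the two actions, while with the biased critic it ends up strongly preferring blue. The critic's error has been copied into the actor's preferences, which is the information flow the paragraph above describes. Whether that preference generalizes to painting diamonds blue in a new environment is not something this toy can show.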
For PolicyGradientBot on the other hand, while it could still result in alignment failures, it seems much more like we’re just directly training a policy. But PolicyGradientBot has very poor sample efficiency.
WRT other algorithms like temporal difference learning that lie kind of in between PolicyGradientBot and ActorCriticBot, I think the question of what happens for ActorCriticBot is already a crux in this discussion, but feel free to add more bot types if you think it would be useful.
Is ActorCriticBot robust?
Again, I’m not saying a brute force search over plans is being done here, but I’d generally expect that what the actor is doing is very strongly linked to what the critic values, and I’d say it’s very likely that the actor has lots of components inside of it roughly related to the question “what is the critic going to think about this situation?” For example, if the critic consistently overvalues blue, then I’d predict that the actor has lots of circuits inside of it related to blueness. Do you disagree with this?
Obviously the actor’s ideas of what’s good aren’t going to be perfectly faithful to the critic: There will exist some adversarial plans that the actor just isn’t going to generate, but again the question is: Why should we be so lucky that the errors we’re making exactly cancel out? I don’t see any reason to expect that the actor’s imperfect approximation of the critic and the critic’s imperfect approximation of our true desires should cancel out so well that the actor never generates any adversarial plans at all.