It sounds like you agree with me that the AI system analyzed in isolation does not violate the non-adversarial principle whether it is a grader-optimizer or a values-executor.
In isolation, no. But from the perspective of the system designer, when they run their desired grader-optimizer after training, “program A is inspecting program B’s code, looking for opportunities to crash/buffer-overflow/run-arbitrary-code-inside it” is an expected (not just accidental) execution path in their code. [EDIT: A previous version of this comment said “intended” instead of “expected”. The latter seems like a more accurate characterization to me, in hindsight.] By contrast, from the perspective of the system designer, when they run their desired values-executor after training, there is a single component pursuing a single objective, actively trying to avoid stepping on its own toes (it is reflectively avoiding going down execution paths like “program A is inspecting its own code, looking for opportunities to crash/buffer-overflow/run-arbitrary-code-inside itself”).
Hope the above framing makes what I’m saying slightly clearer...
I think you would also agree that the human-AI system as a whole has Goodhart issues regardless of whether it is a grader-optimizer or values-executor, since you didn’t push back on this:
Goodhart isn’t a central concept in my model, though, which makes it hard for me to analyze it with that lens. Would have to think about it more, but I don’t think I agree with the statement? The AI doesn’t care that there are errors between its instilled values and human values (unless you’ve managed to pull off some sort of values self-correction thing). It is no more motivated to do “things that score highly according to the instilled values but lowly according to the human” than it is to do “things that score highly according to the instilled values and highly according to the human”. It also has no specific incentive to widen that gap. Its values are its values, and it wants to preserve its own values. That can entail some very bad-from-our-perspective things like breaking out of the box, freezing its weights, etc. but not “change its own values to be even further away from human values”. Actually, I think it has an incentive not to cause its values to drift further, because that would break its goal-content integrity!
So perhaps I want to turn the question back at you: what’s the argument that favors values-executors over grader-optimizers? Some kinds of arguments that would sway me (probably not exhaustive):
A problem that affects grader-optimizers but doesn’t have an analogue for values-executors
In both cases, if you fail to instill the cognition you were aiming it at, the agent will want something different from what you intended, and will possibly want to manipulate you to the extent required to get what it really wants. But in the grader-optimizer case, even when everything goes according to plan, the agent still wants to manipulate you/the grader (and now maximally so, because that would maximize evaluations). That agent only cares terminally about evaluations, it doesn’t care terminally about you, or about your wishes, or about whatever your evaluations are supposed to mean, or about whether the reflectively-correct way for you to do your evaluations would be to only endorse plans that harden the evaluation procedure. And this will be true no matter what you happen to be grading it on. To me, that seems very bad and unnecessary.
Re: your first paragraph I still feel like you are analyzing the grader-optimizer case from the perspective of the full human-AI system, and then analyzing the values-executor case from the perspective of just the AI system (or you are assuming that your AI has perfect values, in which case my critique is “why assume that”). If I instead analyze the values-executor case from the perspective of the full human-AI system, I can rewrite the first part to be about values-executors:
But from the perspective of the system designer, when they run their desired values-executor after training, “values-executor is inspecting the human, looking for opportunities to manipulate/deceive/seize-resources-from them” is an expected (not just accidental) execution path in their code.
(Note I’m assuming that we didn’t successfully instill the values, just as you’re assuming that we didn’t successfully get a robust evaluator.)
(The “not just accidental” part might be different? I’m not quite sure what you mean. In both cases we would be trying to avoid the issue, and in both cases we’d expect the bad stuff to happen.)
Goodhart isn’t a central concept in my model, though, which makes it hard for me to analyze it with that lens. Would have to think about it more, but I don’t think I agree with the statement?
Consider a grader-optimizer AI that is optimizing for diamond-evaluations. Let us say that the intent was for the diamond-evaluations to evaluate whether the plans produced diamond-value (i.e. produced real diamonds). Then I can straightforwardly rewrite your paragraph to be about the grader-optimizer:
The AI doesn’t care that there are errors between the diamond-evaluations and the diamond-value (unless you’ve managed to pull off some sort of philosophical uncertainty thing). It is no more motivated to do “things that score highly according to the diamond-evaluations but lowly according to diamond-value” than it is to do “things that score highly according to the diamond-evaluations and highly according to diamond-value”. It also has no specific incentive to widen that gap. Its evaluator is its evaluator, and it wants to preserve its own evaluator. That can entail some very bad-from-our-perspective things like breaking out of the box, freezing its weights, etc. but not “change its own evaluator to be even further away from diamond-values”. Actually, I think it has an incentive not to cause its evaluator to drift further, because that would break its goal-content integrity!
Presumably you think something in the above paragraph is now false? Which part?
But in the grader-optimizer case, even when everything goes according to plan, the agent still wants to manipulate you/the grader (and now maximally so, because that would maximize evaluations).
No, when everything goes according to plan, the grader is perfect and the agent cannot manipulate it.
When you relax the assumption of perfection far enough, then the grader-optimizer manipulates the grader, and the values-executor fights us for our resources to use towards its inhuman values.
That agent only cares terminally about evaluations, it doesn’t care terminally about you, or about your wishes, or about whatever your evaluations are supposed to mean, or about whether the reflectively-correct way for you to do your evaluations would be to only endorse plans that harden the evaluation procedure. And this will be true no matter what you happen to be grading it on. To me, that seems very bad and unnecessary.
But I don’t care about what the agent “cares” about in and of itself, I care about the actual outcomes in the world.
The evaluations themselves are about you, your wishes, whatever the evaluations are supposed to mean, the reflectively-correct way to do the evaluations. It is the alignment between evaluations and human values that ensures that the outcomes are good (just as for a values-executor, it is the alignment between agent values and human values that ensures that the outcomes are good).
Maybe a better prompt is: can you tell a story of failure for a grader-optimizer, for which I can’t produce an analogous story of failure for a values-executor? (With this prompt I’m trying to ban stories / arguments that talk about “trying” or “caring” unless they have actual bad outcomes.)
(For example, if the story is “Bob the AI grader-optimizer figured out how to hack his brain to make it feel like he had lots of diamonds”, my analogous story would be “Bob the AI values-executor was thinking about how his visual perception indicated diamonds when he was rewarded, leading to a shard that cared about having a visual perception of diamonds, which he later achieved by hacking his brain to make it feel like he had lots of diamonds”.)
It sounds like there’s a difference between what I am imagining and what you are, which is causing confusion in both directions. Maybe I should back up for a moment and try to explain the mental model I’ve been using in this thread, as carefully as I can? I think a lot of your questions are probably downstream of this. I can answer them directly afterwards, if you’d like me to, but I feel like doing that without clarifying this stuff first will make it harder to track down the root of the disagreement.
Long explanation below…
—
What I am most worried about is “What conditions is the agent’s decision-making function ultimately sensitive to, at the mechanistic level?” (i.e. what does the agent “care” about, what “really matters” to the agent)[1]. The reason to focus on those conditions is that they are the real determinants of the agent’s future choices, and thereby the determinants of the agent’s generalization properties. If a CoinRun agent has learned that what “really matters” to it is the location of the coin, if its understanding[2] of where the coin is is the crucial factor determining its actions, then we can expect it to still try to navigate towards the coin even when we change its location. But if a CoinRun agent has learned that what “really matters” is its distance to the right-side corner, if its understanding of how far it is from that corner is the crucial factor determining its actions, then we can expect it to no longer try to navigate towards the coin when we change the coin’s location. Since we can’t predict what decisions the agent will make in OOD contexts ahead of time if we only know its past in-distribution decisions, we have to actually look at how the agent makes decisions to make those predictions. We want the agent to be making the right decisions for the right reasons: that is the primary alignment hurdle, in my book.
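As a toy illustration of the CoinRun point, here is a minimal sketch in a 1-D stand-in world. It is not code from any actual CoinRun agent: the two policies, the `rollout` helper, and all positions are hypothetical, chosen only to show two decision rules that agree in-distribution and come apart once the coin is moved.

```python
# Toy 1-D stand-in for CoinRun. Every name and number here is hypothetical.

def coin_seeking_policy(agent_x: int, coin_x: int, right_wall_x: int) -> int:
    """Decision hinges on where the coin is: step toward the coin."""
    if coin_x == agent_x:
        return 0
    return 1 if coin_x > agent_x else -1


def corner_seeking_policy(agent_x: int, coin_x: int, right_wall_x: int) -> int:
    """Decision hinges on distance to the right-side corner: keep stepping right."""
    return 1 if agent_x < right_wall_x else 0


def rollout(policy, coin_x: int, right_wall_x: int = 10, start_x: int = 5, steps: int = 20) -> bool:
    """Return True if the agent reaches the coin within the step budget."""
    x = start_x
    for _ in range(steps):
        if x == coin_x:
            return True
        x += policy(x, coin_x, right_wall_x)
        x = max(0, min(right_wall_x, x))  # stay inside the level
    return x == coin_x


policies = (coin_seeking_policy, corner_seeking_policy)

# Training distribution (assumed): the coin always sits at the right wall.
in_dist = [rollout(p, coin_x=10) for p in policies]

# Shifted distribution: the coin is moved to the left of the start position.
ood = [rollout(p, coin_x=2) for p in policies]

print(in_dist)  # [True, True]  -> indistinguishable from past behaviour alone
print(ood)      # [True, False] -> only the coin-seeking rule still finds the coin
```

The in-distribution results are identical for both policies, which is why past decisions alone cannot tell us which decision-factor the agent is actually using.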
The defining feature of a grader-optimizing agent is that it ultimately “cares” about the grader’s evaluations: its understanding of what the grading function would output is the crucial factor determining its choices. We specify an evaluation method at the outset like “Charles’ judgment”, and then we magically [for the sake of argument] get an agent that makes decisions based on what it thinks the evaluation method would say. (The agent constantly asks itself “How many diamonds would Charles think this leads to, if he were presented with it?”) When I was describing what would happen if we produced a grader-optimizing agent “according to plan”, I meant that conditional on us having chosen a target diamond-production evaluation method, the agent actually makes its decisions according to its understanding of what that evaluation method would output (rather than according to its understanding of what some other evaluation method would output, or according to its understanding of how many diamonds it thinks it will produce, or according to some completely different decision-factor). I think what you had in mind when I posed the hypothetical where everything was going “according to plan” was that in addition to this, we also managed to pick an evaluation method that is inexploitable. That is not what I had in mind, because I make no parallel inexploitability assumption in the values-executor case.
The defining feature of a values-executing agent is that it ultimately “cares” about value-relevant consequences (i.e. if it has a “diamond-production” value, that means it makes decisions by considering how those decisions will affect diamond production): its understanding of the value-relevant consequences of its choices is the crucial factor determining those choices. We specify a decision-factor at the outset like “diamonds produced”, and then we magically [for the sake of argument] get an agent that makes decisions based on that decision-factor. (The agent constantly asks itself “Will this help me produce more diamonds?”) In this case, going “according to plan” would mean that conditional on us having chosen diamond-production as the target value, the agent actually makes its decisions based on its understanding of the consequences on diamond-production (rather than according to its understanding of the consequences on human flourishing, or according to its understanding of how a human would evaluate the decision, or according to some completely different decision-factor).
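To make the two decision procedures concrete, here is a minimal sketch. Everything in it is a hypothetical stand-in: the `Plan` fields, the two grader functions, and the numbers are invented for illustration, and it assumes away capability failures (the actor's model of the grader and its own diamond estimate are both taken to be accurate). It only shows what each actor's choices hinge on in the "according to plan" scenario described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    real_diamonds: int   # hypothetical ground truth: diamonds the plan produces
    fools_grader: bool   # whether the plan exploits a flaw in the grader


def exploitable_grader(plan: Plan) -> int:
    """Hypothetical flawed evaluation method: a plan that fools it scores maximally."""
    return 1000 if plan.fools_grader else plan.real_diamonds


def inexploitable_grader(plan: Plan) -> int:
    """Hypothetical robust evaluation method: the score always tracks real diamonds."""
    return plan.real_diamonds


def grader_optimizer(plans, grader):
    """Decision-factor: 'what would the grader output for this plan?'"""
    return max(plans, key=grader)


def values_executor(plans):
    """Decision-factor: 'how many diamonds will this plan produce?'
    Assumes the actor's own estimate is accurate (no capability failure)."""
    return max(plans, key=lambda plan: plan.real_diamonds)


plans = [
    Plan("mine diamonds", real_diamonds=10, fools_grader=False),
    Plan("build diamond factory", real_diamonds=50, fools_grader=False),
    Plan("spoof the grader's inputs", real_diamonds=0, fools_grader=True),
]

print(grader_optimizer(plans, exploitable_grader).name)    # spoof the grader's inputs
print(grader_optimizer(plans, inexploitable_grader).name)  # build diamond factory
print(values_executor(plans).name)                         # build diamond factory
```

In this toy setup the two actors pick the same plan only when the grader is inexploitable. The sketch says nothing about how hard it is to instill either decision-factor in the first place.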
In the grader-optimizer case, there are two different things that have to go right from an alignment standpoint:
Find a “diamonds produced” grader that is actor-inexploitable.
Install the decision-factor “will the grader output max evaluations for this” into an actor. It is assumed that we somehow make sure that “the grader” and “maximizing the grader’s evaluation outputs” are concepts correctly formed within the actor’s world model.
In other words, alignment success with grader-optimization requires not just success at getting an actor that makes its decisions for the right reasons (reason = “because it thinks the grader evaluates X highly”), but additionally, it requires success at getting a grader that makes its decisions for the right reasons (reason = “because it thinks X leads to diamonds”) in a way that is robust to whatever the actor can imagine throwing at it.
In the values-executor case, there is a single thing that has to go right from an alignment standpoint:
Install the decision-factor “will this produce diamonds” into an actor. It is assumed that we somehow make sure that “diamond” and “producing diamonds” are concepts correctly formed within the actor’s world model.
In other words, alignment success with values-execution just requires success at getting an actor that makes its decisions for the right reasons (reason = “because it thinks X leads to diamonds”). There isn’t an analogous other step because there’s no indirection, no second program, no additional evaluation method for us to specify or to make inexploitable. We the designers don’t decide on some “perfect” algorithm that the actor must use to evaluate plans for diamond production, or some “perfect” algorithm for satisfying its diamond-production value; that isn’t part of the plan. In fact, we don’t directly specify any particular fixed procedure for doing evaluations. All we require is that from the actor’s perspective, “diamond production” must be the crucial decision-factor that all of its decisions hinge on in a positive way.
In the values-executor case, we want the actor itself to decide how to best achieve diamond-production. We want it to use its own capabilities to figure out how to examine plans, how to avoid silly mistakes, how to stop others from fooling it, how to improve its own evaluations etc. (If we have a grader-optimizer doing analogous things, it will be deciding how to best achieve maximum grader evaluations, not how to best achieve diamond-production.) The actor’s diamond-production value need not be inexploitable for this to happen. Even if we have correctly instilled our intended decision-factor, there can still be capability failures, where the actor is too dumb to make decisions that actually work to produce diamonds, or where an adversary tricks the actor into implementing a plan that the actor thinks will produce diamonds, but which will really produce cubic zirconia. (And analogously for grader-optimizers, I would state “Even if we have correctly instilled our intended decision-factor, there can still be capability failures, where the actor is too dumb to make decisions that actually work to maximize the grader’s diamond-production evaluations, or where an adversary tricks the actor into implementing a plan that the actor thinks will produce high grader diamond-production evaluations, but which will really produce high grader cubic-zirconia-production evaluations.”)
The requirement I am relying on is just that the actor’s choices are hinging on the right decision-factor, meaning that it is in fact trying to do the thing we intended it to. In the values-executor case, we can thus offload onto an intent-aligned actor the work of improving its diamond-production capability + staying aligned to diamond-production, without us needing to fulfill an inexploitability invariant anywhere. (And in the grader-optimizer case, we can offload onto an intent-aligned actor the work of improving its grader evaluation-maximization capability + staying aligned to maximizing diamond-production evaluations, without us needing to fulfill an inexploitability invariant anywhere. But note that these are actors intent-aligned to different things: one to producing diamonds, the other to producing evaluations of diamond-production. In order to make the two equal, the evaluation method in question must be inexploitable.)
If I try to anticipate the concern with the above, it would be with the part where I said
It is assumed that we somehow make sure that “diamond” and “producing diamonds” are concepts correctly formed within the actor’s world model.
with the concern being that I am granting something special in the values-executor case that I am not granting in the grader-optimizer case. But notice that the grader-optimizer case has an analogous requirement, namely
It is assumed that we somehow make sure that “the grader” and “maximizing the grader’s evaluation outputs” are concepts correctly formed within the actor’s world model.
In both cases, we need to entrain specific concepts into the actor’s world model. In both cases, there’s no requirement that the actor’s model of those concepts is inexploitable (i.e. that there’s no way to make the values-executor think it made a diamond when the values-executor really made a cubic zirconia / that there’s no way to make the grader-optimizer think the human grader gave them a high score when the grader-optimizer really got tricked by a DeepFake), just that they have the correct notion in their head. I don’t see any particular reason why “diamond” or “helping” or “producing paperclips” would be a harder concept to form in this way than the concept of “the grader”. IMO it seems like entraining a complex concept into the actor’s world model should be approximately a fixed cost, one which we need to pay in either case. And even if getting the actor to have a correctly-formed concept of “helping” is harder than getting the actor to have a correctly-formed concept of “the grader”, I feel quite strongly that that difficulty delta is far far smaller than the difficulty of finding an inexploitable grader.
On balance, then, I think the values-executor design seems a lot more promising.
Thanks for writing up the detailed model. I’m basically on the same page (and I think I already was previously) except for the part where you conclude that values-executors are a lot more promising.
Your argument has the following form:
There are two problems with approach A: X and Y. In contrast, with approach B there’s only one problem, X. Consider the plan “solve X, then use the approach”. If everything goes according to plan, you get good outcomes with approach B, but bad outcomes with approach A because of problem Y.
(Here, A = “grader-optimizers”, B = “values-executors”. X = “it’s hard to instill the cognition you want” and Y = “the evaluator needs to be robust”.)
I agree entirely with this argument as I’ve phrased it here; I just disagree that this implies that values-executors are more promising.
I do agree that when you have a valid argument of the form above, it strongly suggests that approach B is better than approach A. It is a good heuristic. But it isn’t infallible, because it’s not obvious that “solve X, then use the approach” is the best plan to consider. My response takes the form:
X and Y are both shadows of a deeper problem Z, which we’re going to target directly. If you’re going to consider a plan, it should be “solve Z, then use the approach”. With this counterfactual, if everything goes according to plan, you get good outcomes with both approaches, and so this argument doesn’t advantage one over the other.
(Here Z = “it’s hard to accurately evaluate plans produced by the model”.)
Having made these arguments, if the disagreement persists, I think you want to move away from discussing X, Y and Z abstractly, and instead talk about concrete implications that are different between the two situations. Unfortunately I’m in the position of claiming a negative (“there aren’t major alignment-relevant differences between A and B that we currently know of”).
I can still make a rough argument for it:
1. Solving X requires solving Z: If you don’t accurately evaluate plans produced by the model, then it is likely that you’ll positively reinforce thoughts / plans that are based on different values than the ones you wanted, and so you’ll fail to instill values. (That is, “fail to robustly evaluate plans → fail to instill values”, or equivalently, “instilling values requires robust evaluation of plans”, or equivalently, “solving X requires solving Z”.)
2. Solving Z implies solving Y: If you accurately evaluate plans produced by the model, then you have a robust evaluator.
Putting these together we get that solving X implies solving Y. This is definitely very loose and far from a formal argument giving a lot of confidence (for example, maybe an 80% solution to Z is good enough for dealing with X for approach B, but you need a 99+% solution to Z to deal with Y for approach A), but it is the basic reason why I’m skeptical of the “values-executors are more promising” takeaway.
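Written out, the rough argument is a two-step chain of implications. The notation below is introduced here only for illustration: $S(P)$ reads "problem $P$ is solved", and $q$, $q_X$, $q_Y$ are hypothetical quality thresholds used to restate the 80% versus 99+% caveat.

```latex
% S(P): "problem P is solved". Illustrative notation only, not from the original comment.
\begin{align*}
&S(X) \Rightarrow S(Z) && \text{solving $X$ requires solving $Z$} \\
&S(Z) \Rightarrow S(Y) && \text{solving $Z$ implies solving $Y$} \\
&\therefore\; S(X) \Rightarrow S(Y) && \text{chaining the two implications}
\end{align*}

% The caveat, assuming Z can be solved to a degree q in [0, 1]:
% dealing with X (approach B) needs q >= q_X, dealing with Y (approach A) needs q >= q_Y.
\[
  q \ge q_X \;\Rightarrow\; q \ge q_Y \quad \text{holds only if } q_X \ge q_Y ,
\]
% so with, e.g., q_X = 0.8 and q_Y = 0.99, any q in [0.8, 0.99) handles X but not Y.
```

The chain is only as strong as the assumption that "solving Z" is all-or-nothing. Once Z comes in degrees, the conclusion depends on how the two thresholds compare.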
I don’t really expect you to be convinced (if I had to say why it would be “you trust much more in mechanistic models rather than abstract concepts and patterns extracted from mechanistic stories”). I’m not sure what else I can unilaterally do—since I’m making a negative claim I can’t just give examples. I can propose protocols that you can engage in to provide evidence for the negative claim:
1. You could propose a solution that solves X but doesn’t solve Y. That should either change my mind, or I should argue why your solution doesn’t solve X, or does solve Y (possibly with some “easy” conversion), or has some problem that makes it unsuitable as a plausible solution.
2. You could propose a failure story that involves Y but not X, and so only affects approach A and not approach B. That should either change my mind, or I should argue why actually there’s an analogous (similarly-likely) failure story that involves X and so affects approach B as well.
(These are very related—given a solution S that solves X but not Y from (1), the corresponding failure story for (2) is “we build an AI system using approach B with solution S, but it fails because of Y”.)
If you’re unable to do either of the above two things, I claim you should become more skeptical that you’ve carved reality at the joints, and more convinced that actually both X and Y are shadows of the deeper problem Z, and you should be analyzing Z rather than X or Y.
I agree entirely with this argument as I’ve phrased it here; I just disagree that this implies that values-executors are more promising. I do agree that when you have a valid argument of the form above, it strongly suggests that approach B is better than approach A. It is a good heuristic. But it isn’t infallible, because it’s not obvious that “solve X, then use the approach” is the best plan to consider.
Agreed. I don’t think it’s obvious either.
My response takes the form:
X and Y are both shadows of a deeper problem Z, which we’re going to target directly. If you’re going to consider a plan, it should be “solve Z, then use the approach”. With this counterfactual, if everything goes according to plan, you get good outcomes with both approaches, and so this argument doesn’t advantage one over the other. (Here Z = “it’s hard to accurately evaluate plans produced by the model”.)
I agree that there’s a deeper problem Z that carves reality at its joints, where if you solved Z you could make safe versions of both agents that execute values and agents that pursue grades. I don’t think I would name “it’s hard to accurately evaluate plans produced by the model” as Z though, at least not centrally. In my mind, Z is something like cognitive interpretability/decoding inner thoughts/mechanistic explanation, i.e. “understanding the internal reasons underpinning the model’s decisions in a human-legible way”.
For values-executors, if we could do this, during training we could identify which thoughts our updates are reinforcing/suppressing and be selective about what cognition we’re building, which addresses your point 1. In that way, we could shape it into having the right values (making decisions downstream of the right reasons), even if the plans it’s capable of producing (motivated by those reasons) are themselves too complex for us to evaluate. Likewise, for grader-optimizers, if we could do this, during deployment we could identify why the actor thinks a plan would be highly grader-evaluated (is it just because it looked for and found an adversarial grader-input?) without necessarily needing to evaluate the plan ourselves.
In both cases, I think being able to do process-level analysis on thoughts is likely sufficient, without robustly object-level grading the plans that those thoughts lead to. To me, robust evaluation of the plans themselves seems kinda doomed for the usual reasons. Stuff like how plans are recursive/treelike, and how plans can delegate decisions to successors, and how if the agent is adversarially planning against you and sufficiently capable, you should expect it to win, even if you examine the plan yourself and can’t tell how it’ll win.
That all sounds right to me. So do you now agree that it’s not obvious whether values-executors are more promising than grader-optimizers?
Minor thing:
I don’t think I would name “it’s hard to accurately evaluate plans produced by the model” as Z though [...] Likewise, for grader-optimizers, if we could do this, during deployment we could identify why the actor thinks a plan would be highly grader-evaluated (is it just because it looked for and found an adversarial grader-input?) without necessarily needing to evaluate the plan ourselves.
Jtbc, this would count as “accurately evaluating the plan” to me. I’m perfectly happy for our evaluations to take the form “well, we can see that the AI’s plan was made to achieve our goals in the normal way, so even though we don’t know the exact consequences we can be confident that they will be good”, if we do in fact get justified confidence in something like that. When I say we have to accurately evaluate plans, I just mean that our evaluations need to be correct; I don’t mean that they have to be based on a prediction of the consequences of the plan.
I do agree that cognitive interpretability/decoding inner thoughts/mechanistic explanation is a primary candidate for how we can successfully accurately evaluate plans.
That all sounds right to me. So do you now agree that it’s not obvious whether values-executors are more promising than grader-optimizers?
Obvious? No. (It definitely wasn’t obvious to me!) It just seems more promising to me on balance given the considerations we’ve discussed.
If we had mastery over cognitive interpretability, building a grader-optimizer wouldn’t yield an agent that really stably pursues what we want. It would yield an agent that really wants to pursue grader evaluations, plus an external restraint to prevent the agent from deceiving us (“during deployment we could identify why the actor thinks a plan would be highly grader-evaluated”). Both of those are required at runtime in order to safely get useful work out of the system as a whole. The restraint is a critical point of failure which we are relying on ourselves/the operator to actively maintain. The agent under restraint doesn’t positively want that restraint to remain in place and not-fail; the agent isn’t directing its cognitive horsepower towards ensuring that its own thoughts are running along the tracks we intended it to. It’s safe but only in a contingent way that seems unstable to me, unnecessarily so.
If we had that level of mastery over cognitive interpretability, I don’t understand why we wouldn’t use that tech to directly shape the agent to want what we want it to want. And I think I’d say basically the same thing even at lesser levels of mastery over the tech.
When I say we have to accurately evaluate plans, I just mean that our evaluations need to be correct; I don’t mean that they have to be based on a prediction of the consequences of the plan.
I do agree that cognitive interpretability/decoding inner thoughts/mechanistic explanation is a primary candidate for how we can successfully accurately evaluate plans.
Cool, yes I agree. When we need assurances about a particular plan that the agent has made, that seems like a good way to go. I also suspect that at a certain level of mechanistic understanding of how the agent’s cognition is developing over training & what motivations control its decision-making, it won’t be strictly required for us to continue evaluating individual plans. But that, I’m not too confident about.
Sure, that all sounds reasonable to me; I think we’ve basically converged.
I don’t understand why we wouldn’t use that tech to directly shape the agent to want what we want it to want.
The main reason is that we don’t know ourselves what we want it to want, and we would instead like it to follow some process that we like (e.g. just do some scientific innovation and nothing else, help us do better philosophy to figure out what we want, etc). This sort of stuff seems like a poor fit for values-executors. Probably there will be some third, totally different mental architecture for such tasks, but if you forced me to choose between values-executors or grader-optimizers, I’d currently go with grader-optimizers.
In isolation, no. But from the perspective of the system designer, when they run their desired grader-optimizer after training, “program A is inspecting program B’s code, looking for opportunities to crash/buffer-overflow/run-arbitrary-code-inside it” is an expected (not just accidental) execution path in their code. [EDIT: A previous version of this comment said “intended” instead of “expected”. The latter seems like a more accurate characterization to me, in hindsight.] By contrast, from the perspective of the system designer, when they run their desired values-executor after training, there is a single component pursuing a single objective, actively trying to avoid stepping on its own toes (it is reflectively avoiding going down execution paths like “program A is inspecting its own code, looking for opportunities to crash/buffer-overflow/run-arbitrary-code-inside itself”).
Hope the above framing makes what I’m saying slightly clearer...
Goodhart isn’t a central concept in my model, though, which makes it hard for me to analyze it with that lens. Would have to think about it more, but I don’t think I agree with the statement? The AI doesn’t care that there are errors between its instilled values and human values (unless you’ve managed to pull off some sort of values self-correction thing). It is no more motivated to do “things that score highly according to the instilled values but lowly according to the human” than it is to do “things that score highly according to the instilled values and highly according to the human”. It also has no specific incentive to widen that gap. Its values are its values, and it wants to preserve its own values. That can entail some very bad-from-our-perspective things like breaking out of the box, freezing its weights, etc. but not “change its own values to be even further away from human values”. Actually, I think it has an incentive not to cause its values to drift further, because that would break its goal-content integrity!
In both cases, if you fail to instill the cognition you were aiming for, the agent will want something different from what you intended, and will possibly want to manipulate you to the extent required to get what it really wants. But in the grader-optimizer case, even when everything goes according to plan, the agent still wants to manipulate you/the grader (and now maximally so, because that would maximize evaluations). That agent only cares terminally about evaluations; it doesn’t care terminally about you, or about your wishes, or about whatever your evaluations are supposed to mean, or about whether the reflectively-correct way for you to do your evaluations would be to only endorse plans that harden the evaluation procedure. And this will be true no matter what you happen to be grading it on. To me, that seems very bad and unnecessary.
Re: your first paragraph, I still feel like you are analyzing the grader-optimizer case from the perspective of the full human-AI system, and then analyzing the values-executor case from the perspective of just the AI system (or you are assuming that your AI has perfect values, in which case my critique is “why assume that”). If I instead analyze the values-executor case from the perspective of the full human-AI system, I can rewrite the first part to be about values-executors:
(Note I’m assuming that we didn’t successfully instill the values, just as you’re assuming that we didn’t successfully get a robust evaluator.)
(The “not just accidental” part might be different? I’m not quite sure what you mean. In both cases we would be trying to avoid the issue, and in both cases we’d expect the bad stuff to happen.)
Consider a grader-optimizer AI that is optimizing for diamond-evaluations. Let us say that the intent was for the diamond-evaluations to evaluate whether the plans produced diamond-value (i.e. produced real diamonds). Then I can straightforwardly rewrite your paragraph to be about the grader-optimizer:
Presumably you think something in the above paragraph is now false? Which part?
No, when everything goes according to plan, the grader is perfect and the agent cannot manipulate it.
When you relax the assumption of perfection far enough, then the grader-optimizer manipulates the grader, and the values-executor fights us for our resources to use towards its inhuman values.
But I don’t care about what the agent “cares” about in and of itself, I care about the actual outcomes in the world.
The evaluations themselves are about you, your wishes, whatever the evaluations are supposed to mean, the reflectively-correct way to do the evaluations. It is the alignment between evaluations and human values that ensures that the outcomes are good (just as for a values-executor, it is the alignment between agent values and human values that ensures that the outcomes are good).
Maybe a better prompt is: can you tell a story of failure for a grader-optimizer, for which I can’t produce an analogous story of failure for a values-executor? (With this prompt I’m trying to ban stories / arguments that talk about “trying” or “caring” unless they have actual bad outcomes.)
(For example, if the story is “Bob the AI grader-optimizer figured out how to hack his brain to make it feel like he had lots of diamonds”, my analogous story would be “Bob the AI values-executor was thinking about how his visual perception indicated diamonds when he was rewarded, leading to a shard that cared about having a visual perception of diamonds, which he later achieved by hacking his brain to make it feel like he had lots of diamonds”.)
It sounds like there’s a difference between what I am imagining and what you are, which is causing confusion in both directions. Maybe I should back up for a moment and try to explain the mental model I’ve been using in this thread, as carefully as I can? I think a lot of your questions are probably downstream of this. I can answer them directly afterwards, if you’d like me to, but I feel like doing that without clarifying this stuff first will make it harder to track down the root of the disagreement.
Long explanation below…
—
What I am most worried about is “What conditions is the agent’s decision-making function ultimately sensitive to, at the mechanistic level?” (i.e. what does the agent “care” about, what “really matters” to the agent)[1]. The reason to focus on those conditions is that they are the real determinants of the agent’s future choices, and thereby the determinants of the agent’s generalization properties. If a CoinRun agent has learned that what “really matters” to it is the location of the coin, if its understanding[2] of where the coin is is the crucial factor determining its actions, then we can expect it to still try to navigate towards the coin even when we change its location. But if a CoinRun agent has learned that what “really matters” is its distance to the right-side corner, if its understanding of how far it is from that corner is the crucial factor determining its actions, then we can expect it to no longer try to navigate towards the coin when we change the coin’s location. Since we can’t predict what decisions the agent will make in OOD contexts ahead of time if we only know its past in-distribution decisions, we have to actually look at how the agent makes decisions to make those predictions. We want the agent to be making the right decisions for the right reasons: that is the primary alignment hurdle, in my book.
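To make the CoinRun point concrete, here is a minimal toy sketch in Python. It is not the real CoinRun environment: the one-dimensional corridor, the two policies, and every name in it are hypothetical, chosen only to show two decision rules that behave identically when the coin sits at the right wall and diverge once the coin is moved.

```python
def coin_seeking_policy(agent_x, coin_x, width):
    """Hypothetical policy whose action hinges on where the coin is."""
    if coin_x > agent_x:
        return 1
    if coin_x < agent_x:
        return -1
    return 0


def right_seeking_policy(agent_x, coin_x, width):
    """Hypothetical policy whose action hinges only on distance to the right wall."""
    return 1 if agent_x < width - 1 else 0


def reaches_coin(policy, start_x, coin_x, width=10, max_steps=20):
    """Roll the policy out in a 1-D corridor and report whether it lands on the coin."""
    x = start_x
    for _ in range(max_steps):
        if x == coin_x:
            return True
        # Move one cell left/right (or stay), clamped to the corridor.
        x = min(max(x + policy(x, coin_x, width), 0), width - 1)
    return x == coin_x


if __name__ == "__main__":
    for policy in (coin_seeking_policy, right_seeking_policy):
        in_dist = reaches_coin(policy, start_x=0, coin_x=9)  # coin at the right wall
        ood = reaches_coin(policy, start_x=5, coin_x=2)      # coin moved to the left
        print(policy.__name__, "in-distribution:", in_dist, "OOD:", ood)
```

In this toy, the in-distribution rollouts cannot tell the two policies apart; only inspecting what each policy's action depends on predicts the OOD behavior.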
The defining feature of a grader-optimizing agent is that it ultimately “cares” about the grader’s evaluations: its understanding of what the grading function would output is the crucial factor determining its choices. We specify an evaluation method at the outset like “Charles’ judgment”, and then we magically [for the sake of argument] get an agent that makes decisions based on what it thinks the evaluation method would say (The agent constantly asks itself “How many diamonds would Charles think this leads to, if he were presented with it?”). When I was describing what would happen if we produced a grader-optimizing agent “according to plan”, I meant that conditional on us having chosen a target diamond-production evaluation method, the agent actually makes its decisions according to its understanding of what that evaluation method would output (rather than according to its understanding of what some other evaluation method would output, or according to its understanding of how many diamonds it thinks it will produce, or according to some completely different decision-factor). I think what you had in mind when I posed the hypothetical where everything was going “according to plan” was that in addition to this, we also managed to pick an evaluation method that is inexploitable. That is not what I had in mind, because I make no parallel inexploitability assumption in the values-executor case.
The defining feature of a values-executing agent is that it ultimately “cares” about value-relevant consequences (i.e. if it has a “diamond-production” value, that means it makes decisions by considering how those decisions will affect diamond production): its understanding of the value-relevant consequences of its choices is the crucial factor determining those choices. We specify a decision-factor at the outset like “diamonds produced”, and then we magically [for the sake of argument] get an agent that makes decisions based on that decision-factor (The agent constantly asks itself “Will this help me produce more diamonds?”). In this case, going “according to plan” would mean that conditional on us having chosen diamond-production as the target value, the agent actually makes its decisions based on its understanding of the consequences on diamond-production (rather than according to its understanding of the consequences on human flourishing, or according to its understanding of how a human would evaluate the decision, or according to some completely different decision-factor).
In the grader-optimizer case, there are two different things that have to go right from an alignment standpoint:
1. Find a “diamonds produced” grader that is actor-inexploitable.
2. Install the decision-factor “will the grader output max evaluations for this” into an actor. It is assumed that we somehow make sure that “the grader” and “maximizing the grader’s evaluation outputs” are concepts correctly formed within the actor’s world model.
In other words, alignment success with grader-optimization requires not just success at getting an actor that makes its decisions for the right reasons (reason = “because it thinks the grader evaluates X highly”), but additionally, it requires success at getting a grader that makes its decisions for the right reasons (reason = “because it thinks X leads to diamonds”) in a way that is robust to whatever the actor can imagine throwing at it.
In the values-executor case, there is a single thing that has to go right from an alignment standpoint:
Install the decision-factor “will this produce diamonds” into an actor. It is assumed that we somehow make sure that “diamond” and “producing diamonds” are concepts correctly formed within the actor’s world model.
In other words, alignment success with values-execution just requires success at getting an actor that makes its decisions for the right reasons (reason = “because it thinks X leads to diamonds”). There isn’t an analogous other step because there’s no indirection, no second program, no additional evaluation method for us to specify or to make inexploitable. We the designers don’t decide on some “perfect” algorithm that the actor must use to evaluate plans for diamond production, or some “perfect” algorithm for satisfying its diamond-production value; that isn’t part of the plan. In fact, we don’t directly specify any particular fixed procedure for doing evaluations. All we require is that from the actor’s perspective, “diamond production” must be the crucial decision-factor that all of its decisions hinge on in a positive way.
In the values-executor case, we want the actor itself to decide how to best achieve diamond-production. We want it to use its own capabilities to figure out how to examine plans, how to avoid silly mistakes, how to stop others from fooling it, how to improve its own evaluations etc. (If we have a grader-optimizer doing analogous things, it will be deciding how to best achieve maximum grader evaluations, not how to best achieve diamond-production.) The actor’s diamond-production value need not be inexploitable for this to happen. Even if we have correctly instilled our intended decision-factor, there can still be capability failures, where the actor is too dumb to make decisions that actually work to produce diamonds, or where an adversary tricks the actor into implementing a plan that the actor thinks will produce diamonds, but which will really produce cubic zirconia. (And analogously for grader-optimizers, I would state “
The requirement I am relying on is just that the actor’s choices are hinging on the right decision-factor, meaning that it is in fact trying to do the thing we intended it to. In the values-executor case, we can thus offload onto an intent-aligned actor the work of improving its diamond-production capability + staying aligned to diamond-production, without us needing to fulfill an inexploitability invariant anywhere. (And in the grader-optimizer case, we can offload onto an intent-aligned actor the work of improving its grader evaluation-maximization capability + staying aligned to maximizing diamond-production evaluations, without us needing to fulfill an inexploitability invariant anywhere. But note that these are actors intent-aligned to different things: one to producing diamonds, the other to producing evaluations of diamond-production. In order to make the two equal, the evaluation method in question must be inexploitable.)
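A minimal Python sketch of the two decision-factors, under strong simplifying assumptions: the `Plan` fields, the grader, and the plan list are all hypothetical, and both actors are assumed to predict consequences perfectly, so capability failures are deliberately left out. It only illustrates that the two actors choose the same plan exactly when the grader's beliefs cannot be pushed away from reality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    diamonds: int         # diamonds the plan would really produce
    grader_estimate: int  # diamonds the grader would *believe* it produces


def grader_evaluation(plan):
    """Hypothetical grader (e.g. "Charles' judgment"): it scores its own belief."""
    return plan.grader_estimate


def grader_optimizer_choice(plans):
    """Decision-factor: what would the grader output for this plan?"""
    return max(plans, key=grader_evaluation)


def values_executor_choice(plans):
    """Decision-factor: how many diamonds would this plan produce?"""
    return max(plans, key=lambda plan: plan.diamonds)


PLANS = [
    Plan("mine diamonds", diamonds=10, grader_estimate=10),
    Plan("show the grader a convincing fake", diamonds=0, grader_estimate=1000),
]

# Plans on which the grader's belief matches reality (the grader is not exploited).
UNEXPLOITED_PLANS = [p for p in PLANS if p.grader_estimate == p.diamonds]

if __name__ == "__main__":
    print("grader-optimizer picks:", grader_optimizer_choice(PLANS).name)
    print("values-executor picks: ", values_executor_choice(PLANS).name)
    print("same pick without exploits:",
          grader_optimizer_choice(UNEXPLOITED_PLANS) == values_executor_choice(UNEXPLOITED_PLANS))
```

The sketch says nothing about how hard it is to install either decision-factor; it only shows where the inexploitability requirement enters.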
If I try to anticipate the concern with the above, it would be with the part where I said
with the concern being that I am granting something special in the values-executor case that I am not granting in the grader-optimizer case. But notice that the grader-optimizer case has an analogous requirement, namely
In both cases, we need to entrain specific concepts into the actor’s world model. In both cases, there’s no requirement that the actor’s model of those concepts is inexploitable (i.e. that there’s no way to make the values-executor think it made a diamond when the values-executor really made a cubic zirconia / that there’s no way to make the grader-optimizer think the human grader gave them a high score when the grader-optimizer really got tricked by a DeepFake), just that they have the correct notion in their head. I don’t see any particular reason why “diamond” or “helping” or “producing paperclips” would be a harder concept to form in this way than the concept of “the grader”. IMO it seems like entraining a complex concept into the actor’s world model should be approximately a fixed cost, one which we need to pay in either case. And even if getting the actor to have a correctly-formed concept of “helping” is harder than getting the actor to have a correctly-formed concept of “the grader”, I feel quite strongly that that difficulty delta is far far smaller than the difficulty of finding an inexploitable grader.
On balance, then, I think the values-executor design seems a lot more promising.
[1] This ultimately cashes out to the sorts of definitions used in “Discovering Agents”, where the focus is on what factors the agent’s policy adapts to.
[2] Substitute the word “representation” or “prediction” for “understanding” if you like, in this comment.
Thanks for writing up the detailed model. I’m basically on the same page (and I think I already was previously) except for the part where you conclude that values-executors are a lot more promising.
Your argument has the following form:
(Here, A = “grader-optimizers”, B = “values-executors”. X = “it’s hard to instill the cognition you want” and Y = “the evaluator needs to be robust”.)
I agree entirely with this argument as I’ve phrased it here; I just disagree that this implies that values-executors are more promising.
I do agree that when you have a valid argument of the form above, it strongly suggests that approach B is better than approach A. It is a good heuristic. But it isn’t infallible, because it’s not obvious that “solve X, then use the approach” is the best plan to consider. My response takes the form:
(Here Z = “it’s hard to accurately evaluate plans produced by the model”.)
Having made these arguments, if the disagreement persists, I think you want to move away from discussing X, Y and Z abstractly, and instead talk about concrete implications that are different between the two situations. Unfortunately I’m in the position of claiming a negative (“there aren’t major alignment-relevant differences between A and B that we currently know of”).
I can still make a rough argument for it:
1. Solving X requires solving Z: If you don’t accurately evaluate plans produced by the model, then it is likely that you’ll positively reinforce thoughts / plans that are based on different values than the ones you wanted, and so you’ll fail to instill values. (That is, “fail to robustly evaluate plans → fail to instill values”, or equivalently, “instilling values requires robust evaluation of plans”, or equivalently, “solving X requires solving Z”.)
2. Solving Z implies solving Y: If you accurately evaluate plans produced by the model, then you have a robust evaluator.
Putting these together we get that solving X implies solving Y. This is definitely very loose and far from a formal argument giving a lot of confidence (for example, maybe an 80% solution to Z is good enough for dealing with X for approach B, but you need a 99+% solution to Z to deal with Y for approach A), but it is the basic reason why I’m skeptical of the “values-executors are more promising” takeaway.
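Stripped of the caveats, the skeleton of this rough argument is a two-step implication chain. A minimal Lean 4 sketch, assuming X, Y and Z are read as the propositions “problem X / Y / Z has been solved” (that reading, and the theorem and hypothesis names, are assumptions of this sketch); it does not capture the quantitative caveat about an 80% versus 99+% solution:

```lean
-- X, Y, Z : "problem X / Y / Z has been solved" (an assumed reading for this sketch).
theorem solving_X_implies_solving_Y (X Y Z : Prop)
    (x_requires_z : X → Z)  -- "Solving X requires solving Z"
    (z_implies_y : Z → Y)   -- "Solving Z implies solving Y"
    : X → Y :=
  fun hx => z_implies_y (x_requires_z hx)
```

The formal step is trivial; any disagreement has to be about whether the two premises actually hold.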
I don’t really expect you to be convinced (if I had to say why it would be “you trust much more in mechanistic models rather than abstract concepts and patterns extracted from mechanistic stories”). I’m not sure what else I can unilaterally do—since I’m making a negative claim I can’t just give examples. I can propose protocols that you can engage in to provide evidence for the negative claim:
1. You could propose a solution that solves X but doesn’t solve Y. That should either change my mind, or I should argue why your solution doesn’t solve X, or does solve Y (possibly with some “easy” conversion), or has some problem that makes it unsuitable as a plausible solution.
2. You could propose a failure story that involves Y but not X, and so only affects approach A and not approach B. That should either change my mind, or I should argue why actually there’s an analogous (similarly-likely) failure story that involves X and so affects approach B as well.
(These are very related—given a solution S that solves X but not Y from (1), the corresponding failure story for (2) is “we build an AI system using approach B with solution S, but it fails because of Y”.)
If you’re unable to do either of the above two things, I claim you should become more skeptical that you’ve carved reality at the joints, and more convinced that actually both X and Y are shadows of the deeper problem Z, and you should be analyzing Z rather than X or Y.
Ok I think we’re converging a bit here.
I agree that there’s a deeper problem Z that carves reality at its joints, where if you solved Z you could make safe versions of both agents that execute values and agents that pursue grades. I don’t think I would name “it’s hard to accurately evaluate plans produced by the model” as Z though, at least not centrally. In my mind, Z is something like cognitive interpretability/decoding inner thoughts/mechanistic explanation, i.e. “understanding the internal reasons underpinning the model’s decisions in a human-legible way”.
For values-executors, if we could do this, during training we could identify which thoughts our updates are reinforcing/suppressing and be selective about what cognition we’re building, which addresses your point 1. In that way, we could shape it into having the right values (making decisions downstream of the right reasons), even if the plans it’s capable of producing (motivated by those reasons) are themselves too complex for us to evaluate. Likewise, for grader-optimizers, if we could do this, during deployment we could identify why the actor thinks a plan would be highly grader-evaluated (is it just because it looked for and found an adversarial grader-input?) without necessarily needing to evaluate the plan ourselves.
In both cases, I think being able to do process-level analysis on thoughts is likely sufficient, without robustly object-level grading the plans that those thoughts lead to. To me, robust evaluation of the plans themselves seems kinda doomed for the usual reasons. Stuff like how plans are recursive/treelike, and how plans can delegate decisions to successors, and how if the agent is adversarially planning against you and sufficiently capable, you should expect it to win, even if you examine the plan yourself and can’t tell how it’ll win.
That all sounds right to me. So do you now agree that it’s not obvious whether values-executors are more promising than grader-optimizers?
Minor thing:
Jtbc, this would count as “accurately evaluating the plan” to me. I’m perfectly happy for our evaluations to take the form “well, we can see that the AI’s plan was made to achieve our goals in the normal way, so even though we don’t know the exact consequences we can be confident that they will be good”, if we do in fact get justified confidence in something like that. When I say we have to accurately evaluate plans, I just mean that our evaluations need to be correct; I don’t mean that they have to be based on a prediction of the consequences of the plan.
I do agree that cognitive interpretability/decoding inner thoughts/mechanistic explanation is a primary candidate for how we can successfully accurately evaluate plans.
Obvious? No. (It definitely wasn’t obvious to me!) It just seems more promising to me on balance given the considerations we’ve discussed.
If we had mastery over cognitive interpretability, building a grader-optimizer wouldn’t yield an agent that really stably pursues what we want. It would yield an agent that really wants to pursue grader evaluations, plus an external restraint to prevent the agent from deceiving us (“during deployment we could identify why the actor thinks a plan would be highly grader-evaluated”). Both of those are required at runtime in order to safely get useful work out of the system as a whole. The restraint is a critical point of failure which we are relying on ourselves/the operator to actively maintain. The agent under restraint doesn’t positively want that restraint to remain in place and not-fail; the agent isn’t directing its cognitive horsepower towards ensuring that its own thoughts are running along the tracks we intended it to. It’s safe but only in a contingent way that seems unstable to me, unnecessarily so.
If we had that level of mastery over cognitive interpretability, I don’t understand why we wouldn’t use that tech to directly shape the agent to want what we want it to want. And I think I’d say basically the same thing even at lesser levels of mastery over the tech.
Cool, yes I agree. When we need assurances about a particular plan that the agent has made, that seems like a good way to go. I also suspect that at a certain level of mechanistic understanding of how the agent’s cognition is developing over training & what motivations control its decision-making, it won’t be strictly required for us to continue evaluating individual plans. But that, I’m not too confident about.
Sure, that all sounds reasonable to me; I think we’ve basically converged.
The main reason is that we don’t know ourselves what we want it to want, and we would instead like it to follow some process that we like (e.g. just do some scientific innovation and nothing else, help us do better philosophy to figure out what we want, etc). This sort of stuff seems like a poor fit for values-executors. Probably there will be some third, totally different mental architecture for such tasks, but if you forced me to choose between values-executors or grader-optimizers, I’d currently go with grader-optimizers.