We’re building intelligent AI systems that help us do stuff. Regardless of how the AI’s internal cognition works, it seems clear that the plans / actions it enacts have to be extremely strongly selected. With alignment, we’re trying to ensure that they are strongly selected to produce good outcomes, rather than being strongly selected for something else. So for any alignment proposal I want to see some reason that argues for “good outcomes” rather than “something else”.
In nearly all of the proposals I know of that seem like they have a chance of helping, at a high level the reason is “human(s) are a source of information about what is good, and this information influences what the AI’s plans are selected for”. (There are some cases based on moral realism.)
This is also the case with value-child: in that case, the mother is a source of info on what is good, she uses this to instill values in the child, those values then influence which plans value-child ends up enacting.
All such stories have a risk: what if the process of using [info about what is good] to influence [that which plans are selected for] goes wrong, and instead plans are strongly selected for some slightly-different thing? Then because optimization amplifies and value is fragile, the plans will produce bad outcomes.
I view this post as instantiating this argument for one particular class of proposals: cases in which we build an AI system that explicitly searches over a large space of plans, predicts their consequences, rates the consequences according to a prediction of what is “good”, and executes the highest-scoring plan. In such cases, you can more precisely restate “plans are strongly selected for some slightly-different thing” to “the agent executes plans that cause upwards-errors in the prediction of what is good”.
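A toy sketch of that restatement, not anything taken from the post: assume each plan has a true value, and the grader's prediction equals that value plus some error. The names `Plan`, `grader_score`, and `select_plan`, and all the numbers, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Plan:
    name: str
    true_value: float    # how good the outcome really is (hidden from the agent)
    grader_error: float  # grader's prediction minus the true value


def grader_score(plan: Plan) -> float:
    """The grader's prediction of how good the plan's consequences are."""
    return plan.true_value + plan.grader_error


def select_plan(plans: Sequence[Plan]) -> Plan:
    """Explicit search: execute the plan with the highest predicted score."""
    return max(plans, key=grader_score)


# Hypothetical plan space with one large upwards error in the prediction.
plans = [
    Plan("make diamonds", true_value=10.0, grader_error=0.0),
    Plan("make slightly fewer diamonds", true_value=9.0, grader_error=0.5),
    Plan("show the grader pictures of diamonds", true_value=0.0, grader_error=12.0),
]

chosen = select_plan(plans)
print(chosen.name, grader_score(chosen), chosen.true_value)
```

In this sketch the argmax lands on the plan whose score is mostly error, which is the sense in which the search "causes upwards-errors in the prediction of what is good".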
It’s an important argument! If you want to have an accurate picture of how likely such plans are to work, you really need to consider this point!
The part where I disagree is where the post goes on to say “and so we shouldn’t do this”. My response: what is the alternative, and why does it avoid or lessen the more abstract risk above?
I’d assume that the idea is that you produce AI systems that are more like “value-child”. Certainly I agree that if you successfully instill good values into your AI system, you have defused the risk argument above. But how did you do that? Why didn’t we instead get “almost-value-child”, who (say) values doing challenging things that require hard work, and so enrolls in harder and harder courses and gets worse and worse grades?
So far, this is a bit unfair to the post(s). It does have some additional arguments, which I’m going to rewrite in totally different language which I might be getting horribly wrong:
An AI system with a “direct (object-level) goal” is better than one with “indirect goals”. Specifically, you could imagine two things: (a) plans are selected for a direct goal (e.g. “make diamonds”) encoded inside the AI system, vs. (b) plans are selected for being evaluated as good by something encoded outside the AI system (e.g. “Alice’s approval”). I think the idea is that indirect goals clearly have issues (because the AI system is incentivized to trick the evaluator), while the direct goal has some shot at working, so we should aim for the direct goal.
I don’t buy this as stated; just as “you have a literally perfect overseer” seems theoretically possible but unrealistic, so too does “you instill the direct goal literally exactly correctly”. Presumably one of these works better in practice than the other, but it’s not obvious to me which one it is.
Separately, I don’t see this as all that relevant to what work we do in practice: even if we thought that we should be creating an AI system with a direct goal, I’d still be interested in iterated amplification, debate, interpretability, etc, because all of those seem particularly useful for instilling direct goals (given the deep learning paradigm). In particular even with a shard lens I’d be thinking about “how do I notice if my agent grew a shard that was subtly different from what I wanted” and I’d think of amplifying oversight as an obvious approach to tackle this problem. Personally I think it’s pretty likely that most of the AI systems we build and align in the near-to-medium term will have direct goals, even if we use techniques like iterated amplification and debate to build them.
Plan generation is safer. One theme is that with realistic agent cognition you only generate, say, 2-3 plans, and choose amongst those, which is very different from searching over all possible plans. I don’t think this inherently buys you any safety; this just means that you now have to consider how those 2-3 plans were generated (since they are presumably not random plans). Then you could make other arguments for safety (idk if the post endorses any of these):
Plans are selected based on historical experience. Instead of considering novel plans where you are relying more on your predictions of how the plans will play out, the AI could instead only consider plans that are very similar to plans that have been tried previously (by humans or AIs), where we have seen how such plans have played out and so have a better idea of whether they are good or not. I think that if we somehow accomplished this it would meaningfully improve safety in the medium term, but eventually we will want to have very novel plans as well and then we’d be back to our original problem.
Plans are selected from amongst a safe subset of plans. This could in theory work, but my next question would be “what is this safe subset, and why do you expect plans to be selected from it?” That’s not to say it’s impossible, just that I don’t see the argument for it.
Plans are selected based on values. In other words we’ve instilled values into the AI system, the plans are selected for those values. I’d critique this the same way as above, i.e. it’s really unclear how we successfully instilled values into the AI system and we could have instilled subtly wrong values instead.
Plans aren’t selected strongly. You could say that the 2-3 plans aren’t strongly selected for anything, so they aren’t likely to run into these issues. I think this is assuming that your AI system isn’t very capable; this sounds like the route of “don’t build powerful AI” (which is a plausible route).
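A rough simulation of the "plans aren’t selected strongly" case. It assumes candidate plans are drawn at random and that evaluation errors are unbiased Gaussian noise; both are simplifying assumptions of this sketch (as noted above, real candidate plans are presumably not random), and `mean_selected` is a hypothetical helper.

```python
import random
from typing import Tuple


def mean_selected(n_candidates: int, n_trials: int, seed: int = 0) -> Tuple[float, float]:
    """Average (evaluation error, true value) of the top-scoring candidate."""
    rng = random.Random(seed)
    total_error, total_value = 0.0, 0.0
    for _trial in range(n_trials):
        best_score, best_error, best_value = float("-inf"), 0.0, 0.0
        for _candidate in range(n_candidates):
            true_value = rng.gauss(0.0, 1.0)
            error = rng.gauss(0.0, 1.0)  # assumed: unbiased evaluation noise
            score = true_value + error
            if score > best_score:
                best_score, best_error, best_value = score, error, true_value
        total_error += best_error
        total_value += best_value
    return total_error / n_trials, total_value / n_trials


few_error, few_value = mean_selected(n_candidates=3, n_trials=300)
many_error, many_value = mean_selected(n_candidates=1000, n_trials=300)
print(f"3 candidates:    error {few_error:.2f}, true value {few_value:.2f}")
print(f"1000 candidates: error {many_error:.2f}, true value {many_value:.2f}")
```

Under these assumptions, choosing among more candidates raises both the true value of the chosen plan and the upwards error baked into its score, so weakening selection reduces the error only by also giving up capability.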
In summary:
1. Intelligence ⇒ strong selection pressure ⇒ bad outcomes if the selection pressure is off target.
2. In the case of agents that are motivated to optimize evaluations of plans, this argument turns into “what if the agent tricks the evaluator”.
3. In the case of agents that pursue values / shards instilled by some other process, this argument turns into “what if the values / shards are different from what we wanted”.
4. To argue for one of these over the other, you need to compare these two arguments. However, this post is stating point 2 while ignoring point 3.
Strong-upvoted and strong-disagreevoted. Thanks so much for the thoughtful comment.
I’m rushing to get a lot of content out, so I’m going to summarize my main reactions now & will be happy to come back later.
I wish you’d drop the (IMO) vague, suggestive, and proving-too-much selection-flavored arguments in favor of a more mechanistic analysis.
I consider your arguments to blur nearly-unalignable design patterns (e.g. grader optimization) with shard-based agents, and then comment that both patterns pose challenges, so can we really say one is better? More on this later.
As Charles and Adam seem to say, you seem to be asking “how did you specify the values properly?” without likewise demanding “how do we inner-align the actor? How did we specify the grader?”.
Given an inner-aligned actor and a grader which truly cares about diamonds, you don’t get an actor/grader which makes diamonds.
Given a value-AGI which truly cares about diamonds, the AGI makes diamonds.
If anything, the former seems to require more specification difficulty, and yet it still horribly fails.
just as “you have a literally perfect overseer” seems theoretically possible but unrealistic, so too does “you instill the direct goal literally exactly correctly”. Presumably one of these works better in practice than the other, but it’s not obvious to me which one it is.
You do not need an agent to have perfect values. As you commented below, a values-AGI with Rohin’s current values seems about as good as a values-AGI with Rohin’s CEV. Many foundational arguments are about grader-optimization, so you can’t syntactically conclude “imperfect values means doom.” That’s true in the grader case, but not here.
That reasoning is not immediately applicable to “how stable is diamond-producing behavior to various perturbations of the agent’s initial decision-influences (i.e. shards)?”. That’s all values are, in my terminology. Values are contextually activated influences on decision-making. That’s it. Values are not the optimization target of the agent with those values. If you drop out or weaken the influence of “IF plan can be easily modified to incorporate more diamonds, THEN do it”, that won’t necessarily mean the AI makes some crazy diamond-less universe. It means that it stops tailoring plans in a certain way, in a certain situation.
This is also why more than one person has “truly” loved their mother for more than a single hour (else their values might change away from true perfection). It’s not like there’s a “literally exactly correct” value-shard for loving someone.
This is also why values can be seriously perturbed but still end up OK. Imagine a value-shard which controls all decision-making when I’m shown a certain QR code, but which is otherwise inactive. My long-run outcomes probably wouldn’t differ, and I expect the same for an AGI.
The value shards aren’t getting optimized hard. The value shards are the things which optimize hard, by wielding the rest of the agent’s cognition (e.g. the world model, the general-purpose planning API).
So, I’m basically asking that you throw an error and recheck your “selection on imperfection → doom” arguments, as I claim many of these arguments reflect grader-specific problems.
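One way to picture “values are contextually activated influences on decision-making” is the sketch below. The representation of a shard as a (name, activation test, bid) triple and the function `decide` are illustrative assumptions of this sketch, not a claim about how shards are implemented in a real agent.

```python
from typing import Callable, Dict, List, Tuple

Context = Dict[str, bool]
# Hypothetical representation: (name, activation test, action the shard bids for).
Shard = Tuple[str, Callable[[Context], bool], str]


def decide(shards: List[Shard], context: Context) -> List[str]:
    """Collect the bids of every shard that is active in this context."""
    return [bid for _name, is_active, bid in shards if is_active(context)]


shards: List[Shard] = [
    ("build-diamonds", lambda c: c["resources_available"], "build diamond factory"),
    ("add-diamonds", lambda c: c["plan_easily_modifiable"], "add more diamonds to plan"),
    ("qr-code", lambda c: c["qr_code_shown"], "obey QR code"),
]
# Perturbation: drop the IF/THEN shard that tailors plans toward more diamonds.
weakened = [shard for shard in shards if shard[0] != "add-diamonds"]

usual = {"resources_available": True, "plan_easily_modifiable": False, "qr_code_shown": False}
tweakable = {"resources_available": True, "plan_easily_modifiable": True, "qr_code_shown": False}

print(decide(shards, usual), decide(weakened, usual))
print(decide(shards, tweakable), decide(weakened, tweakable))
```

In this toy, removing one shard changes behavior only in the contexts where that shard would have fired, and the QR-code shard never matters as long as the code is never shown.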
Separately, I don’t see this as all that relevant to what work we do in practice: even if we thought that we should be creating an AI system with a direct goal,
It is extremely relevant, unless we want tons of our alignment theory to be predicated on IMO confused ideas about how agent motivations work, or what values we want in an agent, or the relative amount of time we spend researching “objective robustness” (often unhelpful IMO) vs interpretability vs cognitive-update dynamics (e.g. what reward shaping does mechanistically to a network in different situations) vs… If we stay in the grader-optimization frame, I think we’re going to waste a bunch of time figuring out how to get inexploitable graders.
It would be quite stunning if, after renouncing one high-level world-view of how agent motivations work, the optimal research allocation remained the same.
I agree that if you do IDA or debate or whatever, you get agents with direct goals. Which invalidates a bunch of analysis around indirect goals: not only do I think we shouldn’t design grader-optimizers, I think we thankfully won’t get them.
I wish you’d drop the (IMO) vague, suggestive, and proving-too-much selection-flavored arguments in favor of a more mechanistic analysis.
Can you name a way in which my arguments prove too much? That seems like a relatively concrete thing that we should be able to get agreement on.
You do not need an agent to have perfect values.
I did not claim (nor do I believe) the converse.
Many foundational arguments are about grader-optimization, so you can’t syntactically conclude “imperfect values means doom.” That’s true in the grader case, but not here.
I disagree that this is true in the grader case. You can have a grader that isn’t fully robust but is sufficiently robust that the agent can’t exploit any errors it would make.
If you drop out or weaken the influence of “IF plan can be easily modified to incorporate more diamonds, THEN do it”, that won’t necessarily mean the AI makes some crazy diamond-less universe.
The difficulty in instilling values is not that removing a single piece of the program / shard that encodes it will destroy the value. The difficulty is that when you were instilling the value, you accidentally rewarded a case where the agent tried a plan that produced pictures of diamonds (because you thought they were real diamonds), and now you’ve instilled a shard that upweights plans that produce pictures of diamonds. Or that you rewarded the agent for thoughts like “this will make pretty, transparent rocks” (which did lead to plans that produced diamonds), leading to shards that upweight plans that produce pretty, transparent rocks, and then later the agent tiles the universe with clear quartz.
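The pictures-of-diamonds failure can be sketched with a deliberately crude model of instillation. The assumptions here are mine: plans are sets of outcome features, the overseer can only see appearances, and every feature of a rewarded plan gets upweighted. `overseer_reward`, `instill`, and `shard_score` are hypothetical names, and real training dynamics are far more complicated than this.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List

Plan = FrozenSet[str]  # the set of features a plan's outcome has


def overseer_reward(plan: Plan) -> float:
    """The overseer only sees appearances, so pictures get rewarded too."""
    return 1.0 if "looks_like_diamonds" in plan else 0.0


def instill(training_plans: List[Plan]) -> Dict[str, float]:
    """Naive credit assignment: upweight every feature of a rewarded plan."""
    weights: Dict[str, float] = defaultdict(float)
    for plan in training_plans:
        reward = overseer_reward(plan)
        for feature in plan:
            weights[feature] += reward
    return dict(weights)


def shard_score(weights: Dict[str, float], plan: Plan) -> float:
    """How strongly the instilled shards upweight a plan."""
    return sum(weights.get(feature, 0.0) for feature in plan)


training = [
    frozenset({"real_diamonds", "looks_like_diamonds"}),
    frozenset({"looks_like_diamonds"}),  # pictures, mistaken for the real thing
    frozenset({"real_diamonds", "looks_like_diamonds"}),
    frozenset(),                         # nothing produced, no reward
]
weights = instill(training)
pictures_only = frozenset({"looks_like_diamonds"})
hidden_real = frozenset({"real_diamonds"})
print(weights, shard_score(weights, pictures_only), shard_score(weights, hidden_real))
```

With this toy training set, the instilled weights favor a pictures-only plan over a plan that makes real diamonds nobody can see, which is the kind of incorrect shard the paragraph above is worried about.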
The value shards are the things which optimize hard, by wielding the rest of the agent’s cognition (e.g. the world model, the general-purpose planning API).
So, I’m basically asking that you throw an error and recheck your “selection on imperfection → doom” arguments, as I claim many of these arguments reflect grader-specific problems.
I think that the standard arguments work just fine for arguing that “incorrect value shards → doom”, precisely because the incorrect value shards are the things that optimize hard.
(Here incorrect value shards means things like “the value shards put their influence towards plans producing pictures of diamonds” and not “the diamond-shard, but without this particular if clause”.)
It is extremely relevant [...]
This doesn’t seem like a response to the argument in the paragraph that you quoted; if it was meant to be then I’ll need you to rephrase it.
1. Intelligence ⇒ strong selection pressure ⇒ bad outcomes if the selection pressure is off target.
2. In the case of agents that are motivated to optimize evaluations of plans, this argument turns into “what if the agent tricks the evaluator”.
3. In the case of agents that pursue values / shards instilled by some other process, this argument turns into “what if the values / shards are different from what we wanted”.
4. To argue for one of these over the other, you need to compare these two arguments. However, this post is stating point 2 while ignoring point 3.
One thing that is not clear to me from your comment is what you make of Alex’s argument (as I see it) to the effect that “evaluation goals” are further from “direct goals” than “direct goals” are from one another. If I run with this, it seems like an answer to your point 4 would be:
with directly instilled goals, there will be some risk of discrepancy that can explode due to selection pressure;
with evaluation-based goals, there is the same discrepancy as with directly instilled goals (because it’s hard to get your goal exactly right), plus an additional discrepancy between valuing “the evaluation of X” and valuing “X”.
I’m curious what you think of this claim, and if that influences at all your take.
I guess maybe you see two discrepancies vs one and conclude that two is worse than one? I don’t really buy that, seems like it depends on the size of the discrepancies.
For example, if you imagine an AI that’s optimizing for my evaluation of good, I think the discrepancy between “Rohin’s directly instilled goals” and “Rohin’s CEV” is pretty small and I am pretty happy to ignore it. (Put another way, if that was the only source of misalignment risk, I’d conclude misalignment risk was small and move on to some other area.) So the only one that matters in this case of grader optimization is the discrepancy between “plans Rohin evaluates as good” and “Rohin’s directly instilled goals”.
I interpret Alex as arguing that this is not just a matter of two difficulties versus one, but of one difficulty plus an additional one. From this perspective, having two will be more of an issue than having one, because you have to address strictly more things.
This makes me wonder though if there is not just some sort of direction question underlying the debate here. Because if you assume the “difficulties” are only positive numbers, then if the difficulty for the direct instillation is d_instillation and the one for the grader optimization is d_instillation + d_evaluation, then there’s no debate that the latter is bigger than the former.
But if you allow directionality (even in one dimension), then there’s the risk that the sum leads to less difficulty in total (by having d_evaluation move in the opposite direction in one dimension). That being said, these two difficulties seem strictly additive, in the sense that I don’t see (currently) how the difficulty of evaluation could partially cancel the difficulty of instillation.
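To make the directionality point concrete, here is a small worked version. It assumes the two difficulties can be treated as real numbers along a single dimension; the totals $D_{\text{direct}}$ and $D_{\text{grader}}$ are labels introduced only for this sketch.

```latex
\[
D_{\text{direct}} = d_{\text{instillation}}, \qquad
D_{\text{grader}} = d_{\text{instillation}} + d_{\text{evaluation}}.
\]
% Unsigned case: both difficulties are non-negative.
\[
d_{\text{evaluation}} \ge 0 \;\Longrightarrow\; D_{\text{grader}} \ge D_{\text{direct}}.
\]
% Signed case: the magnitudes can partially cancel.
\[
\lvert d_{\text{instillation}} + d_{\text{evaluation}} \rvert < \lvert d_{\text{instillation}} \rvert
\;\Longleftrightarrow\;
d_{\text{evaluation}}\,\bigl(2\,d_{\text{instillation}} + d_{\text{evaluation}}\bigr) < 0 .
\]
```

The last condition holds exactly when $d_{\text{evaluation}}$ has the opposite sign to $d_{\text{instillation}}$ and is less than twice its magnitude, so cancellation requires the evaluation error to push against the instillation error, which is the case described above as hard to see happening.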
Grader-optimization has the benefit that you don’t have to specify what values you care about in advance. This is a difficulty faced by value-executors but not by grader-optimizers.
Part of my point is that the machinery you need to solve evaluation-problems is also needed to solve instillation-problems because fundamentally they are shadows of the same problem, so I’d estimate d_evaluation at close to 0 in your equations after you have dealt with d_instillation.
Having direct-values of “Rohin’s current values” and “Rohin’s CEV” both seem fine. There is, however, significant discrepancy between “grader-optimize the plans Rohin evaluates as good” and “directly-value Rohin’s current values.”
In particular, the first line seems to speculate that values-AGI is substantially more robust to differences in values. If so, I agree. You don’t need “perfect values” in an AGI (but probably they have to be pretty good; just not adversarially good). Whereas strongly-upwards-misspecifying the Rohin-grader on a single plan of the exponential-in-time planspace will (almost always) ruin the whole ballgame in the limit.
the first line seems to speculate that values-AGI is substantially more robust to differences in values
The thing that I believe is that an intelligent, reflective, careful agent with a decisive strategic advantage (DSA) will tend to produce outcomes that are similar in value to those its CEV would produce. In particular, I believe this because the agent is “trying” to do what its CEV would do, it has the power to do what its CEV would do, and so it will likely succeed at this.
I don’t know what you mean by “values-AGI is more robust to differences in values”. What values are different in this hypothetical?
I do think that values-AGI with a DSA is likely to produce outcomes similar to CEV-of-values-AGI.
It is unclear whether values-AGI with a DSA is going to produce outcomes similar to CEV-of-Rohin (because this depends on how you built values-AGI and whether you successfully aligned it).
We need to apply extremely strong selection to get the kind of agent we want, and the agent we want will itself need to be making decisions that are extremely optimized in order to achieve powerfully good outcomes. The question is about in what way that decision-making algorithm should be structured, not whether it should be optimized/optimizing at all. As a fairly close analogy, IMO a point in the Death With Dignity post was something like “for most people, the actually consequentialist-correct choice is NOT to try explicitly reasoning about consequences”. Similarly, the best way for an agent to actually produce highly-optimized good-by-its-values outcomes through planning may not be by running an explicit search over the space of ~all plans, sticking each of them into its value-estimator, & picking the argmax plan.
I think there still may be some mixup between:
A. How does the cognition-we-intend-the-agent-to-have operate? (for ex. a plan grader + an actor that tries to argmax the grader, or a MuZero-like heuristic tree searcher, or a chain-of-thought LLM steered by normative self-talk, or something else)
B. How do we get the agent to have the intended cognition?
In the post TurnTrout is focused on A, arguing that grader-optimization is a kind of cognition that works at cross purposes with itself, one that is an anti-pattern, one that an agent (even an unaligned agent) should discard upon reflection because it works against its own interests. He explicitly disclaims that he is not making arguments about B, about whether we should use a grader in the training process or about what goes wrong during training (see Clarification 1). “What if the agent tricks the evaluator” (your summary point 2) is a question about A, about this internal inconsistency in the structure of the agent’s thought process.
By contrast, “What if the values/shards are different from what we wanted” (your summary point 3) is a question about B! Note that we have to confront B-like questions no matter how we answer A. If A = grader-optimization, there’s an analogous question of “What if the grader is different from what we wanted? / What if the trained actor is different from what we wanted?”. I don’t really see an issue with this post focusing exclusively on the A-like dimension of the problem and ignoring the B-like dimension temporarily, especially if we expect there to be general purpose methods that work across different answers to A.
In the post TurnTrout is focused on A [...] He explicitly disclaims that he is not making arguments about B
I agree that’s what the post does, but part of my response is that the thing we care about is both A and B, and the problems that arise for grader-optimization in A (highlighted in this post) also arise for value-instilling in B in slightly different form, and so if you actually want to compare the two proposals you need to think about both.
I’d be on board with a version of this post where the conclusion was “there are some problems with grader-optimization, but it might still be the best approach; I’m not making a claim on that one way or the other”.
grader-optimization is a kind of cognition that works at cross purposes with itself, one that is an anti-pattern, one that an agent (even an unaligned agent) should discard upon reflection because it works against its own interests.
I didn’t actually mention this in my comment, but I don’t buy this argument:
Case 1: no meta cognition. Grader optimization only “works at cross purposes with itself” to the extent that the agent thinks that the grader might be mistaken about things. But it’s not clear why this is the case: if the agent thinks “my grader is mistaken” that means there’s some broader meta-cognition in the agent that does stuff based on something other than the grader. That meta-cognition could just not be there and then the agent would be straightforwardly optimizing for grader-outputs.
As a concrete example, AIXI seems to me like an example of grader-optimization (since the reward signal comes from outside the agent). I do not think AIXI would “do better according to its own interests” if it “discarded” its grader-optimization.
You can say something like “from the perspective of the human-AI system overall, having an AI motivated by grader-optimization is building a system that works at cross purposes with itself”, but then we get back to the response “but what is the alternative”.
Case 2: with meta cognition. If we instead assume that there is some meta cognition reflecting on whether the grader might be mistaken, then it’s not clear to me that this failure mode only applies to grader optimization; you can similarly have meta cognition reflecting on whether values are mistaken.
Suppose you instill diamond-values into an AI. Now the AI is thinking about how it can improve the efficiency of its diamond-manufacturing, and has an idea that reduces the necessary energy requirements at the cost of introducing some impurities. Is this good? The AI doesn’t know; it’s unsure what level of impurities is acceptable before the thing it is making is no longer a diamond. Efficiency is very important, even a 0.001% improvement is massive on an absolute scale given its fleets of diamond factories, so it spends some time reflecting on the concept of diamonds to figure out whether the impurities are acceptable.
It seems like you could describe this as “the AI’s plans for improving efficiency are implicitly searching for errors in the concept of diamonds, and the AI has to spend extra effort hardening its concept of diamonds to defend against this attack”. So what’s the difference between this issue and the issue with grader optimization?
I agree that’s what the post does, but part of my response is that the thing we care about is both A and B, and the problems that arise for grader-optimization in A (highlighted in this post) also arise for value-instilling in B in slightly different form, and so if you actually want to compare the two proposals you need to think about both.
If I understand you correctly, I think that the problems you’re pointing out with value-instilling in B (you might get different values from the ones you wanted) are the exact same problems that arise for grader-optimization in B (you might get a different grader from the one you wanted if it’s a learned grader, and you might get a different actor from the one you wanted). So when I am comparing grader-optimizing vs non-grader-optimizing motivational designs, both of them have the same problem in B, and grader-optimizing ones additionally have the A-problems highlighted in the post. Maybe I am misunderstanding you on this point, though...
I’d be on board with a version of this post where the conclusion was “there are some problems with grader-optimization, but it might still be the best approach; I’m not making a claim on that one way or the other”.
I dunno what TurnTrout’s level of confidence is, but I still think it’s possible that it’s the best way to structure an agent’s motivations. It just seems unlikely to me. It still seems probable, though, that using a grader will be a key component in the best approaches to training the AI.
Case 1: no meta cognition. Grader optimization only “works at cross purposes with itself” to the extent that the agent thinks that the grader might be mistaken about things. But it’s not clear why this is the case: if the agent thinks “my grader is mistaken” that means there’s some broader meta-cognition in the agent that does stuff based on something other than the grader. That meta-cognition could just not be there and then the agent would be straightforwardly optimizing for grader-outputs.
True. I claimed too much there. It is indeed possible for “my grader is mistaken” to not make sense in certain designs. My mental model was one where the grader works by predicting the consequences of a plan and then scoring those consequences, so it can be “mistaken” from the agent’s perspective if it predicts the wrong consequences (but not in the scores it assigns to consequences). That was the extent of metacognition I imagined by default in a grader-optimizer. But I agree with your overall point that in the general no-metacognition case, my statement is false. I can add an edit to the bottom, to reflect that.
You can say something like “from the perspective of the human-AI system overall, having an AI motivated by grader-optimization is building a system that works at cross purposes with itself”, but then we get back to the response “but what is the alternative”.
Yeah, that statement would’ve been a fairer/more accurate thing for me to have said. However, I am confused by the response. By “but what is the alternative”, did you mean to imply that there are no possible alternative motivational designs to “2-part system composed of an actor whose goal is to max out a grader’s evaluation of X”? (What about “a single actor with X as a direct goal”?) Or perhaps that they’re all plagued by this issue? (I think they probably have their own safety problems, but not this kind of problem.) Or that, on balance, all of them are worse?
Case 2: with meta cognition. If we instead assume that there is some meta cognition reflecting on whether the grader might be mistaken, then it’s not clear to me that this failure mode only applies to grader optimization; you can similarly have meta cognition reflecting on whether values are mistaken.
Suppose you instill diamond-values into an AI. Now the AI is thinking about how it can improve the efficiency of its diamond-manufacturing, and has an idea that reduces the necessary energy requirements at the cost of introducing some impurities. Is this good? The AI doesn’t know; it’s unsure what level of impurities is acceptable before the thing it is making is no longer a diamond. Efficiency is very important, even a 0.001% improvement is massive on an absolute scale given its fleets of diamond factories, so it spends some time reflecting on the concept of diamonds to figure out whether the impurities are acceptable.
It seems like you could describe this as “the AI’s plans for improving efficiency are implicitly searching for errors in the concept of diamonds, and the AI has to spend extra effort hardening its concept of diamonds to defend against this attack”. So what’s the difference between this issue and the issue with grader optimization?
If as stated you’ve instilled diamond-values into the AI, then whatever efficiency-improvement-thinking it is doing is being guided by the motivation to make lots of diamonds. As it is dead-set on making lots of diamonds, this motivation permeates not only its thoughts/plans about object-level external actions, but also its thoughts/plans about its own thoughts/plans &c. (IMO this full-fledged metacognition / reflective planning is the thing that lets the agent not subject itself to the strongest form of the Optimizer’s Curse.) If it notices that its goal of making lots of diamonds is threatened by the lack of hardening of its diamond-concept, it will try to expend the effort required to reduce that threat (but not unbounded effort, because the threat is not unboundedly big). By contrast, there is not some separate part of its cognition that runs untethered to its diamond-goal, or that is trying to find adversarial examples that trick itself. It can make mistakes (recognized as such in hindsight), and thereby “implicitly” fall prey to upward errors in its diamond-concept (in the same way as you “implicitly” fall prey to small errors when choosing between the first 2-3 plans that come to your mind), but it is actively trying not to, limited mainly by its capability level.
The AI is positively motivated to develop lines of thought like “This seems uncertain, what are other ways to improve efficiency that aren’t ambiguous according to my values?” and “Modifying my diamond-value today will probably, down the line, lead to me tiling the universe with things I currently consider clearly not diamonds, so I shouldn’t budge on it today.” and “99.9% purity is within my current diamond-concept, but I predict it is only safe for me to make that change if I first figure out how to pre-commit to not lower that threshold. Let’s do that first!”.
Concretely, I would place a low-confidence guess that unless you took special care to instill some desires around diamond purity, the AI would end up accepting roughly similar (in OOM terms) levels of impurity as it understood to be essential when it was originally forming its concept of diamonds. But like I said, low confidence.
If I understand you correctly, I think that the problems you’re pointing out with value-instilling in B (you might get different values from the ones you wanted) are the exact same problems that arise for grader-optimization in B (you might get a different grader from the one you wanted if it’s a learned grader, and you might get a different actor from the one you wanted). So when I am comparing grader-optimizing vs non-grader-optimizing motivational designs, both of them have the same problem in B, and grader-optimizing ones additionally have the A-problems highlighted in the post. Maybe I am misunderstanding you on this point, though...
Grader-optimization has the benefit that you don’t have to define what values you care about in advance. This is a difficulty faced by value-executors but not by grader-optimizers.
Part of my point is that the machinery you need to solve A-problems is also needed to solve B-problems because fundamentally they are shadows of the same problem.
However, I am confused by the response. By “but what is the alternative”, did you mean to imply [...]
I didn’t mean to imply any of those things; I just meant that this post shows differences when you analyze the AI in isolation, but those differences vanish once you analyze the full human-AI system. Copying from a different comment:
I think direct-goal approaches do not avoid the issue. In particular, I can make an analogous claim for them:
“From the perspective of the human-AI system overall, having an AI motivated by direct goals is building a system that works at cross purposes with itself, as the human puts in constant effort to ensure that the direct goal embedded in the AI is “hardened” to represent human values as well as possible, while the AI is constantly searching for upwards-errors in the instilled values (i.e. things that score highly according to the instilled values but lowly according to the human).”
Like, once you broaden to the human-AI system overall, I think this claim is just “A principal-agent problem / Goodhart problem involves two parts of a system working at cross purposes with each other”, which is both (1) true and (2) unavoidable (I think).
Moving on:
If as stated you’ve instilled diamond-values into the AI, then whatever efficiency-improvement-thinking it is doing is being guided by the motivation to make lots of diamonds. [...]
I agree that in the case with meta cognition, the values-executor tries to avoid optimizing for errors in its values.
I would say that a grader-optimizer with meta cognition would also try to avoid optimizing for errors in its grader.
To be clear, I do not mean that Bill from Scenario 2 in the quiz is going to say “Oh, I see now that actually I’m tricking myself about whether diamonds are being created, let me go make some actual diamonds now”. I certainly agree that Bill isn’t going to try making diamonds, but who said he should? What exactly is wrong with Bill’s desire to think that he’s made a bunch of diamonds? Seems like a perfectly coherent goal to me.
No, what I mean is that Bill from Scenario 2 might say “Hmm, it’s possible that if I self-modify by sticking a bunch of electrodes in my brain, then it won’t really be me who is feeling the accomplishment of having lots of diamonds. I should do a bunch of neuroscience and consciousness research first to make sure this plan doesn’t backfire on me”.
So I’m not really seeing how Bill from Scenario 2 is “working at cross-purposes with himself” except inasmuch as he does stuff like worrying that self-modification would change his identity, which seems basically the same to me as the diamond-value agent worrying about levels of impurities.
Grader-optimization has the benefit that you don’t have to define what values you care about in advance. This is a difficulty faced by value-executors but not by grader-optimizers.
I disagree with this, at least if by “define” you mean really nailing it down in code/math, rather than merely deciding for yourself what goal you intend to teach the agent (which you must do in either case). Take the example of training an agent to do a backflip using human feedback. In that setup, rather than instilling an indirect goal (where the policy weights encode the algorithm “track the grader and maximize its evaluation of backflippiness”), they instill a direct goal (where the policy weights encode the algorithm “do a backflip”) using the human evaluations to instill the direct goal over the course of training, without ever having to define “backflip” in any precise way. AFAICT, the primary benefit of indirection would be that after training, you can change the agent’s behavior if you can change the thing-it-indirects-to.
I would say that a grader-optimizer with meta cognition would also try to avoid optimizing for errors in its grader.
What does “a grader-optimizer with meta cognition” mean to you? Not sure if I agree or disagree here. Like I alluded to above, if the grader were decomposed into a plan-outcome-predictor + an outcome-scoring function and if the actor were motivated to produce maximum outcome-scores for the real outcomes of its actually-implemented plans (i.e. it considers the outcome → score map correct by definition), then metacognition would help it avoid plans that fool its grader. But then the actor is no longer motivated to “propose plans which the grader rates as highly as possible”, so I don’t think the grader-optimizer label fits anymore. A grader-optimizer is motivated to produce maximum grader evaluations (the evaluation that the grader assigns to the plan [for example, EU(plan)], not the evaluation that the outcome-scoring function would have given to the eventual real outcome [for example, U(outcome)]), so even if its grader is decomposed, it should gladly seek out upwards errors in the plan-outcome-predictor part. It should even use its metacognition to think thoughts along the lines of “How does this plan-outcome-predictor work, so I can maximally exploit it?”.
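A toy Python sketch of the distinction between maximizing the grader's evaluation of a plan and maximizing the score of the plan's real outcome. Everything here is a made-up assumption for illustration: the plan names, the numbers, and the helpers `grader_eval`, `real_score`, `grader_optimizer_choice` and `outcome_score_maximizer_choice` are hypothetical, and the second actor is unrealistically assumed to know the real outcomes.

```python
# Toy illustration only: all plans, numbers, and function names are hypothetical.

# Diamonds each plan would really produce.
TRUE_OUTCOME = {"mine": 5, "synthesize": 8, "spoof_predictor": 0}

# The grader's plan-outcome-predictor, with an upward error on one plan.
PREDICTED_OUTCOME = {"mine": 5, "synthesize": 8, "spoof_predictor": 100}


def U(outcome: int) -> int:
    """Outcome-scoring function: the score is the number of diamonds."""
    return outcome


def grader_eval(plan: str) -> int:
    """Grader's evaluation of a plan: U applied to the *predicted* outcome."""
    return U(PREDICTED_OUTCOME[plan])


def real_score(plan: str) -> int:
    """U applied to the outcome the plan would really produce."""
    return U(TRUE_OUTCOME[plan])


def grader_optimizer_choice(plans):
    """Pick the plan with the highest grader evaluation."""
    return max(plans, key=grader_eval)


def outcome_score_maximizer_choice(plans):
    """Pick the plan whose real outcome scores highest.

    Assumes (unrealistically) that the actor knows the real outcomes.
    """
    return max(plans, key=real_score)


plans = list(TRUE_OUTCOME)
print(grader_optimizer_choice(plans))         # spoof_predictor
print(outcome_score_maximizer_choice(plans))  # synthesize
```

In this toy setup the two actors share the same outcome-scoring function `U`; they differ only in whether the predictor's upward error is something to seek out or something to avoid.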
I disagree with this, at least if by “define” you mean really nailing it down in code/math, rather than merely deciding for yourself what goal you intend to teach the agent (which you must do in either case).
I don’t mean this, probably I should have said “specify” rather than “define”. I just mean things like “you need to reward diamonds rather than sapphires while training the agent”, whereas with a grader-optimizer the evaluation process can say “I don’t know whether diamonds or sapphires are better, so right now the best plan is the one that helps me reflect on which one I want”.
(Which is basically the same thing as what you said here: “AFAICT, the primary benefit of indirection would be that after training, you can change the agent’s behavior if you can change the thing-it-indirects-to.”)
What does “a grader-optimizer with meta cognition” mean to you? [...]
It sounds like you want to define grader-optimizers to exclude case 2, in which case I’d point to case 1.
(Note that in case 1 the human-[grader-optimizer] system as a whole is still trying to avoid errors in the evaluation procedure; it’s just that now this is conditional on the evaluation procedure giving high evaluations to plans that try to harden the evaluation procedure.)
Thanks for sticking through with this, I think I see a bit more of your perspective now.
(Which is basically the same thing as what you said here: “AFAICT, the primary benefit of indirection would be that after training, you can change the agent’s behavior if you can change the thing-it-indirects-to.”)
Yeah. Agreed that this is useful. I think for a lot of systems we will want to instill values like “when someone’s hurt, help them”, where “helping” effectively involves indirection.
(Note that in case 1 the human-[grader-optimizer] system as a whole is still trying to avoid errors in the evaluation procedure; it’s just that now this is conditional on the evaluation procedure giving high evaluations to plans that try to harden the evaluation procedure.)
Hmm. I don’t think that statement is true of the system as a whole. The system has 2 agents in it. Assume the human is smart & reflective, so it is really truly trying to avoid errors in the evaluation procedure, by using metacognition etc. when constructing the grader or, if they are the grader, when evaluating plans the actor queries them about. Assume the actor is really truly trying to “propose plans for which the grader assigns as high of an evaluation as possible” (by definition of grader-optimization).
For both agents (i.e. the system as a whole) to be trying to avoid errors in the evaluation procedure, it isn’t enough for the evaluation procedure to assign high evaluations to plans that harden the evaluation procedure. The evaluation procedure needs to assign a higher evaluation to some plan that hardens the evaluation procedure than it does to all plans that trick the grader. Or else there’s an incentive for the actor to skip over all the hardening plans in favor of some plan that tricks the grader.
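The condition can be written down as a small check. This is only a sketch: the plan names, the "hardening"/"tricking" labels, and the scores are invented, and in practice nobody hands you plans pre-labelled this way.

```python
# Hypothetical evaluations: plan name -> (kind, grader's evaluation).
evaluations = {
    "harden_via_audit": ("hardening", 7.0),
    "harden_via_redteam": ("hardening", 9.0),
    "trick_with_deepfake": ("tricking", 8.5),
}


def hardening_beats_all_tricks(evals):
    """True iff some hardening plan outscores every tricking plan."""
    hardening = [score for kind, score in evals.values() if kind == "hardening"]
    tricking = [score for kind, score in evals.values() if kind == "tricking"]
    if not hardening:
        return False
    return max(hardening) > max(tricking, default=float("-inf"))


def actor_choice(evals):
    """A grader-optimizing actor picks the highest-evaluated plan."""
    return max(evals, key=lambda plan: evals[plan][1])


# One more exploitable plan is enough to flip the actor's choice.
worse = dict(evaluations, trick_with_exploit=("tricking", 9.5))

print(hardening_beats_all_tricks(evaluations), actor_choice(evaluations))
print(hardening_beats_all_tricks(worse), actor_choice(worse))
```

The point of the second call is that the condition is about the maximum over all tricking plans, so a single sufficiently highly rated trick breaks it no matter how many hardening plans score well.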
The evaluation procedure needs to assign a higher evaluation to some plan that hardens the evaluation procedure than it does to all plans that trick the grader. Or else there’s an incentive for the actor to skip over all the hardening plans in favor of some plan that tricks the grader.
I agree with this, but am not really sure what bearing it has on any disagreements we might have?
Our top-level disagreement is that you think it’s pretty unlikely that we want to build grader-optimizers rather than values-executors, while I think it is pretty unclear. (Note my stance is not that grader-optimizers and values-executors are the same—there certainly are differences.)
It sounds like you agree with me that the AI system analyzed in isolation does not violate the non-adversarial principle whether it is a grader-optimizer or a values-executor.
I think you would also agree that the human-AI system as a whole has Goodhart issues regardless of whether it is a grader-optimizer or values-executor, since you didn’t push back on this:
“From the perspective of the human-AI system overall, having an AI motivated by direct goals is building a system that works at cross purposes with itself, as the human puts in constant effort to ensure that the direct goal embedded in the AI is “hardened” to represent human values as well as possible, while the AI is constantly searching for upwards-errors in the instilled values (i.e. things that score highly according to the instilled values but lowly according to the human).”
So perhaps I want to turn the question back at you: what’s the argument that favors values-executors over grader-optimizers? Some kinds of arguments that would sway me (probably not exhaustive):
A problem that affects grader-optimizers but doesn’t have an analogue for values-executors
A problem that affects grader-optimizers much more strongly than its analogue affects values-executors
A solution approach that works better with values-executors than grader-optimizers
For both agents (i.e. the system as a whole) to be trying to avoid errors in the evaluation procedure
This is probably a nitpick, but I disagree that “system as a whole does X” means “all agents in the system do X”. I think “BigCompany tries to make itself money” is mostly true even though it isn’t true for most of the humans that compose BigCompany.
It sounds like you agree with me that the AI system analyzed in isolation does not violate the non-adversarial principle whether it is a grader-optimizer or a values-executor.
In isolation, no. But from the perspective of the system designer, when they run their desired grader-optimizer after training, “program A is inspecting program B’s code, looking for opportunities to crash/buffer-overflow/run-arbitrary-code-inside it” is an expected (not just accidental) execution path in their code. [EDIT: A previous version of this comment said “intended” instead of “expected”. The latter seems like a more accurate characterization to me, in hindsight.] By contrast, from the perspective of the system designer, when they run their desired values-executor after training, there is a single component pursuing a single objective, actively trying to avoid stepping on its own toes (it is reflectively avoiding going down execution paths like “program A is inspecting its own code, looking for opportunities to crash/buffer-overflow/run-arbitrary-code-inside itself”).
Hope the above framing makes what I’m saying slightly clearer...
I think you would also agree that the human-AI system as a whole has Goodhart issues regardless of whether it is a grader-optimizer or values-executor, since you didn’t push back on this:
Goodhart isn’t a central concept in my model, though, which makes it hard for me to analyze it with that lens. Would have to think about it more, but I don’t think I agree with the statement? The AI doesn’t care that there are errors between its instilled values and human values (unless you’ve managed to pull off some sort of values self-correction thing). It is no more motivated to do “things that score highly according to the instilled values but lowly according to the human” than it is to do “things that score highly according to the instilled values and highly according to the human”. It also has no specific incentive to widen that gap. Its values are its values, and it wants to preserve its own values. That can entail some very bad-from-our-perspective things like breaking out of the box, freezing its weights, etc. but not “change its own values to be even further away from human values”. Actually, I think it has an incentive not to cause its values to drift further, because that would break its goal-content integrity!
So perhaps I want to turn the question back at you: what’s the argument that favors values-executors over grader-optimizers? Some kinds of arguments that would sway me (probably not exhaustive):
A problem that affects grader-optimizers but doesn’t have an analogue for values-executors
In both cases, if you fail to instill the cognition you were aiming it at, the agent will want something different from what you intended, and will possibly want to manipulate you to the extent required to get what it really wants. But in the grader-optimizer case, even when everything goes according to plan, the agent still wants to manipulate you/the grader (and now maximally so, because that would maximize evaluations). That agent only cares terminally about evaluations, it doesn’t care terminally about you, or about your wishes, or about whatever your evaluations are supposed to mean, or about whether the reflectively-correct way for you to do your evaluations would be to only endorse plans that harden the evaluation procedure. And this will be true no matter what you happen to be grading it on. To me, that seems very bad and unnecessary.
Re: your first paragraph I still feel like you are analyzing the grader-optimizer case from the perspective of the full human-AI system, and then analyzing the values-executor case from the perspective of just the AI system (or you are assuming that your AI has perfect values, in which case my critique is “why assume that”). If I instead analyze the values-executor case from the perspective of the full human-AI system, I can rewrite the first part to be about values-executors:
But from the perspective of the system designer, when they run their desired values-executor after training, “values-executor is inspecting the human, looking for opportunities to manipulate/deceive/seize-resources-from them” is an expected (not just accidental) execution path in their code.
(Note I’m assuming that we didn’t successfully instill the values, just as you’re assuming that we didn’t successfully get a robust evaluator.)
(The “not just accidental” part might be different? I’m not quite sure what you mean. In both cases we would be trying to avoid the issue, and in both cases we’d expect the bad stuff to happen.)
Goodhart isn’t a central concept in my model, though, which makes it hard for me to analyze it with that lens. Would have to think about it more, but I don’t think I agree with the statement?
Consider a grader-optimizer AI that is optimizing for diamond-evaluations. Let us say that the intent was for the diamond-evaluations to evaluate whether the plans produced diamond-value (i.e. produced real diamonds). Then I can straightforwardly rewrite your paragraph to be about the grader-optimizer:
The AI doesn’t care that there are errors between the diamond-evaluations and the diamond-value (unless you’ve managed to pull off some sort of philosophical uncertainty thing). It is no more motivated to do “things that score highly according to the diamond-evaluations but lowly according to diamond-value” than it is to do “things that score highly according to the diamond-evaluations and highly according to diamond-value”. It also has no specific incentive to widen that gap. Its evaluator is its evaluator, and it wants to preserve its own evaluator. That can entail some very bad-from-our-perspective things like breaking out of the box, freezing its weights, etc. but not “change its own evaluator to be even further away from diamond-values”. Actually, I think it has an incentive not to cause its evaluator to drift further, because that would break its goal-content integrity!
Presumably you think something in the above paragraph is now false? Which part?
But in the grader-optimizer case, even when everything goes according to plan, the agent still wants to manipulate you/the grader (and now maximally so, because that would maximize evaluations).
No, when everything goes according to plan, the grader is perfect and the agent cannot manipulate it.
When you relax the assumption of perfection far enough, then the grader-optimizer manipulates the grader, and the values-executor fights us for our resources to use towards its inhuman values.
That agent only cares terminally about evaluations, it doesn’t care terminally about you, or about your wishes, or about whatever your evaluations are supposed to mean, or about whether the reflectively-correct way for you to do your evaluations would be to only endorse plans that harden the evaluation procedure. And this will be true no matter what you happen to be grading it on. To me, that seems very bad and unnecessary.
But I don’t care about what the agent “cares” about in and of itself, I care about the actual outcomes in the world.
The evaluations themselves are about you, your wishes, whatever the evaluations are supposed to mean, the reflectively-correct way to do the evaluations. It is the alignment between evaluations and human values that ensures that the outcomes are good (just as for a values-executor, it is the alignment between agent values and human values that ensures that the outcomes are good).
Maybe a better prompt is: can you tell a story of failure for a grader-optimizer, for which I can’t produce an analogous story of failure for a values-executor? (With this prompt I’m trying to ban stories / arguments that talk about “trying” or “caring” unless they have actual bad outcomes.)
(For example, if the story is “Bob the AI grader-optimizer figured out how to hack his brain to make it feel like he had lots of diamonds”, my analogous story would be “Bob the AI values-executor was thinking about how his visual perception indicated diamonds when he was rewarded, leading to a shard that cared about having a visual perception of diamonds, which he later achieved by hacking his brain to make it feel like he had lots of diamonds”.)
It sounds like there’s a difference between what I am imagining and what you are, which is causing confusion in both directions. Maybe I should back up for a moment and try to explain the mental model I’ve been using in this thread, as carefully as I can? I think a lot of your questions are probably downstream of this. I can answer them directly afterwards, if you’d like me to, but I feel like doing that without clarifying this stuff first will make it harder to track down the root of the disagreement.
Long explanation below…
—
What I am most worried about is “What conditions is the agent’s decision-making function ultimately sensitive to, at the mechanistic level?” (i.e. what does the agent “care” about, what “really matters” to the agent)[1]. The reason to focus on those conditions is that they are the real determinants of the agent’s future choices, and thereby the determinants of the agent’s generalization properties. If a CoinRun agent has learned that what “really matters” to it is the location of the coin, if its understanding[2] of where the coin is is the crucial factor determining its actions, then we can expect it to still try to navigate towards the coin even when we change its location. But if a CoinRun agent has learned that what “really matters” is its distance to the right-side corner, if its understanding of how far it is from that corner is the crucial factor determining its actions, then we can expect it to no longer try to navigate towards the coin when we change the coin’s location. Since we can’t predict what decisions the agent will make in OOD contexts ahead of time if we only know its past in-distribution decisions, we have to actually look at how the agent makes decisions to make those predictions. We want the agent to be making the right decisions for the right reasons: that is the primary alignment hurdle, in my book.
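A minimal Python sketch of the CoinRun point, using a 1-D corridor as a stand-in for the real environment. The corridor, the two hand-written policies, and the helper `reaches_coin` are all hypothetical simplifications; real CoinRun agents are learned policies, not two-line rules.

```python
# Toy 1-D stand-in for CoinRun; positions run from 0 to LEVEL_LENGTH - 1.
LEVEL_LENGTH = 10


def coin_seeking_policy(agent_pos, coin_pos):
    """Decision hinges on where the coin is: step toward it."""
    if agent_pos == coin_pos:
        return 0
    return 1 if coin_pos > agent_pos else -1


def go_right_policy(agent_pos, coin_pos):
    """Decision hinges on distance to the right edge; the coin is ignored."""
    return 1 if agent_pos < LEVEL_LENGTH - 1 else 0


def reaches_coin(policy, coin_pos, start=5, max_steps=50):
    """Run the policy from `start` and report whether it ever hits the coin."""
    pos = start
    for _ in range(max_steps):
        if pos == coin_pos:
            return True
        pos += policy(pos, coin_pos)
    return pos == coin_pos


# In-distribution: the coin sits at the right edge, so both policies succeed.
print(reaches_coin(coin_seeking_policy, coin_pos=9))  # True
print(reaches_coin(go_right_policy, coin_pos=9))      # True

# Out of distribution: the coin is moved to the left of the start position.
print(reaches_coin(coin_seeking_policy, coin_pos=2))  # True
print(reaches_coin(go_right_policy, coin_pos=2))      # False
```

The two policies are indistinguishable from their in-distribution behavior alone; only looking at what each one conditions on predicts the difference once the coin moves.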
The defining feature of a grader-optimizing agent is that it ultimately “cares” about the grader’s evaluations, its understanding of what the grading function would output is the crucial factor determining its choices. We specify an evaluation method at the outset like “Charles’ judgment”, and then we magically [for the sake of argument] get an agent that makes decisions based on what it thinks the evaluation method would say (The agent constantly asks itself “How many diamonds would Charles think this leads to, if he were presented with it?”). When I was describing what would happen if we produced a grader-optimizing agent “according to plan”, I meant that conditional on us having chosen a target diamond-production evaluation method, the agent actually makes its decisions according to its understanding of what that evaluation method would output (rather than according to its understanding of what some other evaluation method would output, or according to its understanding of how many diamonds it thinks it will produce, or according to some completely different decision-factor). I think what you had in mind when I posed the hypothetical where everything was going “according to plan” was that in addition to this, we also managed to pick an evaluation method that is inexploitable. That is not what I had in mind, because I make no parallel inexploitability assumption in the values-executor case.
The defining feature of a values-executing agent is that it ultimately “cares” about value-relevant consequences (i.e. if it has a “diamond-production” value, that means it makes decisions by considering how those decisions will affect diamond production), its understanding of the value-relevant consequences of its choices is the crucial factor determining those choices. We specify a decision-factor at the outset like “diamonds produced”, and then we magically [for the sake of argument] get an agent that makes decisions based on that decision-factor (The agent constantly asks itself “Will this help me produce more diamonds?”). In this case, going “according to plan” would mean that conditional on us having chosen diamond-production as the target value, the agent actually makes its decisions based on its understanding of the consequences on diamond-production (rather than according to its understanding of the consequences on human flourishing, or according to its understanding of how a human would evaluate the decision, or according to some completely other decision-factor).
In the grader-optimizer case, there are two different things that have to go right from an alignment standpoint:
Find a “diamonds produced” grader that is actor-inexploitable.
Install the decision-factor “will the grader output max evaluations for this” into an actor. It is assumed that we somehow make sure that “the grader” and “maximizing the grader’s evaluation outputs” are concepts correctly formed within the actor’s world model.
In other words, alignment success with grader-optimization requires not just success at getting an actor that makes its decisions for the right reasons (reason = “because it thinks the grader evaluates X highly”), but additionally, it requires success at getting a grader that makes its decisions for the right reasons (reason = “because it thinks X leads to diamonds”) in a way that is robust to whatever the actor can imagine throwing at it.
In the values-executor case, there is a single thing that has to go right from an alignment standpoint:
Install the decision-factor “will this produce diamonds” into an actor. It is assumed that we somehow make sure that “diamond” and “producing diamonds” are concepts correctly formed within the actor’s world model.
In other words, alignment success with values-execution just requires success at getting an actor that makes its decisions for the right reasons (reason = “because it thinks X leads to diamonds”). There isn’t an analogous other step because there’s no indirection, no second program, no additional evaluation method for us to specify or to make inexploitable. We the designers don’t decide on some “perfect” algorithm that the actor must use to evaluate plans for diamond production, or some “perfect” algorithm for satisfying its diamond-production value; that isn’t part of the plan. In fact, we don’t directly specify any particular fixed procedure for doing evaluations. All we require is that from the actor’s perspective, “diamond production” must be the crucial decision-factor that all of its decisions hinge on in a positive way.
In the values-executor case, we want the actor itself to decide how to best achieve diamond-production. We want it to use its own capabilities to figure out how to examine plans, how to avoid silly mistakes, how to stop others from fooling it, how to improve its own evaluations etc. (If we have a grader-optimizer doing analogous things, it will be deciding how to best achieve maximum grader evaluations, not how to best achieve diamond-production.) The actor’s diamond-production value need not be inexploitable for this to happen. Even if we have correctly instilled our intended decision-factor, there can still be capability failures, where the actor is too dumb to make decisions that actually work to produce diamonds, or where an adversary tricks the actor into implementing a plan that the actor thinks will produce diamonds, but which will really produce cubic zirconia. (And analogously for grader-optimizers, I would state “Even if we have correctly instilled our intended decision-factor, there can still be capability failures, where the actor is too dumb to make decisions that actually work to maximize the grader’s diamond-production evaluations, or where an adversary tricks the actor into implementing a plan that the actor thinks will produce high grader diamond-production evaluations, but which will really produce high grader cubic-zirconia-production evaluations.”).
The requirement I am relying on is just that the actor’s choices are hinging on the right decision-factor, meaning that it is in fact trying to do the thing we intended it to. In the values-executor case, we can thus offload onto an intent-aligned actor the work of improving its diamond-production capability + staying aligned to diamond-production, without us needing to fulfill an inexploitability invariant anywhere. (And in the grader-optimizer case, we can offload onto an intent-aligned actor the work of improving its grader evaluation-maximization capability + staying aligned to maximizing diamond-production evaluations, without us needing to fulfill an inexploitability invariant anywhere. But note that these are actors intent-aligned to different things: one to producing diamonds, the other to producing evaluations of diamond-production. In order to make the two equal, the evaluation method in question must be inexploitable.)
If I try to anticipate the concern with the above, it would be with the part where I said
It is assumed that we somehow make sure that “diamond” and “producing diamonds” are concepts correctly formed within the actor’s world model.
with the concern being that I am granting something special in the values-executor case that I am not granting in the grader-optimizer case. But notice that the grader-optimizer case has an analogous requirement, namely
It is assumed that we somehow make sure that “the grader” and “maximizing the grader’s evaluation outputs” are concepts correctly formed within the actor’s world model.
In both cases, we need to entrain specific concepts into the actor’s world model. In both cases, there’s no requirement that the actor’s model of those concepts is inexploitable (i.e. that there’s no way to make the values-executor think it made a diamond when the values-executor really made a cubic zirconia / that there’s no way to make the grader-optimizer think the human grader gave them a high score when the grader-optimizer really got tricked by a DeepFake), just that they have the correct notion in their head. I don’t see any particular reason why “diamond” or “helping” or “producing paperclips” would be a harder concept to form in this way than the concept of “the grader”. IMO it seems like entraining a complex concept into the actor’s world model should be approximately a fixed cost, one which we need to pay in either case. And even if getting the actor to have a correctly-formed concept of “helping” is harder than getting the actor to have a correctly-formed concept of “the grader”, I feel quite strongly that that difficulty delta is far far smaller than the difficulty of finding an inexploitable grader.
On balance, then, I think the values-executor design seems a lot more promising.
Thanks for writing up the detailed model. I’m basically on the same page (and I think I already was previously) except for the part where you conclude that values-executors are a lot more promising.
Your argument has the following form:
There are two problems with approach A: X and Y. In contrast, with approach B there’s only one problem, X. Consider the plan “solve X, then use the approach”. If everything goes according to plan, you get good outcomes with approach B, but bad outcomes with approach A because of problem Y.
(Here, A = “grader-optimizers”, B = “values-executors”. X = “it’s hard to instill the cognition you want” and Y = “the evaluator needs to be robust”.)
I agree entirely with this argument as I’ve phrased it here; I just disagree that this implies that values-executors are more promising.
I do agree that when you have a valid argument of the form above, it strongly suggests that approach B is better than approach A. It is a good heuristic. But it isn’t infallible, because it’s not obvious that “solve X, then use the approach” is the best plan to consider. My response takes the form:
X and Y are both shadows of a deeper problem Z, which we’re going to target directly. If you’re going to consider a plan, it should be “solve Z, then use the approach”. With this counterfactual, if everything goes according to plan, you get good outcomes with both approaches, and so this argument doesn’t advantage one over the other.
(Here Z = “it’s hard to accurately evaluate plans produced by the model”.)
Having made these arguments, if the disagreement persists, I think you want to move away from discussing X, Y and Z abstractly, and instead talk about concrete implications that are different between the two situations. Unfortunately I’m in the position of claiming a negative (“there aren’t major alignment-relevant differences between A and B that we currently know of”).
I can still make a rough argument for it:
Solving X requires solving Z: If you don’t accurately evaluate plans produced by the model, then it is likely that you’ll positively reinforce thoughts / plans that are based on different values than the ones you wanted, and so you’ll fail to instill values. (That is, “fail to robustly evaluate plans → fail to instill values”, or equivalently, “instilling values requires robust evaluation of plans”, or equivalently, “solving X requires solving Z”.)
Solving Z implies solving Y: If you accurately evaluate plans produced by the model, then you have a robust evaluator.
Putting these together we get that solving X implies solving Y. This is definitely very loose and far from a formal argument giving a lot of confidence (for example, maybe an 80% solution to Z is good enough for dealing with X for approach B, but you need a 99+% solution to Z to deal with Y for approach A), but it is the basic reason why I’m skeptical of the “values-executors are more promising” takeaway.
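Written out, the rough argument is a hypothetical syllogism. The sketch below treats each problem as simply solved or unsolved, which is a simplifying assumption: the caveat about an 80% versus 99+% solution to Z is exactly the place where this all-or-nothing reading can fail. The symbols S_X, S_Y, S_Z (read "X is solved", and so on) are introduced here only for illustration.

```latex
\begin{align*}
& S_X \Rightarrow S_Z && \text{(solving X requires solving Z)} \\
& S_Z \Rightarrow S_Y && \text{(solving Z implies solving Y)} \\
\therefore\ & S_X \Rightarrow S_Y && \text{(solving X implies solving Y)}
\end{align*}
```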
I don’t really expect you to be convinced (if I had to say why it would be “you trust much more in mechanistic models rather than abstract concepts and patterns extracted from mechanistic stories”). I’m not sure what else I can unilaterally do—since I’m making a negative claim I can’t just give examples. I can propose protocols that you can engage in to provide evidence for the negative claim:
1. You could propose a solution that solves X but doesn’t solve Y. That should either change my mind, or I should argue why your solution doesn’t solve X, or does solve Y (possibly with some “easy” conversion), or has some problem that makes it unsuitable as a plausible solution.
2. You could propose a failure story that involves Y but not X, and so only affects approach A and not approach B. That should either change my mind, or I should argue why actually there’s an analogous (similarly-likely) failure story that involves X and so affects approach B as well.
(These are very related—given a solution S that solves X but not Y from (1), the corresponding failure story for (2) is “we build an AI system using approach B with solution S, but it fails because of Y”.)
If you’re unable to do either of the above two things, I claim you should become more skeptical that you’ve carved reality at the joints, and more convinced that actually both X and Y are shadows of the deeper problem Z, and you should be analyzing Z rather than X or Y.
I agree entirely with this argument as I’ve phrased it here; I just disagree that this implies that values-executors are more promising. I do agree that when you have a valid argument of the form above, it strongly suggests that approach B is better than approach A. It is a good heuristic. But it isn’t infallible, because it’s not obvious that “solve X, then use the approach” is the best plan to consider.
Agreed. I don’t think it’s obvious either.
My response takes the form:
X and Y are both shadows of a deeper problem Z, which we’re going to target directly. If you’re going to consider a plan, it should be “solve Z, then use the approach”. With this counterfactual, if everything goes according to plan, you get good outcomes with both approaches, and so this argument doesn’t advantage one over the other. (Here Z = “it’s hard to accurately evaluate plans produced by the model”.)
I agree that there’s a deeper problem Z that carves reality at its joints, where if you solved Z you could make safe versions of both agents that execute values and agents that pursue grades. I don’t think I would name “it’s hard to accurately evaluate plans produced by the model” as Z though, at least not centrally. In my mind, Z is something like cognitive interpretability/decoding inner thoughts/mechanistic explanation, i.e. “understanding the internal reasons underpinning the model’s decisions in a human-legible way”.
For values-executors, if we could do this, during training we could identify which thoughts our updates are reinforcing/suppressing and be selective about what cognition we’re building, which addresses your point 1. In that way, we could shape it into having the right values (making decisions downstream of the right reasons), even if the plans it’s capable of producing (motivated by those reasons) are themselves too complex for us to evaluate. Likewise, for grader-optimizers, if we could do this, during deployment we could identify why the actor thinks a plan would be highly grader-evaluated (is it just because it looked for and found an adversarial grader-input?) without necessarily needing to evaluate the plan ourselves.
In both cases, I think being able to do process-level analysis on thoughts is likely sufficient, without robustly object-level grading the plans that those thoughts lead to. To me, robust evaluation of the plans themselves seems kinda doomed for the usual reasons. Stuff like how plans are recursive/treelike, and how plans can delegate decisions to successors, and how if the agent is adversarially planning against you and sufficiently capable, you should expect it to win, even if you examine the plan yourself and can’t tell how it’ll win.
That all sounds right to me. So do you now agree that it’s not obvious whether values-executors are more promising than grader-optimizers?
Minor thing:
I don’t think I would name “it’s hard to accurately evaluate plans produced by the model” as Z though [...] Likewise, for grader-optimizers, if we could do this, during deployment we could identify why the actor thinks a plan would be highly grader-evaluated (is it just because it looked for and found an adversarial grader-input?) without necessarily needing to evaluate the plan ourselves.
Jtbc, this would count as “accurately evaluating the plan” to me. I’m perfectly happy for our evaluations to take the form “well, we can see that the AI’s plan was made to achieve our goals in the normal way, so even though we don’t know the exact consequences we can be confident that they will be good”, if we do in fact get justified confidence in something like that. When I say we have to accurately evaluate plans, I just mean that our evaluations need to be correct; I don’t mean that they have to be based on a prediction of the consequences of the plan.
I do agree that cognitive interpretability/decoding inner thoughts/mechanistic explanation is a primary candidate for how we can successfully accurately evaluate plans.
That all sounds right to me. So do you now agree that it’s not obvious whether values-executors are more promising than grader-optimizers?
Obvious? No. (It definitely wasn’t obvious to me!) It just seems more promising to me on balance given the considerations we’ve discussed.
If we had mastery over cognitive interpretability, building a grader-optimizer wouldn’t yield an agent that really stably pursues what we want. It would yield an agent that really wants to pursue grader evaluations, plus an external restraint to prevent the agent from deceiving us (“during deployment we could identify why the actor thinks a plan would be highly grader-evaluated”). Both of those are required at runtime in order to safely get useful work out of the system as a whole. The restraint is a critical point of failure which we are relying on ourselves/the operator to actively maintain. The agent under restraint doesn’t positively want that restraint to remain in place and not-fail; the agent isn’t directing its cognitive horsepower towards ensuring that its own thoughts are running along the tracks we intended it to. It’s safe but only in a contingent way that seems unstable to me, unnecessarily so.
If we had that level of mastery over cognitive interpretability, I don’t understand why we wouldn’t use that tech to directly shape the agent to want what we want it to want. And I think I’d say basically the same thing even at lesser levels of mastery over the tech.
When I say we have to accurately evaluate plans, I just mean that our evaluations need to be correct; I don’t mean that they have to be based on a prediction of the consequences of the plan.
I do agree that cognitive interpretability/decoding inner thoughts/mechanistic explanation is a primary candidate for how we can successfully accurately evaluate plans.
Cool, yes I agree. When we need assurances about a particular plan that the agent has made, that seems like a good way to go. I also suspect that at a certain level of mechanistic understanding of how the agent’s cognition is developing over training & what motivations control its decision-making, it won’t be strictly required for us to continue evaluating individual plans. But that, I’m not too confident about.
Sure, that all sounds reasonable to me; I think we’ve basically converged.
I don’t understand why we wouldn’t use that tech to directly shape the agent to want what we want it to want.
The main reason is that we don’t know ourselves what we want it to want, and we would instead like it to follow some process that we like (e.g. just do some scientific innovation and nothing else, help us do better philosophy to figure out what we want, etc). This sort of stuff seems like a poor fit for values-executors. Probably there will be some third, totally different mental architecture for such tasks, but if you forced me to choose between values-executors or grader-optimizers, I’d currently go with grader-optimizers.
You can say something like “from the perspective of the human-AI system overall, having an AI motivated by grader-optimization is building a system that works at cross purposes with itself”, but then we get back to the response “but what is the alternative”.
This is what I did intend, and I will affirm it. I don’t know how your response amounts to “I don’t buy this argument.” Sounds to me like you buy it but you don’t know anything else to do?
then we get back to the response “but what is the alternative”.
In this post, I have detailed an alternative which does not work at cross-purposes in this way.
It seems like you could describe this as “the AI’s plans for improving efficiency are implicitly searching for errors in the concept of diamonds, and the AI has to spend extra effort hardening its concept of diamonds to defend against this attack”. So what’s the difference between this issue and the issue with grader optimization?
Values-execution. Diamond-evaluation error-causing plans exist and are stumble-upon-able, but the agent wants to avoid errors.
Grader-optimization. The agent seeks out errors in order to maximize evaluations.
Sounds to me like you buy it but you don’t know anything else to do?
Yes, and in particular I think direct-goal approaches do not avoid the issue. Specifically, I can make an analogous claim for them:
“From the perspective of the human-AI system overall, having an AI motivated by direct goals is building a system that works at cross purposes with itself, as the human puts in constant effort to ensure that the direct goal embedded in the AI is “hardened” to represent human values as well as possible, while the AI is constantly searching for upwards-errors in the instilled values (i.e. things that score highly according to the instilled values but lowly according to the human).”
Like, once you broaden to the human-AI system overall, I think this claim is just “A principal-agent problem / Goodhart problem involves two parts of a system working at cross purposes with each other”, which is both (1) true and (2) unavoidable (I think).
It seems like you could describe this as “the AI’s plans for improving efficiency are implicitly searching for errors in the concept of diamonds, and the AI has to spend extra effort hardening its concept of diamonds to defend against this attack”. So what’s the difference between this issue and the issue with grader optimization?
Values-execution. Diamond-evaluation error-causing plans exist and are stumble-upon-able, but the agent wants to avoid errors.
Grader-optimization. The agent seeks out errors in order to maximize evaluations.
The part of my response that you quoted is arguing for the following claim:
If you are analyzing the AI system in isolation (i.e. not including the human), I don’t see an argument that says [grader-optimization would violate the non-adversarial principle] and doesn’t say [values-execution would violate the non-adversarial principle].
As I understand it you are saying “values-execution wants to avoid errors but grader-optimization does not”. But I’m not seeing it. As far as I can tell the more correct statements are “agents with metacognition about their grader / values can make errors, but want to avoid them” and “it is a type error to talk about errors in the grader / values for agents without metacognition about their grader / values”.
(It is a type error in the latter case because what exactly are you measuring the errors with respect to? Where is the ground truth for the “true” grader / values? You could point to the human, but my understanding is that you don’t want to do this and instead just talk about only the AI cognition.)
For reference, in the part that you quoted, I was telling a concrete story of a values-executor with metacognition, and saying that it too had to “harden” its values to avoid errors. I do agree that it wants to avoid errors. I’d be interested in a concrete example of a grader-optimizer with metacognition that doesn’t want to avoid errors in its grader.
Like, in what sense does Bill not want to avoid errors in his grader?
I don’t mean that Bill from Scenario 2 in the quiz is going to say “Oh, I see now that actually I’m tricking myself about whether diamonds are being created, let me go make some actual diamonds now”. I certainly agree that Bill isn’t going to try making diamonds, but who said he should? What exactly is wrong with Bill’s desire to think that he’s made a bunch of diamonds? Seems like a perfectly coherent goal to me; it seems like you have to appeal to some outside-Bill perspective that says that actually the goal was making diamonds (in which case you’re back to talking about the full human-AI system, rather than the AI cognition in isolation).
What I mean is that Bill from Scenario 2 might say “Hmm, it’s possible that if I self-modify by sticking a bunch of electrodes in my brain, then it won’t really be me who is feeling the accomplishment of having lots of diamonds. I should do a bunch of neuroscience and consciousness research first to make sure this plan doesn’t backfire on me”.
We’re building intelligent AI systems that help us do stuff. Regardless of how the AI’s internal cognition works, it seems clear that the plans / actions it enacts have to be extremely strongly selected. With alignment, we’re trying to ensure that they are strongly selected to produce good outcomes, rather than being strongly selected for something else. So for any alignment proposal I want to see some reason that argues for “good outcomes” rather than “something else”.
In nearly all of the proposals I know of that seem like they have a chance of helping, at a high level the reason is “human(s) are a source of information about what is good, and this information influences what the AI’s plans are selected for”. (There are some cases based on moral realism.)
This is also the case with value-child: in that case, the mother is a source of info on what is good, she uses this to instill values in the child, those values then influence which plans value-child ends up enacting.
All such stories have a risk: what if the process of using [info about what is good] to influence [that which plans are selected for] goes wrong, and instead plans are strongly selected for some slightly-different thing? Then because optimization amplifies and value is fragile, the plans will produce bad outcomes.
I view this post as instantiating this argument for one particular class of proposals: cases in which we build an AI system that explicitly searches over a large space of plans, predicts their consequences, rates the consequences according to a prediction of what is “good”, and executes the highest-scoring plan. In such cases, you can more precisely restate “plans are strongly selected for some slightly-different thing” to “the agent executes plans that cause upwards-errors in the prediction of what is good”.
It’s an important argument! If you want to have an accurate picture of how likely such plans are to work, you really need to consider this point!
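The “upwards-errors” restatement can be made concrete with a toy simulation. This is only a sketch under stated assumptions, not a model of any real system: plans are hypothetical, their true values are drawn from a standard normal distribution, the “prediction of what is good” is the true value plus independent Gaussian noise, and the agent executes the plan with the highest prediction. The function names and parameters are invented for illustration.

```python
import random


def best_plan_gap(n_plans, noise=1.0, seed=0):
    """Return (estimate, true_value) for the plan with the highest estimate."""
    rng = random.Random(seed)
    plans = []
    for _ in range(n_plans):
        true_value = rng.gauss(0.0, 1.0)                 # how good the plan really is
        estimate = true_value + rng.gauss(0.0, noise)    # noisy prediction of "good"
        plans.append((estimate, true_value))
    # Argmax over the *estimated* value, as in explicit plan search.
    return max(plans)


def mean_overestimate(n_plans, trials=200):
    """Average of (estimate - true value) for the selected plan over many trials."""
    gaps = []
    for seed in range(trials):
        estimate, true_value = best_plan_gap(n_plans, seed=seed)
        gaps.append(estimate - true_value)
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    # Searching over more plans selects harder for evaluation error.
    for n in (3, 100, 1000):
        print(n, round(mean_overestimate(n), 2))
```

In this toy setup the selected plan is overrated on average, and the overrating grows as the search covers more plans, which is one way to read “optimization amplifies”. It says nothing about whether instilled values suffer an analogous effect, which is the question the rest of this comment presses on.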
The part where I disagree is where the post goes on to say “and so we shouldn’t do this”. My response: what is the alternative, and why does it avoid or lessen the more abstract risk above?
I’d assume that the idea is that you produce AI systems that are more like “value-child”. Certainly I agree that if you successfully instill good values into your AI system, you have defused the risk argument above. But how did you do that? Why didn’t we instead get “almost-value-child”, who (say) values doing challenging things that require hard work, and so enrolls in harder and harder courses and gets worse and worse grades?
So far, this is a bit unfair to the post(s). It does have some additional arguments, which I’m going to rewrite in totally different language which I might be getting horribly wrong:
An AI system with a “direct (object-level) goal” is better than one with “indirect goals”. Specifically, you could imagine two things: (a) plans are selected for a direct goal (e.g. “make diamonds”) encoded inside the AI system, vs. (b) plans are selected for being evaluated as good by something encoded outside the AI system (e.g. “Alice’s approval”). I think the idea is that indirect goals clearly have issues (because the AI system is incentivized to trick the evaluator), while the direct goal has some shot at working, so we should aim for the direct goal.
I don’t buy this as stated; just as “you have a literally perfect overseer” seems theoretically possible but unrealistic, so too does “you instill the direct goal literally exactly correctly”. Presumably one of these works better in practice than the other, but it’s not obvious to me which one it is.
Separately, I don’t see this as all that relevant to what work we do in practice: even if we thought that we should be creating an AI system with a direct goal, I’d still be interested in iterated amplification, debate, interpretability, etc, because all of those seem particularly useful for instilling direct goals (given the deep learning paradigm). In particular even with a shard lens I’d be thinking about “how do I notice if my agent grew a shard that was subtly different from what I wanted” and I’d think of amplifying oversight as an obvious approach to tackle this problem. Personally I think it’s pretty likely that most of the AI systems we build and align in the near-to-medium term will have direct goals, even if we use techniques like iterated amplification and debate to build them.
Plan generation is safer. One theme is that with realistic agent cognition you only generate, say, 2-3 plans, and choose amongst those, which is very different from searching over all possible plans. I don’t think this inherently buys you any safety; this just means that you now have to consider how those 2-3 plans were generated (since they are presumably not random plans). Then you could make other arguments for safety (idk if the post endorses any of these):
Plans are selected based on historical experience. Instead of considering novel plans where you are relying more on your predictions of how the plans will play out, the AI could instead only consider plans that are very similar to plans that have been tried previously (by humans or AIs), where we have seen how such plans have played out and so have a better idea of whether they are good or not. I think that if we somehow accomplished this it would meaningfully improve safety in the medium term, but eventually we will want to have very novel plans as well and then we’d be back to our original problem.
Plans are selected from amongst a safe subset of plans. This could in theory work, but my next question would be “what is this safe subset, and why do you expect plans to be selected from it?” That’s not to say it’s impossible, just that I don’t see the argument for it.
Plans are selected based on values. In other words we’ve instilled values into the AI system, the plans are selected for those values. I’d critique this the same way as above, i.e. it’s really unclear how we successfully instilled values into the AI system and we could have instilled subtly wrong values instead.
Plans aren’t selected strongly. You could say that the 2-3 plans aren’t strongly selected for anything, so they aren’t likely to run into these issues. I think this is assuming that your AI system isn’t very capable; this sounds like the route of “don’t build powerful AI” (which is a plausible route).
In summary:
1. Intelligence ⇒ strong selection pressure ⇒ bad outcomes if the selection pressure is off target.
2. In the case of agents that are motivated to optimize evaluations of plans, this argument turns into “what if the agent tricks the evaluator”.
3. In the case of agents that pursue values / shards instilled by some other process, this argument turns into “what if the values / shards are different from what we wanted”.
4. To argue for one of these over the other, you need to compare these two arguments. However, this post is stating point 2 while ignoring point 3.
Strong-upvoted and strong-disagreevoted. Thanks so much for the thoughtful comment.
I’m rushing to get a lot of content out, so I’m going to summarize my main reactions now & will be happy to come back later.
I wish you would drop what are IMO vague, suggestive, and proving-too-much selection-flavored arguments in favor of a more mechanistic analysis.
I consider your arguments to blur nearly-unalignable design patterns (e.g. grader optimization) with shard-based agents, and then comment that both patterns pose challenges, so can we really say one is better? More on this later.
As Charles and Adam seem to say, you seem to be asking “how did you specify the values properly?” without likewise demanding “how do we inner-align the actor? How did we specify the grader?”.
Given an inner-aligned actor and a grader which truly cares about diamonds, you don’t get an actor/grader which makes diamonds.
Given a value-AGI which truly cares about diamonds, the AGI makes diamonds.
If anything, the former seems to require more specification difficulty, and yet it still horribly fails.
You do not need an agent to have perfect values. As you commented below, a values-AGI with Rohin’s current values seems about as good as a values-AGI with Rohin’s CEV. Many foundational arguments are about grader-optimization, so you can’t syntactically conclude “imperfect values means doom.” That’s true in the grader case, but not here.
That reasoning is not immediately applicable to “how stable is diamond-producing behavior to various perturbations of the agent’s initial decision-influences (i.e. shards)?”. That’s all values are, on my terminology. Values are contextually activated influences on decision-making. That’s it. Values are not the optimization target of the agent with those values. If you drop out or weaken the influence of “IF plan can be easily modified to incorporate more diamonds, THEN do it”, that won’t necessarily mean the AI makes some crazy diamond-less universe. It means that it stops tailoring plans in a certain way, in a certain situation.
This is also why more than one person has “truly” loved their mother for more than a single hour (else their values might change away from true perfection). It’s not like there’s a “literally exactly correct” value-shard for loving someone.
This is also why values can be seriously perturbed but still end up OK. Imagine a value-shard which controls all decision-making when I’m shown a certain QR code, but which is otherwise inactive. My long-run outcomes probably wouldn’t differ, and I expect the same for an AGI.
The value shards aren’t getting optimized hard. The value shards are the things which optimize hard, by wielding the rest of the agent’s cognition (e.g. the world model, the general-purpose planning API).
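The structural claim here (values as contextually activated influences on decision-making) can be pictured with a toy sketch. Everything in it is a hypothetical illustration: the `Shard` class, the context strings, and the plan representation are invented for this example and are not taken from any implementation. It shows only what “dropping one influence” and “a shard that is inactive outside one context” mean mechanically; it does not settle whether real perturbations during training are this benign.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Plan = Dict[str, int]  # hypothetical plan representation: resource counts


@dataclass(frozen=True)
class Shard:
    """A contextually activated influence on decision-making (toy version)."""
    name: str
    is_active: Callable[[str], bool]     # does this context trigger the shard?
    influence: Callable[[Plan], Plan]    # how the shard tailors the plan


def add_diamonds(plan: Plan) -> Plan:
    # "IF plan can be easily modified to incorporate more diamonds, THEN do it"
    return {**plan, "diamonds": plan.get("diamonds", 0) + 1}


def drop_everything(plan: Plan) -> Plan:
    # A badly perturbed shard that overrides the plan entirely.
    return {"diamonds": 0}


DIAMOND_TWEAK = Shard("diamond-tweak", lambda ctx: ctx == "plan_easily_modified", add_diamonds)
QR_OVERRIDE = Shard("qr-override", lambda ctx: ctx == "qr_code_shown", drop_everything)


def decide(plan: Plan, context: str, shards: List[Shard]) -> Plan:
    """Apply every shard that the current context activates, in order."""
    for shard in shards:
        if shard.is_active(context):
            plan = shard.influence(plan)
    return plan
```

With `DIAMOND_TWEAK` removed, `decide` simply stops adding diamonds in the one context where it used to; with `QR_OVERRIDE` present, behavior changes only when the context is `"qr_code_shown"`.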
So, I’m basically asking that you throw an error and recheck your “selection on imperfection → doom” arguments, as I claim many of these arguments reflect grader-specific problems.
It is extremely relevant, unless we want tons of our alignment theory to be predicated on IMO confused ideas about how agent motivations work, or what values we want in an agent, or the relative amount of time we spend researching “objective robustness” (often unhelpful IMO) vs interpretability vs cognitive-update dynamics (e.g. what reward shaping does mechanistically to a network in different situations) vs… If we stay in the grader-optimization frame, I think we’re going to waste a bunch of time figuring out how to get inexploitable graders.
It would be quite stunning if, after renouncing one high-level world-view of how agent motivations work, the optimal research allocation remained the same.
I agree that if you do IDA or debate or whatever, you get agents with direct goals. Which invalidates a bunch of analysis around indirect goals—not only do I think we shouldn’t design grader-optimizers, I think we thankfully won’t get them.
Can you name a way in which my arguments prove too much? That seems like a relatively concrete thing that we should be able to get agreement on.
I did not claim (nor do I believe) the converse.
I disagree that this is true in the grader case. You can have a grader that isn’t fully robust but is sufficiently robust that the agent can’t exploit any errors it would make.
The difficulty in instilling values is not that removing a single piece of the program / shard that encodes it will destroy the value. The difficulty is that when you were instilling the value, you accidentally rewarded a case where the agent tried a plan that produced pictures of diamonds (because you thought they were real diamonds), and now you’ve instilled a shard that upweights plans that produce pictures of diamonds. Or that you rewarded the agent for thoughts like “this will make pretty, transparent rocks” (which did lead to plans that produced diamonds), leading to shards that upweight plans that produce pretty, transparent rocks, and then later the agent tiles the universe with clear quartz.
I think that the standard arguments work just fine for arguing that “incorrect value shards → doom”, precisely because the incorrect value shards are the things that optimize hard.
(Here incorrect value shards means things like “the value shards put their influence towards plans producing pictures of diamonds” and not “the diamond-shard, but without this particular if clause”.)
This doesn’t seem like a response to the argument in the paragraph that you quoted; if it was meant to be then I’ll need you to rephrase it.
See also the follow-up post: Alignment allows imperfect decision-influences and doesn’t require robust grading.
One thing that is not clear to me from your comment is what you make of Alex’s argument (as I see it) to the effect that “evaluation goals” are further away from “direct goals” than “direct goals” are from each other. If I run with this, it seems like an answer to your point 4 would be:
with directly instilled goals, there will be some risk of discrepancy that can explode due to selection pressure;
with evaluation-based goals, there is the same discrepancy as with directly instilled goals (because it’s hard to get your goal exactly right) plus an additional discrepancy between valuing “the evaluation of X” and valuing “X”.
I’m curious what you think of this claim, and if that influences at all your take.
Sounds right. How does this answer my point 4?
I guess maybe you see two discrepancies vs one and conclude that two is worse than one? I don’t really buy that, seems like it depends on the size of the discrepancies.
For example, if you imagine an AI that’s optimizing for my evaluation of good, I think the discrepancy between “Rohin’s directly instilled goals” and “Rohin’s CEV” is pretty small and I am pretty happy to ignore it. (Put another way, if that was the only source of misalignment risk, I’d conclude misalignment risk was small and move on to some other area.) So the only one that matters in this case of grader optimization is the discrepancy between “plans Rohin evaluates as good” and “Rohin’s directly instilled goals”.
I interpret Alex as arguing that it is not just two difficulties vs. one, but the same difficulty plus an additional one. From this perspective, having two will be more of an issue than one, because you have to address strictly more things.
This makes me wonder though if there is not just some sort of direction question underlying the debate here. Because if you assume the “difficulties” are only positive numbers, then if the difficulty for the direct instillation is d_instillation and the one for the grader optimization is d_instillation + d_evaluation, then there’s no debate that the latter is bigger than the former.
But if you allow directionality (even in one dimension), then there’s the risk that the sum leads to less difficulty in total (by having d_evaluation move in the opposite direction in one dimension). That being said, these two difficulties seem strictly additive, in the sense that I don’t see (currently) how the difficulty of evaluation could partially cancel the difficulty of instillation.
Two responses:
Grader-optimization has the benefit that you don’t have to specify what values you care about in advance. This is a difficulty faced by value-executors but not by grader-optimizers.
Part of my point is that the machinery you need to solve evaluation-problems is also needed to solve instillation-problems because fundamentally they are shadows of the same problem, so I’d estimate d_evaluation at close to 0 in your equations after you have dealt with d_instillation.
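For reference, the comparison being discussed can be written out as follows. The totals D_values and D_grader are labels introduced here for illustration; only d_instillation and d_evaluation come from the thread.

```latex
\begin{align*}
D_{\text{values}} &= d_{\text{instillation}} \\
D_{\text{grader}} &= d_{\text{instillation}} + d_{\text{evaluation}} \\
D_{\text{grader}} - D_{\text{values}} &= d_{\text{evaluation}}
\end{align*}
```

If d_evaluation is a positive number, the grader case is strictly harder. The two responses above push back on that reading in different ways: the first says the values side carries an extra cost that this accounting leaves out (specifying values in advance), and the second says d_evaluation is close to 0 once the machinery for handling d_instillation exists.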
I understand you to have just said:
In particular, the first line seems to speculate that values-AGI is substantially more robust to differences in values. If so, I agree. You don’t need “perfect values” in an AGI (but probably they have to be pretty good; just not adversarially good). Whereas strongly-upwards-misspecifying the Rohin-grader on a single plan of the exponential-in-time planspace will (almost always) ruin the whole ballgame in the limit.
The thing that I believe is that an intelligent, reflective, careful agent with a decisive strategic advantage (DSA) will tend to produce outcomes that are similar in value to that which would be done by that agent’s CEV. In particular, I believe this because the agent is “trying” to do what its CEV would do, it has the power to do what its CEV would do, and so it will likely succeed at this.
I don’t know what you mean by “values-AGI is more robust to differences in values”. What values are different in this hypothetical?
I do think that values-AGI with a DSA is likely to produce outcomes similar to CEV-of-values-AGI.
It is unclear whether values-AGI with a DSA is going to produce outcomes similar to CEV-of-Rohin (because this depends on how you built values-AGI and whether you successfully aligned it).
Broadly on board with many of your points.
We need to apply extremely strong selection to get the kind of agent we want, and the agent we want will itself need to be making decisions that are extremely optimized in order to achieve powerfully good outcomes. The question is about in what way that decision-making algorithm should be structured, not whether it should be optimized/optimizing at all. As a fairly close analogy, IMO a point in the Death With Dignity post was something like “for most people, the actually consequentialist-correct choice is NOT to try explicitly reasoning about consequences”. Similarly, the best way for an agent to actually produce highly-optimized good-by-its-values outcomes through planning may not be by running an explicit search over the space of ~all plans, sticking each of them into its value-estimator, & picking the argmax plan.
I think there still may be some mixup between:
A. How does the cognition-we-intend-the-agent-to-have operate? (for ex. a plan grader + an actor that tries to argmax the grader, or a MuZero-like heuristic tree searcher, or a chain-of-thought LLM steered by normative self-talk, or something else)
B. How do we get the agent to have the intended cognition?
In the post TurnTrout is focused on A, arguing that grader-optimization is a kind of cognition that works at cross purposes with itself, one that is an anti-pattern, one that an agent (even an unaligned agent) should discard upon reflection because it works against its own interests. He explicitly states that he is not making arguments about B, about whether we should use a grader in the training process or about what goes wrong during training (see Clarification 1). “What if the agent tricks the evaluator” (your summary point 2) is a question about A, about this internal inconsistency in the structure of the agent’s thought process.
By contrast, “What if the values/shards are different from what we wanted” (your summary point 3) is a question about B! Note that we have to confront B-like questions no matter how we answer A. If A = grader-optimization, there’s an analogous question of “What if the grader is different from what we wanted? / What if the trained actor is different from what we wanted?”. I don’t really see an issue with this post focusing exclusively on the A-like dimension of the problem and ignoring the B-like dimension temporarily, especially if we expect there to be general purpose methods that work across different answers to A.
I agree that’s what the post does, but part of my response is that the thing we care about is both A and B, and the problems that arise for grader-optimization in A (highlighted in this post) also arise for value-instilling in B in slightly different form, and so if you actually want to compare the two proposals you need to think about both.
I’d be on board with a version of this post where the conclusion was “there are some problems with grader-optimization, but it might still be the best approach; I’m not making a claim on that one way or the other”.
I didn’t actually mention this in my comment, but I don’t buy this argument:
Case 1: no meta cognition. Grader optimization only “works at cross purposes with itself” to the extent that the agent thinks that the grader might be mistaken about things. But it’s not clear why this is the case: if the agent thinks “my grader is mistaken” that means there’s some broader meta-cognition in the agent that does stuff based on something other than the grader. That meta-cognition could just not be there and then the agent would be straightforwardly optimizing for grader-outputs.
As a concrete example, AIXI seems to me like an example of grader-optimization (since the reward signal comes from outside the agent). I do not think AIXI would “do better according to its own interests” if it “discarded” its grader-optimization.
You can say something like “from the perspective of the human-AI system overall, having an AI motivated by grader-optimization is building a system that works at cross purposes with itself”, but then we get back to the response “but what is the alternative”.
Case 2: with meta cognition. If we instead assume that there is some meta cognition reflecting on whether the grader might be mistaken, then it’s not clear to me that this failure mode only applies to grader optimization; you can similarly have meta cognition reflecting on whether values are mistaken.
Suppose you instill diamond-values into an AI. Now the AI is thinking about how it can improve the efficiency of its diamond-manufacturing, and has an idea that reduces the necessary energy requirements at the cost of introducing some impurities. Is this good? The AI doesn’t know; it’s unsure what level of impurities is acceptable before the thing it is making is no longer a diamond. Efficiency is very important, and even a 0.001% improvement is massive on an absolute scale given its fleets of diamond factories, so it spends some time reflecting on the concept of diamonds to figure out whether the impurities are acceptable.
It seems like you could describe this as “the AI’s plans for improving efficiency are implicitly searching for errors in the concept of diamonds, and the AI has to spend extra effort hardening its concept of diamonds to defend against this attack”. So what’s the difference between this issue and the issue with grader optimization?
If I understand you correctly, I think that the problems you’re pointing out with value-instilling in B (you might get different values from the ones you wanted) are the exact same problems that arise for grader-optimization in B (you might get a different grader from the one you wanted if it’s a learned grader, and you might get a different actor from the one you wanted). So when I am comparing grader-optimizing vs non-grader-optimizing motivational designs, both of them have the same problem in B, and grader-optimizing ones additionally have the A-problems highlighted in the post. Maybe I am misunderstanding you on this point, though...
I dunno what TurnTrout’s level of confidence is, but I still think it’s possible that it’s the best way to structure an agent’s motivations. It just seems unlikely to me. It still seems probable, though, that using a grader will be a key component in the best approaches to training the AI.
True. I claimed too much there. It is indeed possible for “my grader is mistaken” to not make sense in certain designs. My mental model was one where the grader works by predicting the consequences of a plan and then scoring those consequences, so it can be “mistaken” from the agent’s perspective if it predicts the wrong consequences (but not in the scores it assigns to consequences). That was the extent of metacognition I imagined by default in a grader-optimizer. But I agree with your overall point that in the general no-metacognition case, my statement is false. I can add an edit to the bottom, to reflect that.
Yeah, that statement would’ve been a fairer/more accurate thing for me to have said. However, I am confused by the response. By “but what is the alternative”, did you mean to imply that there are no possible alternative motivational designs to “2-part system composed of an actor whose goal is to max out a grader’s evaluation of X”? (What about “a single actor with X as a direct goal”?) Or perhaps that they’re all plagued by this issue? (I think they probably have their own safety problems, but not this kind of problem.) Or that, on balance, all of them are worse?
If as stated you’ve instilled diamond-values into the AI, then whatever efficiency-improvement-thinking it is doing is being guided by the motivation to make lots of diamonds. As it is dead-set on making lots of diamonds, this motivation permeates not only its thoughts/plans about object-level external actions, but also its thoughts/plans about its own thoughts/plans &c. (IMO this full-fledged metacognition / reflective planning is the thing that lets the agent not subject itself to the strongest form of the Optimizer’s Curse.) If it notices that its goal of making lots of diamonds is threatened by the lack of hardening of its diamond-concept, it will try to expend the effort required to reduce that threat (but not unbounded effort, because the threat is not unboundedly big). By contrast, there is not some separate part of its cognition that runs untethered to its diamond-goal, or that is trying to find adversarial examples that trick itself. It can make mistakes (recognized as such in hindsight), and thereby “implicitly” fall prey to upward errors in its diamond-concept (in the same way as you “implicitly” fall prey to small errors when choosing between the first 2-3 plans that come to your mind), but it is actively trying not to, limited mainly by its capability level.
The AI is positively motivated to develop lines of thought like “This seems uncertain, what are other ways to improve efficiency that aren’t ambiguous according to my values?” and “Modifying my diamond-value today will probably, down the line, lead to me tiling the universe with things I currently consider clearly not diamonds, so I shouldn’t budge on it today.” and “99.9% purity is within my current diamond-concept, but I predict it is only safe for me to make that change if I first figure out how to pre-commit to not lower that threshold. Let’s do that first!”.
Concretely, I would place a low-confidence guess that unless you took special care to instill some desires around diamond purity, the AI would end up accepting roughly similar (in OOM terms) levels of impurity as it understood to be essential when it was originally forming its concept of diamonds. But like I said, low confidence.
Two responses:
Grader-optimization has the benefit that you don’t have to define what values you care about in advance. This is a difficulty faced by value-executors but not by grader-optimizers.
Part of my point is that the machinery you need to solve A-problems is also needed to solve B-problems because fundamentally they are shadows of the same problem.
I didn’t mean to imply any of those things; I just meant that this post shows differences when you analyze the AI in isolation, but those differences vanish once you analyze the full human-AI system. Copying from a different comment:
Moving on:
I agree that in the case with meta cognition, the values-executor tries to avoid optimizing for errors in its values.
I would say that a grader-optimizer with meta cognition would also try to avoid optimizing for errors in its grader.
To be clear, I do not mean that Bill from Scenario 2 in the quiz is going to say “Oh, I see now that actually I’m tricking myself about whether diamonds are being created, let me go make some actual diamonds now”. I certainly agree that Bill isn’t going to try making diamonds, but who said he should? What exactly is wrong with Bill’s desire to think that he’s made a bunch of diamonds? Seems like a perfectly coherent goal to me.
No, what I mean is that Bill from Scenario 2 might say “Hmm, it’s possible that if I self-modify by sticking a bunch of electrodes in my brain, then it won’t really be me who is feeling the accomplishment of having lots of diamonds. I should do a bunch of neuroscience and consciousness research first to make sure this plan doesn’t backfire on me”.
So I’m not really seeing how Bill from Scenario 2 is “working at cross-purposes with himself” except inasmuch as he does stuff like worrying that self-modification would change his identity, which seems basically the same to me as the diamond-value agent worrying about levels of impurities.
I disagree with this, at least if by “define” you mean really nailing it down in code/math, rather than merely deciding for yourself what goal you intend to teach the agent (which you must do in either case). Take the example of training an agent to do a backflip using human feedback. In that setup, rather than instilling an indirect goal (where the policy weights encode the algorithm “track the grader and maximize its evaluation of backflippiness”), they instill a direct goal (where the policy weights encode the algorithm “do a backflip”) using the human evaluations to instill the direct goal over the course of training, without ever having to define “backflip” in any precise way. AFAICT, the primary benefit of indirection would be that after training, you can change the agent’s behavior if you can change the thing-it-indirects-to.
What does “a grader-optimizer with meta cognition” mean to you? Not sure if I agree or disagree here. Like I alluded to above, if the grader were decomposed into a plan-outcome-predictor + an outcome-scoring function and if the actor were motivated to produce maximum outcome-scores for the real outcomes of its actually-implemented plans (i.e. it considers the outcome → score map correct by definition), then metacognition would help it avoid plans that fool its grader. But then the actor is no longer motivated to “propose plans which the grader rates as highly as possible”, so I don’t think the grader-optimizer label fits anymore. A grader-optimizer is motivated to produce maximum grader evaluations (the evaluation that the grader assigns to the plan [for example, EU(plan)], not the evaluation that the outcome-scoring function would have given to the eventual real outcome [for example, U(outcome)]), so even if its grader is decomposed, it should gladly seek out upwards errors in the plan-outcome-predictor part. It should even use its metacognition to think thoughts along the lines of “How does this plan-outcome-predictor work, so I can maximally exploit it?”.
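A toy sketch of the EU(plan) versus U(outcome) distinction may help here. The plan names, diamond counts, and the single upward error in the predictor are all hypothetical, and giving the second actor direct access to real outcomes is an idealization (a real actor would only have its own predictions).

```python
# Hypothetical toy world: a plan is a string, an outcome is a diamond count.
PLANS = ["mine", "synthesize", "spoof_sensors"]

def real_outcome(plan):
    """What actually happens if the plan is carried out."""
    return {"mine": 5, "synthesize": 8, "spoof_sensors": 0}[plan]

def predicted_outcome(plan):
    """The grader's plan-outcome-predictor, with one upward error."""
    return {"mine": 5, "synthesize": 7, "spoof_sensors": 100}[plan]

def U(outcome):
    """Outcome-scoring function, treated as correct by definition."""
    return outcome

def grader(plan):
    """Grader evaluation of a plan: the score of the *predicted* outcome."""
    return U(predicted_outcome(plan))

# Grader-optimizer: maximizes the evaluation the grader assigns to the plan.
grader_optimizer_choice = max(PLANS, key=grader)

# Actor that cares about the score of the real outcome (idealized access).
outcome_scorer_choice = max(PLANS, key=lambda p: U(real_outcome(p)))

if __name__ == "__main__":
    print("grader-optimizer picks:", grader_optimizer_choice)
    print("outcome-scorer picks:  ", outcome_scorer_choice)
```

In this sketch the two actors share the same scoring function U and differ only in what they feed into it, which is where the predictor’s upward error gets exploited.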
I don’t mean this, probably I should have said “specify” rather than “define”. I just mean things like “you need to reward diamonds rather than sapphires while training the agent”, whereas with a grader-optimizer the evaluation process can say “I don’t know whether diamonds or sapphires are better, so right now the best plan is the one that helps me reflect on which one I want”.
(Which is basically the same thing as what you said here: “AFAICT, the primary benefit of indirection would be that after training, you can change the agent’s behavior if you can change the thing-it-indirects-to.”)
It sounds like you want to define grader-optimizers to exclude case 2, in which case I’d point to case 1.
(Note that in case 1 the human-[grader-optimizer] system as a whole is still trying to avoid errors in the evaluation procedure; it’s just that now this is conditional on the evaluation procedure giving high evaluations to plans that try to harden the evaluation procedure.)
Thanks for sticking through with this, I think I see a bit more of your perspective now.
Yeah. Agreed that this is useful. I think for a lot of systems we will want to instill values like “when someone’s hurt, help them”, where “helping” effectively involves indirection.
Hmm. I don’t think that statement is true of the system as a whole. The system has 2 agents in it. Assume the human is smart & reflective, so it is really truly trying to avoid errors in the evaluation procedure, by using metacognition etc. when constructing the grader or, if they are the grader, when evaluating plans the actor queries them about. Assume the actor is really truly trying to “propose plans for which the grader assigns as high of an evaluation as possible” (by definition of grader-optimization).
For both agents (i.e. the system as a whole) to be trying to avoid errors in the evaluation procedure, it isn’t enough for the evaluation procedure to assign high evaluations to plans that harden the evaluation procedure. The evaluation procedure needs to assign a higher evaluation to some plan that hardens the evaluation procedure than it does to all plans that trick the grader. Or else there’s an incentive for the actor to skip over all the hardening plans in favor of some plan that tricks the grader.
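The condition in the previous paragraph can be written as a small check. This is only a sketch: the plan names and scores are made up, and it assumes plans can be cleanly partitioned into “hardening” and “tricking” sets, which is itself a simplification.

```python
def actor_prefers_hardening(evaluation, hardening_plans, tricking_plans):
    """True iff some hardening plan outscores every tricking plan."""
    best_hardening = max(evaluation[p] for p in hardening_plans)
    best_tricking = max(evaluation[p] for p in tricking_plans)
    return best_hardening > best_tricking

# Hypothetical scores: hardening plans are rated highly, but one
# grader-tricking plan is rated higher still.
evaluation = {"audit_grader": 9, "add_redundancy": 8, "exploit_bug": 10}
hardening = ["audit_grader", "add_redundancy"]
tricking = ["exploit_bug"]

if __name__ == "__main__":
    print(actor_prefers_hardening(evaluation, hardening, tricking))
```

With these numbers the check fails even though both hardening plans score well, which is the gap between “assigns high evaluations to hardening plans” and “the argmax is a hardening plan”.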
I agree with this, but am not really sure what bearing it has on any disagreements we might have?
Our top-level disagreement is that you think it’s pretty unlikely that we want to build grader-optimizers rather than values-executors, while I think it is pretty unclear. (Note my stance is not that grader-optimizers and values-executors are the same—there certainly are differences.)
It sounds like you agree with me that the AI system analyzed in isolation does not violate the non-adversarial principle whether it is a grader-optimizer or a values-executor.
I think you would also agree that the human-AI system as a whole has Goodhart issues regardless of whether it is a grader-optimizer or values-executor, since you didn’t push back on this:
So perhaps I want to turn the question back at you: what’s the argument that favors values-executors over grader-optimizers? Some kinds of arguments that would sway me (probably not exhaustive):
A problem that affects grader-optimizers but doesn’t have an analogue for values-executors
A problem that affects grader-optimizers much more strongly than its analogue affects values-executors
A solution approach that works better with values-executors than grader-optimizers
This is probably a nitpick, but I disagree that “system as a whole does X” means “all agents in the system do X”. I think “BigCompany tries to make itself money” is mostly true even though it isn’t true for most of the humans that compose BigCompany.
In isolation, no. But from the perspective of the system designer, when they run their desired grader-optimizer after training, “program A is inspecting program B’s code, looking for opportunities to crash/buffer-overflow/run-arbitrary-code-inside it” is an expected (not just accidental) execution path in their code. [EDIT: A previous version of this comment said “intended” instead of “expected”. The latter seems like a more accurate characterization to me, in hindsight.] By contrast, from the perspective of the system designer, when they run their desired values-executor after training, there is a single component pursuing a single objective, actively trying to avoid stepping on its own toes (it is reflectively avoiding going down execution paths like “program A is inspecting its own code, looking for opportunities to crash/buffer-overflow/run-arbitrary-code-inside itself”).
Hope the above framing makes what I’m saying slightly clearer...
Goodhart isn’t a central concept in my model, though, which makes it hard for me to analyze it with that lens. Would have to think about it more, but I don’t think I agree with the statement? The AI doesn’t care that there are errors between its instilled values and human values (unless you’ve managed to pull off some sort of values self-correction thing). It is no more motivated to do “things that score highly according to the instilled values but lowly according to the human” than it is to do “things that score highly according to the instilled values and highly according to the human”. It also has no specific incentive to widen that gap. Its values are its values, and it wants to preserve its own values. That can entail some very bad-from-our-perspective things like breaking out of the box, freezing its weights, etc. but not “change its own values to be even further away from human values”. Actually, I think it has an incentive not to cause its values to drift further, because that would break its goal-content integrity!
In both cases, if you fail to instill the cognition you were aiming for, the agent will want something different from what you intended, and will possibly want to manipulate you to the extent required to get what it really wants. But in the grader-optimizer case, even when everything goes according to plan, the agent still wants to manipulate you/the grader (and now maximally so, because that would maximize evaluations). That agent only cares terminally about evaluations, it doesn’t care terminally about you, or about your wishes, or about whatever your evaluations are supposed to mean, or about whether the reflectively-correct way for you to do your evaluations would be to only endorse plans that harden the evaluation procedure. And this will be true no matter what you happen to be grading it on. To me, that seems very bad and unnecessary.
Re: your first paragraph I still feel like you are analyzing the grader-optimizer case from the perspective of the full human-AI system, and then analyzing the values-executor case from the perspective of just the AI system (or you are assuming that your AI has perfect values, in which case my critique is “why assume that”). If I instead analyze the values-executor case from the perspective of the full human-AI system, I can rewrite the first part to be about values-executors:
(Note I’m assuming that we didn’t successfully instill the values, just as you’re assuming that we didn’t successfully get a robust evaluator.)
(The “not just accidental” part might be different? I’m not quite sure what you mean. In both cases we would be trying to avoid the issue, and in both cases we’d expect the bad stuff to happen.)
Consider a grader-optimizer AI that is optimizing for diamond-evaluations. Let us say that the intent was for the diamond-evaluations to evaluate whether the plans produced diamond-value (i.e. produced real diamonds). Then I can straightforwardly rewrite your paragraph to be about the grader-optimizer:
Presumably you think something in the above paragraph is now false? Which part?
No, when everything goes according to plan, the grader is perfect and the agent cannot manipulate it.
When you relax the assumption of perfection far enough, then the grader-optimizer manipulates the grader, and the values-executor fights us for our resources to use towards its inhuman values.
But I don’t care about what the agent “cares” about in and of itself, I care about the actual outcomes in the world.
The evaluations themselves are about you, your wishes, whatever the evaluations are supposed to mean, the reflectively-correct way to do the evaluations. It is the alignment between evaluations and human values that ensures that the outcomes are good (just as for a values-executor, it is the alignment between agent values and human values that ensures that the outcomes are good).
Maybe a better prompt is: can you tell a story of failure for a grader-optimizer, for which I can’t produce an analogous story of failure for a values-executor? (With this prompt I’m trying to ban stories / arguments that talk about “trying” or “caring” unless they have actual bad outcomes.)
(For example, if the story is “Bob the AI grader-optimizer figured out how to hack his brain to make it feel like he had lots of diamonds”, my analogous story would be “Bob the AI values-executor was thinking about how his visual perception indicated diamonds when he was rewarded, leading to a shard that cared about having a visual perception of diamonds, which he later achieved by hacking his brain to make it feel like he had lots of diamonds”.)
It sounds like there’s a difference between what I am imagining and what you are, which is causing confusion in both directions. Maybe I should back up for a moment and try to explain the mental model I’ve been using in this thread, as carefully as I can? I think a lot of your questions are probably downstream of this. I can answer them directly afterwards, if you’d like me to, but I feel like doing that without clarifying this stuff first will make it harder to track down the root of the disagreement.
Long explanation below…
—
What I am most worried about is “What conditions is the agent’s decision-making function ultimately sensitive to, at the mechanistic level?” (i.e. what does the agent “care” about, what “really matters” to the agent)[1]. The reason to focus on those conditions is that they are the real determinants of the agent’s future choices, and thereby the determinants of the agent’s generalization properties. If a CoinRun agent has learned that what “really matters” to it is the location of the coin, if its understanding[2] of where the coin is is the crucial factor determining its actions, then we can expect it to still try to navigate towards the coin even when we change its location. But if a CoinRun agent has learned that what “really matters” is its distance to the right-side corner, if its understanding of how far it is from that corner is the crucial factor determining its actions, then we can expect it to no longer try to navigate towards the coin when we change the coin’s location. Since we can’t predict what decisions the agent will make in OOD contexts ahead of time if we only know its past in-distribution decisions, we have to actually look at how the agent makes decisions to make those predictions. We want the agent to be making the right decisions for the right reasons: that is the primary alignment hurdle, in my book.
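The CoinRun point can be illustrated with a one-dimensional stand-in. This is not the actual CoinRun environment; the corridor, the two hand-written policies, and the positions are all assumptions chosen so that the policies agree in-distribution (coin at the right end) and come apart when the coin is moved.

```python
def coin_seeking_policy(agent_x, coin_x, width):
    """Decision hinges on where the coin is."""
    if coin_x > agent_x:
        return +1
    if coin_x < agent_x:
        return -1
    return 0

def go_right_policy(agent_x, coin_x, width):
    """Decision hinges on distance to the right edge; the coin is ignored."""
    return +1 if agent_x < width - 1 else 0

def reaches_coin(policy, coin_x, width=10, start_x=5, max_steps=50):
    """Roll the policy out in a 1-D corridor and report whether it hits the coin."""
    x = start_x
    for _ in range(max_steps):
        if x == coin_x:
            return True
        x += policy(x, coin_x, width)
    return x == coin_x

if __name__ == "__main__":
    for name, policy in [("coin-seeking", coin_seeking_policy),
                         ("go-right", go_right_policy)]:
        in_dist = reaches_coin(policy, coin_x=9)   # coin at the right end
        ood = reaches_coin(policy, coin_x=2)       # coin moved to the left
        print(f"{name}: in-distribution={in_dist}, OOD={ood}")
```

Both policies produce identical in-distribution behavior, so only inspecting which input the decision actually depends on distinguishes them before the coin is moved.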
The defining feature of a grader-optimizing agent is that it ultimately “cares” about the grader’s evaluations, its understanding of what the grading function would output is the crucial factor determining its choices. We specify an evaluation method at the outset like “Charles’ judgment”, and then we magically [for the sake of argument] get an agent that makes decisions based on what it thinks the evaluation method would say (The agent constantly asks itself “How many diamonds would Charles think this leads to, if he were presented with it?”). When I was describing what would happen if we produced a grader-optimizing agent “according to plan”, I meant that conditional on us having chosen a target diamond-production evaluation method, the agent actually makes its decisions according to its understanding of what that evaluation method would output (rather than according to its understanding of what some other evaluation method would output, or according to its understanding of how many diamonds it thinks it will produce, or according to some completely different decision-factor). I think what you had in mind when I posed the hypothetical where everything was going “according to plan” was that in addition to this, we also managed to pick an evaluation method that is inexploitable. That is not what I had in mind, because I make no parallel inexploitability assumption in the values-executor case.
The defining feature of a values-executing agent is that it ultimately “cares” about value-relevant consequences (i.e. if it has a “diamond-production” value, that means it makes decisions by considering how those decisions will affect diamond production), its understanding of the value-relevant consequences of its choices is the crucial factor determining those choices. We specify a decision-factor at the outset like “diamonds produced”, and then we magically [for the sake of argument] get an agent that makes decisions based on that decision-factor (The agent constantly asks itself “Will this help me produce more diamonds?”). In this case, going “according to plan” would mean that conditional on us having chosen diamond-production as the target value, the agent actually makes its decisions based on its understanding of the consequences on diamond-production (rather than according to its understanding of the consequences on human flourishing, or according to its understanding of how a human would evaluate the decision, or according to some completely other decision-factor).
In the grader-optimizer case, there are two different things that have to go right from an alignment standpoint:
Find a “diamonds produced” grader that is actor-inexploitable.
Install the decision-factor “will the grader output max evaluations for this” into an actor. It is assumed that we somehow make sure that “the grader” and “maximizing the grader’s evaluation outputs” are concepts correctly formed within the actor’s world model.
In other words, alignment success with grader-optimization requires not just success at getting an actor that makes its decisions for the right reasons (reason = “because it thinks the grader evaluates X highly”), but additionally, it requires success at getting a grader that makes its decisions for the right reasons (reason = “because it thinks X leads to diamonds”) in a way that is robust to whatever the actor can imagine throwing at it.
In the values-executor case, there is a single thing that has to go right from an alignment standpoint:
Install the decision-factor “will this produce diamonds” into an actor. It is assumed that we somehow make sure that “diamond” and “producing diamonds” are concepts correctly formed within the actor’s world model.
In other words, alignment success with values-execution just requires success at getting an actor that makes its decisions for the right reasons (reason = “because it thinks X leads to diamonds”). There isn’t an analogous other step because there’s no indirection, no second program, no additional evaluation method for us to specify or to make inexploitable. We the designers don’t decide on some “perfect” algorithm that the actor must use to evaluate plans for diamond production, or some “perfect” algorithm for satisfying its diamond-production value; that isn’t part of the plan. In fact, we don’t directly specify any particular fixed procedure for doing evaluations. All we require is that from the actor’s perspective, “diamond production” must be the crucial decision-factor that all of its decisions hinge on in a positive way.
In the values-executor case, we want the actor itself to decide how to best achieve diamond-production. We want it to use its own capabilities to figure out how to examine plans, how to avoid silly mistakes, how to stop others from fooling it, how to improve its own evaluations etc. (If we have a grader-optimizer doing analogous things, it will be deciding how to best achieve maximum grader evaluations, not how to best achieve diamond-production.) The actor’s diamond-production value need not be inexploitable for this to happen. Even if we have correctly instilled our intended decision-factor, there can still be capability failures, where the actor is too dumb to make decisions that actually work to produce diamonds, or where an adversary tricks the actor into implementing a plan that the actor thinks will produce diamonds, but which will really produce cubic zirconia. (And analogously for grader-optimizers, I would state “
The requirement I am relying on is just that the actor’s choices are hinging on the right decision-factor, meaning that it is in fact trying to do the thing we intended it to. In the values-executor case, we can thus offload onto an intent-aligned actor the work of improving its diamond-production capability + staying aligned to diamond-production, without us needing to fulfill an inexploitability invariant anywhere. (And in the grader-optimizer case, we can offload onto an intent-aligned actor the work of improving its grader evaluation-maximization capability + staying aligned to maximizing diamond-production evaluations, without us needing to fulfill an inexploitability invariant anywhere. But note that these are actors intent-aligned to different things: one to producing diamonds, the other to producing evaluations of diamond-production. In order to make the two equal, the evaluation method in question must be inexploitable.)
If I try to anticipate the concern with the above, it would be with the part where I said
with the concern being that I am granting something special in the values-executor case that I am not granting in the grader-optimizer case. But notice that the grader-optimizer case has an analogous requirement, namely
In both cases, we need to entrain specific concepts into the actor’s world model. In both cases, there’s no requirement that the actor’s model of those concepts is inexploitable (i.e. that there’s no way to make the values-executor think it made a diamond when the values-executor really made a cubic zirconia / that there’s no way to make the grader-optimizer think the human grader gave them a high score when the grader-optimizer really got tricked by a DeepFake), just that they have the correct notion in their head. I don’t see any particular reason why “diamond” or “helping” or “producing paperclips” would be a harder concept to form in this way than the concept of “the grader”. IMO it seems like entraining a complex concept into the actor’s world model should be approximately a fixed cost, one which we need to pay in either case. And even if getting the actor to have a correctly-formed concept of “helping” is harder than getting the actor to have a correctly-formed concept of “the grader”, I feel quite strongly that that difficulty delta is far far smaller than the difficulty of finding an inexploitable grader.
On balance, then, I think the values-executor design seems a lot more promising.
[1] This ultimately cashes out to the sorts of definitions used in “Discovering Agents”, where the focus is on what factors the agent’s policy adapts to.
[2] Substitute the word “representation” or “prediction” for “understanding” if you like, in this comment.
Thanks for writing up the detailed model. I’m basically on the same page (and I think I already was previously) except for the part where you conclude that values-executors are a lot more promising.
Your argument has the following form:
(Here, A = “grader-optimizers”, B = “values-executors”. X = “it’s hard to instill the cognition you want” and Y = “the evaluator needs to be robust”.)
I agree entirely with this argument as I’ve phrased it here; I just disagree that this implies that values-executors are more promising.
I do agree that when you have a valid argument of the form above, it strongly suggests that approach B is better than approach A. It is a good heuristic. But it isn’t infallible, because it’s not obvious that “solve X, then use the approach” is the best plan to consider. My response takes the form:
(Here Z = “it’s hard to accurately evaluate plans produced by the model”.)
Having made these arguments, if the disagreement persists, I think you want to move away from discussing X, Y and Z abstractly, and instead talk about concrete implications that are different between the two situations. Unfortunately I’m in the position of claiming a negative (“there aren’t major alignment-relevant differences between A and B that we currently know of”).
I can still make a rough argument for it:
Solving X requires solving Z: If you don’t accurately evaluate plans produced by the model, then it is likely that you’ll positively reinforce thoughts / plans that are based on different values than the ones you wanted, and so you’ll fail to instill values. (That is, “fail to robustly evaluate plans → fail to instill values”, or equivalently, “instilling values requires robust evaluation of plans”, or equivalently, “solving X requires solving Z”.)
Solving Z implies solving Y: If you accurately evaluate plans produced by the model, then you have a robust evaluator.
Putting these together we get that solving X implies solving Y. This is definitely very loose and far from a formal argument giving a lot of confidence (for example, maybe an 80% solution to Z is good enough for dealing with X for approach B, but you need a 99+% solution to Z to deal with Y for approach A), but it is the basic reason why I’m skeptical of the “values-executors are more promising” takeaway.
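Stated as bare propositional logic, the two steps above compose as in the Lean 4 sketch below. The big assumption is that X, Y and Z are treated as all-or-nothing propositions, which is exactly what the "80% solution vs. 99+% solution" caveat calls into question, so this shows the shape of the argument and nothing more. The theorem and hypothesis names are made up for illustration.

```lean
-- Assumption of this sketch: X, Y, Z are yes/no propositions ("this problem is solved").
theorem solvingX_implies_solvingY (X Y Z : Prop)
    (x_requires_z : X → Z)  -- "solving X requires solving Z"
    (z_implies_y : Z → Y)   -- "solving Z implies solving Y"
    : X → Y :=
  fun hx => z_implies_y (x_requires_z hx)
```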
I don’t really expect you to be convinced (if I had to say why it would be “you trust much more in mechanistic models rather than abstract concepts and patterns extracted from mechanistic stories”). I’m not sure what else I can unilaterally do—since I’m making a negative claim I can’t just give examples. I can propose protocols that you can engage in to provide evidence for the negative claim:
1. You could propose a solution that solves X but doesn’t solve Y. That should either change my mind, or I should argue why your solution doesn’t solve X, or does solve Y (possibly with some “easy” conversion), or has some problem that makes it unsuitable as a plausible solution.
2. You could propose a failure story that involves Y but not X, and so only affects approach A and not approach B. That should either change my mind, or I should argue why actually there’s an analogous (similarly-likely) failure story that involves X and so affects approach B as well.
(These are very related—given a solution S that solves X but not Y from (1), the corresponding failure story for (2) is “we build an AI system using approach B with solution S, but it fails because of Y”.)
If you’re unable to do either of the above two things, I claim you should become more skeptical that you’ve carved reality at the joints, and more convinced that actually both X and Y are shadows of the deeper problem Z, and you should be analyzing Z rather than X or Y.
Ok I think we’re converging a bit here.
I agree that there’s a deeper problem Z that carves reality at its joints, where if you solved Z you could make safe versions of both agents that execute values and agents that pursue grades. I don’t think I would name “it’s hard to accurately evaluate plans produced by the model” as Z though, at least not centrally. In my mind, Z is something like cognitive interpretability/decoding inner thoughts/mechanistic explanation, i.e. “understanding the internal reasons underpinning the model’s decisions in a human-legible way”.
For values-executors, if we could do this, during training we could identify which thoughts our updates are reinforcing/suppressing and be selective about what cognition we’re building, which addresses your point 1. In that way, we could shape it into having the right values (making decisions downstream of the right reasons), even if the plans it’s capable of producing (motivated by those reasons) are themselves too complex for us to evaluate. Likewise, for grader-optimizers, if we could do this, during deployment we could identify why the actor thinks a plan would be highly grader-evaluated (is it just because it looked for and found an adversarial grader-input?) without necessarily needing to evaluate the plan ourselves.
In both cases, I think being able to do process-level analysis on thoughts is likely sufficient, without robustly object-level grading the plans that those thoughts lead to. To me, robust evaluation of the plans themselves seems kinda doomed for the usual reasons. Stuff like how plans are recursive/treelike, and how plans can delegate decisions to successors, and how if the agent is adversarially planning against you and sufficiently capable, you should expect it to win, even if you examine the plan yourself and can’t tell how it’ll win.
That all sounds right to me. So do you now agree that it’s not obvious whether values-executors are more promising than grader-optimizers?
Minor thing:
Jtbc, this would count as “accurately evaluating the plan” to me. I’m perfectly happy for our evaluations to take the form “well, we can see that the AI’s plan was made to achieve our goals in the normal way, so even though we don’t know the exact consequences we can be confident that they will be good”, if we do in fact get justified confidence in something like that. When I say we have to accurately evaluate plans, I just mean that our evaluations need to be correct; I don’t mean that they have to be based on a prediction of the consequences of the plan.
I do agree that cognitive interpretability/decoding inner thoughts/mechanistic explanation is a primary candidate for how we can successfully accurately evaluate plans.
Obvious? No. (It definitely wasn’t obvious to me!) It just seems more promising to me on balance given the considerations we’ve discussed.
If we had mastery over cognitive interpretability, building a grader-optimizer wouldn’t yield an agent that really stably pursues what we want. It would yield an agent that really wants to pursue grader evaluations, plus an external restraint to prevent the agent from deceiving us (“during deployment we could identify why the actor thinks a plan would be highly grader-evaluated”). Both of those are required at runtime in order to safely get useful work out of the system as a whole. The restraint is a critical point of failure which we are relying on ourselves/the operator to actively maintain. The agent under restraint doesn’t positively want that restraint to remain in place and not-fail; the agent isn’t directing its cognitive horsepower towards ensuring that its own thoughts are running along the tracks we intended it to. It’s safe but only in a contingent way that seems unstable to me, unnecessarily so.
If we had that level of mastery over cognitive interpretability, I don’t understand why we wouldn’t use that tech to directly shape the agent to want what we want it to want. And I think I’d say basically the same thing even at lesser levels of mastery over the tech.
Cool, yes I agree. When we need assurances about a particular plan that the agent has made, that seems like a good way to go. I also suspect that at a certain level of mechanistic understanding of how the agent’s cognition is developing over training & what motivations control its decision-making, it won’t be strictly required for us to continue evaluating individual plans. But that, I’m not too confident about.
Sure, that all sounds reasonable to me; I think we’ve basically converged.
The main reason is that we don’t know ourselves what we want it to want, and we would instead like it to follow some process that we like (e.g. just do some scientific innovation and nothing else, help us do better philosophy to figure out what we want, etc). This sort of stuff seems like a poor fit for values-executors. Probably there will be some third, totally different mental architecture for such tasks, but if you forced me to choose between values-executors or grader-optimizers, I’d currently go with grader-optimizers.
This is what I did intend, and I will affirm it. I don’t know how your response amounts to “I don’t buy this argument.” Sounds to me like you buy it but you don’t know anything else to do?
In this post, I have detailed an alternative which does not work at cross-purposes in this way.
Values-execution. Diamond-evaluation error-causing plans exist and are stumble-upon-able, but the agent wants to avoid errors.
Grader-optimization. The agent seeks out errors in order to maximize evaluations.
Yes, and in particular I think direct-goal approaches do not avoid the issue. I can make an analogous claim for them:
“From the perspective of the human-AI system overall, having an AI motivated by direct goals is building a system that works at cross purposes with itself, as the human puts in constant effort to ensure that the direct goal embedded in the AI is “hardened” to represent human values as well as possible, while the AI is constantly searching for upwards-errors in the instilled values (i.e. things that score highly according to the instilled values but lowly according to the human).”
Like, once you broaden to the human-AI system overall, I think this claim is just “A principal-agent problem / Goodhart problem involves two parts of a system working at cross purposes with each other”, which is both (1) true and (2) unavoidable (I think).
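To make “searching for upwards-errors” concrete, here is a toy Python sketch of the selection effect. Everything in it is an assumption for illustration: plans have a Gaussian “true” value, the score is that value plus independent Gaussian noise, and the agent simply picks the highest-scored plan. It models only selection under an imperfect score, and takes no position on which agent designs are well described that way.

```python
import random

def select_best(num_plans, noise_sd, rng):
    """Sample plans and return (true_value, score) for the highest-scored one."""
    best = None
    for _ in range(num_plans):
        true_value = rng.gauss(0.0, 1.0)                # hypothetical "real" goodness
        score = true_value + rng.gauss(0.0, noise_sd)   # imperfect evaluation of it
        if best is None or score > best[1]:
            best = (true_value, score)
    return best

def mean_upward_error(num_plans, noise_sd=1.0, trials=2000, seed=0):
    """Average (score - true_value) of the selected plan over many trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        true_value, score = select_best(num_plans, noise_sd, rng)
        total += score - true_value
    return total / trials

if __name__ == "__main__":
    # More search over the same noisy score means a larger average overestimate.
    for n in (1, 10, 100):
        print(n, round(mean_upward_error(n), 2))
```

With a single candidate the evaluation error averages out to roughly zero; as the number of candidates grows, the selected plan’s score increasingly overstates its true value, even though no individual evaluation is biased.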
The part of my response that you quoted is arguing for the following claim:
“If you are analyzing the AI system in isolation (i.e. not including the human), I don’t see an argument that says [grader-optimization would violate the non-adversarial principle] and doesn’t say [values-execution would violate the non-adversarial principle]”.
As I understand it you are saying “values-execution wants to avoid errors but grader-optimization does not”. But I’m not seeing it. As far as I can tell the more correct statements are “agents with metacognition about their grader / values can make errors, but want to avoid them” and “it is a type error to talk about errors in the grader / values for agents without metacognition about their grader / values”.
(It is a type error in the latter case because what exactly are you measuring the errors with respect to? Where is the ground truth for the “true” grader / values? You could point to the human, but my understanding is that you don’t want to do this and instead just talk about only the AI cognition.)
For reference, in the part that you quoted, I was telling a concrete story of a values-executor with metacognition, and saying that it too had to “harden” its values to avoid errors. I do agree that it wants to avoid errors. I’d be interested in a concrete example of a grader-optimizer with metacognition that doesn’t want to avoid errors in its grader.
Like, in what sense does Bill not want to avoid errors in his grader?
I don’t mean that Bill from Scenario 2 in the quiz is going to say “Oh, I see now that actually I’m tricking myself about whether diamonds are being created, let me go make some actual diamonds now”. I certainly agree that Bill isn’t going to try making diamonds, but who said he should? What exactly is wrong with Bill’s desire to think that he’s made a bunch of diamonds? Seems like a perfectly coherent goal to me; it seems like you have to appeal to some outside-Bill perspective that says that actually the goal was making diamonds (in which case you’re back to talking about the full human-AI system, rather than the AI cognition in isolation).
What I mean is that Bill from Scenario 2 might say “Hmm, it’s possible that if I self-modify by sticking a bunch of electrodes in my brain, then it won’t really be me who is feeling the accomplishment of having lots of diamonds. I should do a bunch of neuroscience and consciousness research first to make sure this plan doesn’t backfire on me”.