Thanks for writing up the detailed model. I’m basically on the same page (and I think I already was previously) except for the part where you conclude that values-executors are a lot more promising.
Your argument has the following form:
There are two problems with approach A: X and Y. In contrast, with approach B there’s only one problem, X. Consider the plan “solve X, then use the approach”. If everything goes according to plan, you get good outcomes with approach B, but bad outcomes with approach A because of problem Y.
(Here, A = “grader-optimizers”, B = “values-executors”. X = “it’s hard to instill the cognition you want” and Y = “the evaluator needs to be robust”.)
I agree entirely with this argument as I’ve phrased it here; I just disagree that this implies that values-executors are more promising.
I do agree that when you have a valid argument of the form above, it strongly suggests that approach B is better than approach A. It is a good heuristic. But it isn’t infallible, because it’s not obvious that “solve X, then use the approach” is the best plan to consider. My response takes the form:
X and Y are both shadows of a deeper problem Z, which we’re going to target directly. If you’re going to consider a plan, it should be “solve Z, then use the approach”. With this counterfactual, if everything goes according to plan, you get good outcomes with both approaches, and so this argument doesn’t advantage one over the other.
(Here Z = “it’s hard to accurately evaluate plans produced by the model”.)
Having made these arguments, if the disagreement persists, I think you want to move away from discussing X, Y and Z abstractly, and instead talk about concrete implications that are different between the two situations. Unfortunately I’m in the position of claiming a negative (“there aren’t major alignment-relevant differences between A and B that we currently know of”).
I can still make a rough argument for it:
1. Solving X requires solving Z: If you don’t accurately evaluate plans produced by the model, then it is likely that you’ll positively reinforce thoughts / plans that are based on different values than the ones you wanted, and so you’ll fail to instill values. (That is, “fail to robustly evaluate plans → fail to instill values”, or equivalently, “instilling values requires robust evaluation of plans”, or equivalently, “solving X requires solving Z”.)
2. Solving Z implies solving Y: If you accurately evaluate plans produced by the model, then you have a robust evaluator.
Putting these together we get that solving X implies solving Y. This is definitely very loose and far from a formal argument giving a lot of confidence (for example, maybe an 80% solution to Z is good enough for dealing with X for approach B, but you need a 99+% solution to Z to deal with Y for approach A), but it is the basic reason why I’m skeptical of the “values-executors are more promising” takeaway.
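The chaining step in this rough argument can be written out as a minimal Lean sketch. The proposition names `SolvedX`, `SolvedY`, and `SolvedZ` are hypothetical placeholders introduced only for illustration; they are opaque propositions, not formal definitions of the problems under discussion.

```lean
-- Placeholder propositions (hypothetical names, not definitions from the discussion):
--   SolvedX : the intended cognition/values can be instilled
--   SolvedY : the evaluator is robust
--   SolvedZ : plans produced by the model can be evaluated accurately
theorem solvedX_implies_solvedY (SolvedX SolvedY SolvedZ : Prop)
    (step1 : SolvedX → SolvedZ)   -- solving X requires solving Z
    (step2 : SolvedZ → SolvedY)   -- solving Z implies solving Y
    : SolvedX → SolvedY :=
  fun hx => step2 (step1 hx)
```

The sketch only shows that the conclusion follows if both premises hold and if "solved" is treated as all-or-nothing. It does not capture the caveat above, where a partial solution to Z might suffice for X but not for Y.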
I don’t really expect you to be convinced (if I had to say why, it would be “you trust mechanistic models much more than abstract concepts and patterns extracted from mechanistic stories”). I’m not sure what else I can unilaterally do; since I’m making a negative claim, I can’t just give examples. I can propose protocols that you can engage in to provide evidence for the negative claim:
1. You could propose a solution that solves X but doesn’t solve Y. That should either change my mind, or I should argue why your solution doesn’t solve X, or does solve Y (possibly with some “easy” conversion), or has some problem that makes it unsuitable as a plausible solution.
2. You could propose a failure story that involves Y but not X, and so only affects approach A and not approach B. That should either change my mind, or I should argue why actually there’s an analogous (similarly-likely) failure story that involves X and so affects approach B as well.
(These are very related—given a solution S that solves X but not Y from (1), the corresponding failure story for (2) is “we build an AI system using approach B with solution S, but it fails because of Y”.)
If you’re unable to do either of the above two things, I claim you should become more skeptical that you’ve carved reality at the joints, and more convinced that actually both X and Y are shadows of the deeper problem Z, and you should be analyzing Z rather than X or Y.
I agree entirely with this argument as I’ve phrased it here; I just disagree that this implies that values-executors are more promising. I do agree that when you have a valid argument of the form above, it strongly suggests that approach B is better than approach A. It is a good heuristic. But it isn’t infallible, because it’s not obvious that “solve X, then use the approach” is the best plan to consider.
Agreed. I don’t think it’s obvious either.
My response takes the form:
X and Y are both shadows of a deeper problem Z, which we’re going to target directly. If you’re going to consider a plan, it should be “solve Z, then use the approach”. With this counterfactual, if everything goes according to plan, you get good outcomes with both approaches, and so this argument doesn’t advantage one over the other. (Here Z = “it’s hard to accurately evaluate plans produced by the model”.)
I agree that there’s a deeper problem Z that carves reality at its joints, where if you solved Z you could make safe versions of both agents that execute values and agents that pursue grades. I don’t think I would name “it’s hard to accurately evaluate plans produced by the model” as Z though, at least not centrally. In my mind, Z is something like cognitive interpretability/decoding inner thoughts/mechanistic explanation, i.e. “understanding the internal reasons underpinning the model’s decisions in a human-legible way”.
For values-executors, if we could do this, during training we could identify which thoughts our updates are reinforcing/suppressing and be selective about what cognition we’re building, which addresses your point 1. In that way, we could shape it into having the right values (making decisions downstream of the right reasons), even if the plans it’s capable of producing (motivated by those reasons) are themselves too complex for us to evaluate. Likewise, for grader-optimizers, if we could do this, during deployment we could identify why the actor thinks a plan would be highly grader-evaluated (is it just because it looked for and found an adversarial grader-input?) without necessarily needing to evaluate the plan ourselves.
In both cases, I think being able to do process-level analysis on thoughts is likely sufficient, without robustly object-level grading the plans that those thoughts lead to. To me, robust evaluation of the plans themselves seems kinda doomed for the usual reasons: plans are recursive/treelike, plans can delegate decisions to successors, and if the agent is adversarially planning against you and sufficiently capable, you should expect it to win, even if you examine the plan yourself and can’t tell how it’ll win.
That all sounds right to me. So do you now agree that it’s not obvious whether values-executors are more promising than grader-optimizers?
Minor thing:
I don’t think I would name “it’s hard to accurately evaluate plans produced by the model” as Z though [...] Likewise, for grader-optimizers, if we could do this, during deployment we could identify why the actor thinks a plan would be highly grader-evaluated (is it just because it looked for and found an adversarial grader-input?) without necessarily needing to evaluate the plan ourselves.
Just to be clear, this would count as “accurately evaluating the plan” to me. I’m perfectly happy for our evaluations to take the form “well, we can see that the AI’s plan was made to achieve our goals in the normal way, so even though we don’t know the exact consequences we can be confident that they will be good”, if we do in fact get justified confidence in something like that. When I say we have to accurately evaluate plans, I just mean that our evaluations need to be correct; I don’t mean that they have to be based on a prediction of the consequences of the plan.
I do agree that cognitive interpretability/decoding inner thoughts/mechanistic explanation is a primary candidate for how we can successfully accurately evaluate plans.
That all sounds right to me. So do you now agree that it’s not obvious whether values-executors are more promising than grader-optimizers?
Obvious? No. (It definitely wasn’t obvious to me!) It just seems more promising to me on balance given the considerations we’ve discussed.
If we had mastery over cognitive interpretability, building a grader-optimizer wouldn’t yield an agent that really stably pursues what we want. It would yield an agent that really wants to pursue grader evaluations, plus an external restraint to prevent the agent from deceiving us (“during deployment we could identify why the actor thinks a plan would be highly grader-evaluated”). Both of those are required at runtime in order to safely get useful work out of the system as a whole. The restraint is a critical point of failure which we are relying on ourselves/the operator to actively maintain. The agent under restraint doesn’t positively want that restraint to remain in place and not-fail; the agent isn’t directing its cognitive horsepower towards ensuring that its own thoughts are running along the tracks we intended it to. It’s safe but only in a contingent way that seems unstable to me, unnecessarily so.
If we had that level of mastery over cognitive interpretability, I don’t understand why we wouldn’t use that tech to directly shape the agent to want what we want it to want. And I think I’d say basically the same thing even at lesser levels of mastery over the tech.
When I say we have to accurately evaluate plans, I just mean that our evaluations need to be correct; I don’t mean that they have to be based on a prediction of the consequences of the plan.
I do agree that cognitive interpretability/decoding inner thoughts/mechanistic explanation is a primary candidate for how we can successfully accurately evaluate plans.
Cool, yes I agree. When we need assurances about a particular plan that the agent has made, that seems like a good way to go. I also suspect that at a certain level of mechanistic understanding of how the agent’s cognition is developing over training & what motivations control its decision-making, it won’t be strictly required for us to continue evaluating individual plans. But that, I’m not too confident about.
Sure, that all sounds reasonable to me; I think we’ve basically converged.
I don’t understand why we wouldn’t use that tech to directly shape the agent to want what we want it to want.
The main reason is that we don’t know ourselves what we want it to want, and we would instead like it to follow some process that we like (e.g. just do some scientific innovation and nothing else, help us do better philosophy to figure out what we want, etc). This sort of stuff seems like a poor fit for values-executors. Probably there will be some third, totally different mental architecture for such tasks, but if you forced me to choose between values-executors or grader-optimizers, I’d currently go with grader-optimizers.
Ok I think we’re converging a bit here.