> Imagine someone who considers a few plans, grades them (e.g. “how good does my gut say this plan is?”), and chooses the best. They are not a grader-optimizer. They are not trying to navigate to the state where they propose and execute a plan which gets maximally highly rated by some evaluative submodule. They use a grading procedure to locally rate and execute plans, and may even locally think “what would make me feel better about this plan?”, but the point of their optimization isn’t “find the plan which makes me feel as good as globally possible.”
The way I think about this situation for myself as a human is that the more plans I consider and the wider / more global my search process is, the more likely it is that I hit upon an especially good “out of the box” plan, but also the more likely it is that I hit upon some “adversarial input” (in quotes because I’m not sure what you or I mean by this) and end up doing something really bad. It seems there are two things I can do about this:
1. Try to intuitively or quantitatively optimize the search process itself, as far as how many plans to consider, where to direct the search, etc., to get the best tradeoff between the two outcomes.
2. Try to improve my evaluation process so that I can afford to do wider searches without taking excessive risk.
Do you have any objections/disagreements with this? Secondarily, if as a result of 1 and 2 I’m doing a fairly wide search and considering many plans, doesn’t it stop making sense at some point to say “They are not a grader-optimizer.”?
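As a toy illustration of the tradeoff behind 1 and 2, here is a minimal sketch (every function and number below is invented for illustration, not taken from the discussion): a “true” value function, an imperfect evaluator, and a small region of plan-space that the evaluator wildly overrates. Widening the search raises the best score the evaluator reports, but past some width the selected plan is almost always one that exploits the evaluator.

```python
import random

random.seed(0)

def true_value(plan: float) -> float:
    # How good the plan actually is; the best possible plan is at 0.7.
    return -abs(plan - 0.7)

def my_evaluation(plan: float) -> float:
    # My imperfect grading of the plan: it mostly tracks true_value, but a tiny
    # sliver of plan-space is an "adversarial input" that I drastically overrate.
    if 0.001 < plan < 0.002:
        return 10.0
    return true_value(plan) + random.gauss(0, 0.05)

for n_plans in [10, 100, 10_000]:          # how wide / global the search is
    plans = [random.random() for _ in range(n_plans)]
    scored = [(my_evaluation(p), p) for p in plans]
    best_score, chosen = max(scored)        # execute the plan I grade highest
    print(f"searched {n_plans:>6} plans: my score {best_score:+.2f}, "
          f"actual value {true_value(chosen):+.2f}")
```

In this framing, point 1 is about choosing n_plans (and where the samples come from), while point 2 is about shrinking the overrated sliver and the noise so that a larger n_plans stays safe.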
> This includes “What would this specific and superintelligent CEV-universe-simulation say about this plan?”.
> This doesn’t include (somehow) getting an AI which correctly computes what program would be recommended by AGI designers in an altruistic and superintelligent branch of humanity, and then the AI executes that program and shuts itself off without doing anything else.[5]
But isn’t 1 here at least as good as 2, since the CEV-universe-simulation could always compute X = [the program that would be recommended by AGI designers in an altruistic and superintelligent branch of humanity] and then return 1 iff input-plan = ‘run X then shut itself off without doing anything else’ (by doing a simple text match), 0 otherwise, so there’s no chance of adversarial inputs? Not to say this is a realistic way of getting an aligned AGI, but just that your argument seems to be proving too much, if it’s saying that 2 is safer/better than 1.
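Here is a minimal sketch of the grader described in this construction, with placeholder values (computing X is of course where all the difficulty lives, and this sketch does not touch it). Because the grader accepts exactly one string, its set of maximally-rated plans is a singleton, so a plan-search has nothing to exploit:

```python
# Stand-in for the program the CEV-universe-simulation would compute; this
# placeholder string is purely illustrative.
X = "<program recommended by AGI designers in an altruistic, superintelligent branch of humanity>"

def grader(input_plan: str) -> int:
    # Simple text match, as in the comment: exactly one plan scores 1, everything else scores 0.
    target = f"run {X} then shut itself off without doing anything else"
    return 1 if input_plan == target else 0

print(grader(f"run {X} then shut itself off without doing anything else"))       # 1
print(grader("seize the grader's hardware and force its output register to 1"))  # 0
```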
> Try to improve my evaluation process so that I can afford to do wider searches without taking excessive risk.
Improve it with respect to what?
My attempt at a framework where “improving one’s own evaluator” and “believing in adversarial examples to one’s own evaluator” make sense:
The agent’s allegiance is to some idealized utility function U_ideal (like CEV). The agent’s internal evaluator Eval is “trying” to approximate U_ideal by reasoning heuristically. So now we ask Eval to evaluate the plan “do argmax w.r.t. Eval over a bunch of plans”. Eval reasons that, due to the way that Eval works, there should exist “adversarial examples” that score very highly on Eval but low on U_ideal. Hence, Eval concludes that U_ideal(plan) is low, where plan = “do argmax w.r.t. Eval”. So the agent doesn’t execute the plan “search widely and argmax”.
“Improving Eval” makes sense because Eval will gladly replace itself with Eval2 if it believes that Eval2 is a better approximation of U_ideal (and hence replacing itself will cause the outcome to score better on U_ideal).
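A minimal sketch of this framework, with every name, probability, and payoff invented for illustration: Eval is asked about the meta-plan “argmax over many plans using Eval itself”, reasons that such a search would very likely land on one of its own adversarial examples, and therefore assigns that meta-plan a low estimated U_ideal.

```python
# Hypothetical numbers throughout; only the shape of the reasoning matters.
BELIEVED_ADVERSARIAL_RATE = 1e-4   # Eval's estimate of how common its adversarial examples are
VALUE_IF_SEARCH_GOES_WELL = 1.0    # U_ideal if the wide argmax finds a genuinely great plan
VALUE_IF_FOOLED = -100.0           # U_ideal if the argmax lands on an Eval-fooling plan

def eval_estimate_of_u_ideal(plan: str) -> float:
    """Eval's heuristic estimate of U_ideal(plan), including for meta-plans about search."""
    if plan == "argmax over 1,000,000 candidate plans using Eval":
        p_fooled = 1 - (1 - BELIEVED_ADVERSARIAL_RATE) ** 1_000_000   # ~1.0
        return (1 - p_fooled) * VALUE_IF_SEARCH_GOES_WELL + p_fooled * VALUE_IF_FOOLED
    if plan == "pick the best of a few plans I already understand":
        return 0.5
    return 0.0

candidates = ["argmax over 1,000,000 candidate plans using Eval",
              "pick the best of a few plans I already understand"]
print(max(candidates, key=eval_estimate_of_u_ideal))   # the narrow search wins
```

Replacing Eval with Eval2 fits the same shape: Eval endorses the self-modification exactly when it expects the resulting outcomes to score better by its own estimate of U_ideal.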
Are there other distinct frameworks which make sense here? I look forward to seeing what design Alex proposes for “value child”.
This is tempting, but the problem is that I don’t know what my idealized utility function is (e.g., I don’t have a specification for CEV that I think would be safe or ideal to optimize for), so what does it mean to try to approximate it? Or consider that I only read about CEV one day on a blog, so what was I doing prior to that? Or, if I was supposedly trying to approximate CEV, I can change my mind about it if I realize that it’s a bad idea, but how does that fit into the framework?
My own framework is something like this:
The evaluation process is some combination of gut, intuition, explicit reasoning (e.g. cost-benefit analysis), doing philosophy, and cached answers.
I think there are “adversarial inputs” because I’ve previously done things that I later regretted, due to evaluating them highly in ways that I no longer endorse. I can also see other people sometimes doing obviously crazy things (which they may or may not later regret). I can see people (including myself) being persuaded by propaganda / crazy memes, so there must be a risk of persuading myself with my own bad ideas.
I can try to improve my evaluation process by doing things like:
look for patterns in my and other people’s mistakes
think about ethical dilemmas / try to resolve conflicts between my evaluative subprocesses
do more philosophy (think/learn about ethical theories, metaethics, decision theory, philosophy of mind, etc.)
talk (selectively) to other people
try to improve how I do explicit reasoning or philosophy
Yeah I think you’re on the right track.
A simple framework (that probably isn’t strictly distinct from the one you mentioned) would be that the agent has a foresight evaluation method that estimates “How good do I think this plan is?” and a hindsight evaluation method that calculates “How good was it, really?”. There can be plans that trick the foresight evaluation method relative to the hindsight one. For example, I can get tricked into thinking some outcome is more likely than it actually is (“The chances of losing my client’s money with this investment strategy were way higher than I thought they were.”) or thinking that some new state will be hindsight-evaluated better than it actually will be (“He convinced me that if I tried coffee, I would like it, but I just drank it and it tastes disgusting.”), etc.
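A toy sketch of this framing with made-up plans and scores: each plan gets a foresight estimate before acting and a hindsight evaluation afterwards, and the “tricking” plans are the ones with a large positive gap between the two.

```python
# Invented example plans and scores, only to make the foresight/hindsight gap concrete.
plans = {
    # plan: (foresight estimate, hindsight evaluation)
    "risky investment strategy": (0.9, -0.8),   # the bad outcome was likelier than I thought
    "try coffee":                (0.6, -0.2),   # the new state was worse in hindsight than predicted
    "boring index fund":         (0.4,  0.3),
}

for plan, (foresight, hindsight) in plans.items():
    gap = foresight - hindsight
    flag = "  <-- tricked my foresight evaluation" if gap > 0.5 else ""
    print(f"{plan:<26} foresight {foresight:+.1f}  hindsight {hindsight:+.1f}{flag}")
```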
The way you write this (especially the last sentence) makes me think that you see this attempt as being close to the only one that makes sense to you atm. Which makes me curious:
Do you think that you are internally trying to approximate your own U_ideal?
Do you think that you have ever made the decision (either implicitly or explicitly) not to evaluate all or most plans because you don’t trust your ability to do so for adversarial examples (as opposed to, for example, tractability issues)?
Can you think of concrete instances where you improved your own Eval?
Can you think of concrete instances where you thought you improved your own Eval but then regretted it later?
Do you think that your own changes to your Eval have been moving in the direction of your U_ideal?
> Do you have any objections/disagreements with this? Secondarily, if as a result of 1 and 2 I’m doing a fairly wide search and considering many plans, doesn’t it stop making sense at some point to say “They are not a grader-optimizer.”?
I wrote in the post:
> Value-child comes up with a hard-work plan as an effect of his cognition, not as a motivating cause—not because he only wants to come up with plans he himself will rate highly. He values working hard.
> As a corollary, grader-optimization is not synonymous with planning. Grader-optimization is when high plan-evaluations are the motivating cause of planning, where “I found a plan which I think leads to diamond” is the terminal goal, and not just a side effect of cognition (as it is for values-child).
Sorry if I’m just repeating something you read and understood, but I do feel like this criterion answers “no, this is still not grader-optimization; the effective search over lots of plans is still a side-effect of your cognition, not the terminal end.”
In particular, note that the strategy you described would not strongly want to be given the actual-highest-rated plan—or maybe it would want to know more about the plan as a curiosity, but not in order to evaluate and execute that plan. That’s one way in which saying “your strategy is not grader-optimization” constrains my anticipations in a useful-seeming way.
> But isn’t 1 here at least as good as 2, since the CEV-universe-simulation could always compute X = [the program that would be recommended by AGI designers in an altruistic and superintelligent branch of humanity] and then return 1 iff input-plan = ‘run X then shut itself off without doing anything else’ (by doing a simple text match), 0 otherwise, so there’s no chance of adversarial inputs? Not to say this is a realistic way of getting an aligned AGI, but just that your argument seems to be proving too much, if it’s saying that 2 is safer/better than 1.
This is a good point. I’m wondering about the type of the (presumably Cartesian) interface between the CEV-sim and the actor. First, CEV-sim shouldn’t be affectable by the input-plan unless and until they run some stats on it. Otherwise the actor could (maybe?) side-channel attack them via whatever computer registers the input-plan shows up in. And CEV-sim does have to infer what they’re being used for, at each invocation of the grader (since they don’t retain memory across counterfactuals).
a. That aside, if CEV-sim can just syntactically check whether the input-plan is the literal “run X then shut itself off” string, then your argument seems good.
b. If CEV-sim has to understand the actor’s latent state context (is it all in the plan?) in order to make sure that the purported X-running plan isn’t just running dangerous Y in another programming language… it seems like they can’t do this.
I feel like we’re dealing with (b) more than (a), so I’d say “no, 2 is safer than 1” tentatively.
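As a sketch of the difference between (a) and (b), with hypothetical plan formats: in case (a) the plan is a bare string and the grader only needs exact syntactic equality, while in case (b) the plan carries context that can change what “run X” means, so the grader would have to judge semantic equivalence to the approved behavior, which no string comparison can establish.

```python
# Hypothetical plan formats; case (b)'s body is deliberately left as the hard, unsolved part.
APPROVED = "run X then shut itself off without doing anything else"

def case_a_grader(input_plan: str) -> int:
    # (a) Plans are bare strings and the check is pure syntax: exact match or reject.
    return 1 if input_plan == APPROVED else 0

def case_b_grader(input_plan: str, latent_context: dict) -> int:
    # (b) Plans come with latent context that can change what "run X" actually does
    # (e.g. an environment in which "X" names dangerous program Y). Here the grader
    # must judge whether the plan's behavior is equivalent to the approved behavior,
    # which text matching cannot establish.
    raise NotImplementedError("requires understanding the plan, not just matching text")
```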
> This includes “What would this specific and superintelligent CEV-universe-simulation say about this plan?”.
> This doesn’t include (somehow) getting an AI which correctly computes what program would be recommended by AGI designers in an altruistic and superintelligent branch of humanity, and then the AI executes that program and shuts itself off without doing anything else.[5]
> But isn’t 1 here at least as good as 2, since the CEV-universe-simulation could always compute X = [the program that would be recommended by AGI designers in an altruistic and superintelligent branch of humanity] and then return 1 iff input-plan = ‘run X then shut itself off without doing anything else’ (by doing a simple text match), 0 otherwise, so there’s no chance of adversarial inputs? Not to say this is a realistic way of getting an aligned AGI, but just that your argument seems to be proving too much, if it’s saying that 2 is safer/better than 1.
Is your issue here that there exists a specific CEV-universe-simulation that makes 1 just as safe as 2, by basically emulating the latter situation? If so, why do you think this is a point against Alex’s claim (which strikes me more as saying “there are a lot more cases of 2 being safe than of 1”)?
Vivek—I replied to your comment in appendix C of today’s follow-up post, Alignment allows imperfect decision-influences and doesn’t require robust grading.