1. Intelligence ⇒ strong selection pressure ⇒ bad outcomes if the selection pressure is off target.
2. In the case of agents that are motivated to optimize evaluations of plans, this argument turns into “what if the agent tricks the evaluator”.
3. In the case of agents that pursue values / shards instilled by some other process, this argument turns into “what if the values / shards are different from what we wanted”.
4. To argue for one of these over the other, you need to compare these two arguments. However, this post is stating point 2 while ignoring point 3.
One thing that is not clear to me from your comment is what you make of Alex’s argument (as I see it) to the effect that “evaluation goals” are further away from “direct goals” than “direct goals” are from one another. If I run with this, it seems like an answer to your point 4 would be:
- with directly instilled goals, there will be some risk of discrepancy that can explode due to selection pressure;
- with evaluation-based goals, there is the same discrepancy as with directly instilled goals (because it’s hard to get your goal exactly right), plus an additional discrepancy between valuing “the evaluation of X” and valuing “X”.
I’m curious what you think of this claim, and whether it changes your take at all.
Sounds right. How does this answer my point 4? I guess maybe you see two discrepancies vs. one and conclude that two is worse than one? I don’t really buy that; it seems like it depends on the size of the discrepancies.
For example, if you imagine an AI that’s optimizing for my evaluation of good, I think the discrepancy between “Rohin’s directly instilled goals” and “Rohin’s CEV” is pretty small and I am pretty happy to ignore it. (Put another way, if that was the only source of misalignment risk, I’d conclude misalignment risk was small and move on to some other area.) So the only one that matters in this case of grader optimization is the discrepancy between “plans Rohin evaluates as good” and “Rohin’s directly instilled goals”.
I interpret Alex as arguing not just that there are two difficulties rather than one, but that there is an additional kind of difficulty. From this perspective, having two is worse than having one, because you have to address strictly more things.
This makes me wonder, though, whether there isn’t some sort of direction question underlying the debate here. If you assume the “difficulties” are non-negative numbers, and the difficulty of direct instillation is d_instillation while the difficulty of grader-optimization is d_instillation + d_evaluation, then there’s no debate that the latter is bigger than the former.
But if you allow directionality (even in one dimension), there is a risk that the sum leads to less difficulty in total (by having d_evaluation point in the opposite direction). That being said, these two difficulties seem strictly additive, in the sense that I don’t currently see how the difficulty of evaluation could partially cancel the difficulty of instillation.
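To make the scalar-versus-directional contrast concrete, here is a toy formalization (a sketch only, not anything stated in the discussion: it treats the two difficulties as vectors in some misalignment space, with total difficulty given by the norm of their sum):

```latex
% Toy sketch only: difficulties modeled as vectors, total difficulty = norm.

% Scalar (non-negative) picture: grader-optimization is never easier.
\[
  d_{\mathrm{instillation}} \;\le\; d_{\mathrm{instillation}} + d_{\mathrm{evaluation}},
  \qquad d_{\mathrm{evaluation}} \ge 0 .
\]

% Directional picture: partial cancellation is possible, but only if
% d_evaluation points sufficiently "against" d_instillation.
\[
  \lVert \vec d_{\mathrm{instillation}} + \vec d_{\mathrm{evaluation}} \rVert
  \;<\; \lVert \vec d_{\mathrm{instillation}} \rVert
  \quad\Longleftrightarrow\quad
  2\, \vec d_{\mathrm{instillation}} \cdot \vec d_{\mathrm{evaluation}}
  + \lVert \vec d_{\mathrm{evaluation}} \rVert^{2} \;<\; 0 .
\]
```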
Two responses:
1. Grader-optimization has the benefit that you don’t have to specify what values you care about in advance. This is a difficulty faced by value-executors but not by grader-optimizers.
2. Part of my point is that the machinery you need to solve evaluation-problems is also needed to solve instillation-problems, because fundamentally they are shadows of the same problem, so I’d estimate d_evaluation at close to 0 in your equations once you have dealt with d_instillation.
Having direct-values of “Rohin’s current values” and “Rohin’s CEV” both seem fine. There is, however, significant discrepancy between “grader-optimize the plans Rohin evaluates as good” and “directly-value Rohin’s current values.”
In particular, the first line seems to speculate that values-AGI is substantially more robust to differences in values. If so, I agree. You don’t need “perfect values” in an AGI (but probably they have to be pretty good; just not adversarially good). Whereas strongly-upwards-misspecifying the Rohin-grader on a single plan of the exponential-in-time planspace will (almost always) ruin the whole ballgame in the limit.
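A toy numerical sketch of the asymmetry being claimed here (a made-up construction, not from the discussion: both agents are crudely modeled as argmax over a finite list of plans, which is unfair to values-executors, but it shows why a single strongly-upward-misgraded plan is enough to dominate the search):

```python
# Toy illustration: why one strongly-upward-misspecified plan can ruin
# grader-optimization, while a values-executor only needs "pretty good" values.
import random

random.seed(0)

N_PLANS = 100_000                      # stand-in for a huge plan space
true_value = [random.gauss(0, 1) for _ in range(N_PLANS)]

# Grader: agrees with the true values everywhere except one plan, which it
# mistakenly scores as wonderful (the "adversarial input").
grader = list(true_value)
BAD_PLAN = 12_345
grader[BAD_PLAN] = 1_000.0

# Values-executor stand-in: its values are imperfect *everywhere* (small noise
# on every plan), but nowhere wildly misspecified.
imperfect_values = [v + random.gauss(0, 0.1) for v in true_value]

grader_choice = max(range(N_PLANS), key=lambda p: grader[p])
values_choice = max(range(N_PLANS), key=lambda p: imperfect_values[p])

print("grader-optimizer picks plan", grader_choice,
      "-> true value", round(true_value[grader_choice], 2))
print("values-executor picks plan", values_choice,
      "-> true value", round(true_value[values_choice], 2))
# Typically: the grader-optimizer lands exactly on the misgraded plan (roughly
# average true value), while the values-executor's pick is near the best plan.
```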
I understand you to have just said: “the first line seems to speculate that values-AGI is substantially more robust to differences in values”.
The thing that I believe is that an intelligent, reflective, careful agent with a decisive strategic advantage (DSA) will tend to produce outcomes similar in value to those that its CEV would produce. In particular, I believe this because the agent is “trying” to do what its CEV would do, it has the power to do what its CEV would do, and so it will likely succeed at this.
I don’t know what you mean by “values-AGI is more robust to differences in values”. What values are different in this hypothetical?
- I do think that values-AGI with a DSA is likely to produce outcomes similar to CEV-of-values-AGI.
- It is unclear whether values-AGI with a DSA is going to produce outcomes similar to CEV-of-Rohin (because this depends on how you built values-AGI and whether you successfully aligned it).