I guess maybe you see two discrepancies vs one and conclude that two is worse than one? I don’t really buy that; it seems like it depends on the size of the discrepancies.
For example, if you imagine an AI that’s optimizing for my evaluation of good, I think the discrepancy between “Rohin’s directly instilled goals” and “Rohin’s CEV” is pretty small and I am pretty happy to ignore it. (Put another way, if that was the only source of misalignment risk, I’d conclude misalignment risk was small and move on to some other area.) So the only one that matters in this case of grader optimization is the discrepancy between “plans Rohin evaluates as good” and “Rohin’s directly instilled goals”.
I interpret Alex as arguing not merely that there are two difficulties versus one, but that there is an additional difficulty on top of the original one. From this perspective, having two is more of an issue than having one, because you have to address strictly more things.
This makes me wonder, though, whether there isn’t some sort of directionality question underlying the debate here. Because if you assume the “difficulties” are only positive numbers, then if the difficulty of direct instillation is d_instillation and the difficulty of grader optimization is d_instillation + d_evaluation, there’s no debate that the latter is bigger than the former.
But if you allow directionality (even in one dimension), then there’s the risk that the sum leads to less difficulty in total (by having d_evaluation point in the opposite direction along that dimension). That being said, these two difficulties seem strictly additive, in the sense that I don’t currently see how the difficulty of evaluation could partially cancel the difficulty of instillation.
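Spelling out that arithmetic in the thread’s d_instillation / d_evaluation notation (a toy formalization of the point above, not something either commenter committed to): if both difficulties are non-negative scalars, then

d_instillation + d_evaluation ≥ d_instillation, with equality only when d_evaluation = 0;

whereas a total difficulty smaller than d_instillation alone, i.e. |d_instillation + d_evaluation| < |d_instillation|, is possible only if d_evaluation can carry a sign (or direction) that opposes d_instillation.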
Two responses:
1. Grader-optimization has the benefit that you don’t have to specify what values you care about in advance. Specifying values up front is a difficulty faced by value-executors but not by grader-optimizers.
2. Part of my point is that the machinery you need to solve evaluation-problems is also needed to solve instillation-problems, because fundamentally they are shadows of the same problem, so I’d estimate d_evaluation at close to 0 in your equations once you have dealt with d_instillation.
Having direct-values of “Rohin’s current values” and having direct-values of “Rohin’s CEV” both seem fine. There is, however, a significant discrepancy between “grader-optimize the plans Rohin evaluates as good” and “directly-value Rohin’s current values.”
In particular, the first line seems to speculate that values-AGI is substantially more robust to differences in values. If so, I agree. You don’t need “perfect values” in an AGI (they probably have to be pretty good, just not adversarially good). Whereas strongly-upwards-misspecifying the Rohin-grader on even a single plan in the exponential-in-time planspace will (almost always) ruin the whole ballgame in the limit.
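A minimal toy sketch of that last point (illustrative only; the plan names and numbers below are made up, and this is not anyone’s actual setup): the grader below agrees with the true evaluation on every plan but one, and argmaxing the grade still hands the outcome to that single overrated plan.

```python
import random

random.seed(0)

# Toy "plan space" with a ground-truth evaluation for each plan.
plans = [f"plan_{i}" for i in range(10_000)]
true_value = {p: random.uniform(0.0, 1.0) for p in plans}

# The grader matches the true evaluation on 9,999 of the 10,000 plans...
grader = dict(true_value)
# ...but strongly overrates a single plan that is actually terrible.
grader["plan_1234"] = 100.0
true_value["plan_1234"] = -10.0

# Grader-optimization: pick whichever plan the grader rates highest.
chosen = max(plans, key=grader.get)
print(chosen, grader[chosen], true_value[chosen])
# -> plan_1234 100.0 -10.0 : the single upward error decides the outcome.
```

The sketch only shows that argmax over a large plan space amplifies the grader’s single worst upward error; it says nothing about how likely such an error is in practice.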
I understand you to have just said: “the first line seems to speculate that values-AGI is substantially more robust to differences in values”.
The thing that I believe is that an intelligent, reflective, careful agent with a decisive strategic advantage (DSA) will tend to produce outcomes that are similar in value to what would be done by that agent’s CEV. In particular, I believe this because the agent is “trying” to do what its CEV would do and has the power to do what its CEV would do, so it will likely succeed at this.
I don’t know what you mean by “values-AGI is more robust to differences in values”. What values are different in this hypothetical?
I do think that values-AGI with a DSA is likely to produce outcomes similar to CEV-of-values-AGI.
It is unclear whether values-AGI with a DSA is going to produce outcomes similar to CEV-of-Rohin (because this depends on how you built values-AGI and whether you successfully aligned it).
Sounds right. How does this answer my point 4?