Yes, but its underlying model is still accurate, even if it doesn’t reveal that to us?
This depends on whether it thinks we would approve more of it having an accurate model and deceiving us, or of it having a model that is inaccurate in the ways we want it to be inaccurate. Some algorithmic bias work is of the form “the system shouldn’t take in inputs X, or draw conclusions Y, because that violates a deontological rule, and simple accuracy-maximization doesn’t incentivize following that rule.”
My point is something like “the genius of approval-directed agency is that it grounds out every meta-level in ‘approval,’ but this is also (potentially) the drawback of approval-directed agency.” Specifically, for any potentially good property the system might have (like epistemic accuracy) you need to check whether that actually in-all-cases for-all-users maximizes approval, because if it doesn’t, then the approval-directed agent is incentivized to not have that property.
[The deeper philosophical question here is something like “does ethics backchain or forwardchain?”, as we’re either grounding things out in what we will believe or in what we believe now; approval-direction is more the latter, and CEV-like things are more the former.]
Note that I wasn’t talking about approval-directed agents in the part you originally quoted. I was saying that normal maximizers will learn to build good models as part of capability generalization.
Oh! Sorry, I missed the “How does this compare with” line.