But, having good predictive accuracy is instrumentally useful for maximizing the reward signal, so we can expect that its implicit representation of the world continually improves (i.e., it comes to find a nice efficient encoding). We don’t have to worry about this—the AI is incentivized to get this right.
The AI is incentivized to get this right only in directions that increase approval. If the AI discovers something the human operator would disapprove of learning, it is incentivized to obscure that fact or act as if it didn’t know it. (This works both for “oh, here’s an easy way to kill all humans” and “oh, it turns out God isn’t real.”)
Yes, but its underlying model is still accurate, even if it doesn’t reveal that to us? I wasn’t claiming that the AI would reveal to us all of the truths it learns.
Perhaps I misunderstand your point.
This depends on which it thinks we would approve of more: it having an accurate model and deceiving us, or it having a model that is inaccurate in exactly the ways we want it to be inaccurate. Some algorithmic bias work is of the form “the system shouldn’t take in inputs X, or draw conclusions Y, because that violates a deontological rule, and simple accuracy-maximization doesn’t incentivize following that rule.”
My point is something like “the genius of approval-directed agency is that it grounds out every meta-level in ‘approval,’ but this is also (potentially) the drawback of approval-directed agency.” Specifically, for any potentially good property the system might have (like epistemic accuracy), you need to check whether that property actually maximizes approval in all cases and for all users, because if it doesn’t, then the approval-directed agent is incentivized to not have that property.
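Here is a toy sketch of that check, with made-up numbers. Everything in it (the `Policy` class, the `approval_directed_choice` function, the accuracy and approval scores) is hypothetical and only illustrates the argument; it is not a model of any real system. The selection rule consults approval alone, so accuracy survives only in the case where the overseer happens to approve of it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    accuracy: float  # hypothetical: how well the internal model matches the world
    approval: float  # hypothetical: how much the overseer approves of this policy


def approval_directed_choice(policies):
    """Pick the policy the overseer approves of most; accuracy is never consulted."""
    return max(policies, key=lambda p: p.approval)


# Case 1: the overseer approves of accuracy, so the two objectives agree.
aligned = [
    Policy("accurate model", accuracy=0.9, approval=0.9),
    Policy("distorted model", accuracy=0.5, approval=0.4),
]

# Case 2: the overseer disapproves of a true conclusion, so the objectives diverge.
divergent = [
    Policy("accurate model", accuracy=0.9, approval=0.3),
    Policy("distorted model", accuracy=0.5, approval=0.8),
]

choice_aligned = approval_directed_choice(aligned)
choice_divergent = approval_directed_choice(divergent)

print(choice_aligned.name)
print(choice_divergent.name)
```

In the second case the agent keeps the less accurate model, because nothing in the objective rewards accuracy for its own sake.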
[The deeper philosophical question here is something like “does ethics backchain or forwardchain?”, as we’re grounding things out either in what we will believe or in what we believe now. Approval-direction is more the latter, and CEV-like things are more the former.]
Note that I wasn’t talking about approval-directed agents in the part you originally quoted. I was saying that normal maximizers will learn to build good models as part of capability generalization.
Oh! Sorry, I missed the “How does this compare with” line.