On the one hand, this seems to echo Stuart Armstrong’s take on building oracle AI (paper here, coauthored with Sandberg and Bostrom), where we might summarize the starting intuition as “build AI that is not an act-based agent so that we avoid Goodharting in ways that pose an x-risk”. On the other hand, I remain suspicious of the idea that we can avoid dangerous Goodharting: optimizing for the measure of a variable rather than the variable itself is baked in at such a deep level that, if we think we’ve found a way to overcome Goodharting, I’m inclined to think it more likely that we’ve fooled ourselves or failed to see far enough. Since you’ve proposed an idea rather than something very specific I can’t say much more, but I think approaches of this class are unlikely to work, and in this case specifically my thinking cashes out as predicting we’d never see this Scientist AI reach a point where we could trust it to do what we mean.
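To make the “measure of a variable rather than the variable itself” point concrete, here is a minimal toy sketch (my own illustration, not anything from the paper; the functions and names are made up): a proxy that locally tracks the true target, where hard-optimizing the proxy runs off to the boundary and away from what we meant.

```python
# Toy illustration of Goodhart's law (a hypothetical sketch, not from the
# Armstrong/Sandberg/Bostrom paper): we care about a latent variable, but
# the agent can only optimize a measurement of it. The measure agrees with
# the true value over ordinary inputs and decouples under hard optimization.

def true_value(x: float) -> float:
    """The variable we actually care about: peaks at x = 5."""
    return -(x - 5) ** 2

def measure(x: float) -> float:
    """A proxy measurement: below the peak it moves with the true value,
    but it keeps rewarding larger x long after the true value falls off."""
    return true_value(x) + x ** 2  # simplifies to 10 * x - 25

candidates = [i / 10 for i in range(200)]   # x in [0.0, 19.9]
x_measure = max(candidates, key=measure)    # hard-optimizes the proxy
x_true = max(candidates, key=true_value)    # what we actually wanted

print(f"optimizing the measure picks x = {x_measure}")     # 19.9, the boundary
print(f"optimizing the true value picks x = {x_true}")      # 5.0, the peak
print(f"true value at the proxy optimum: {true_value(x_measure):.1f}")  # badly negative
```

The point of the sketch is only that the divergence shows up at the optimum, not in ordinary operation, which is why I expect it to be hard to notice that we’ve fooled ourselves.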
Quite possibly. But I suspect that means we will not be able to trust any AI to DWIM.