Evan Hubinger’s Conditioning Predictive Models sequence describes this scenario in detail.
In a great deal of detail, apparently, since it has a recommended reading time of 131 minutes.
Well, at least a subset of the sequence focuses on this. I read the first two essays and was pessimistic enough about the titular approach that I moved on.
Here’s a relevant quote from the first essay in the sequence:

Furthermore, most of our focus will be on ensuring that your model is attempting to predict the right thing. That’s a very important thing almost regardless of your model’s actual capability level. As a simple example, in the same way that you probably shouldn’t trust a human who was doing their best to mimic what a malign superintelligence would do, you probably shouldn’t trust a human-level AI attempting to do that either, even if that AI (like the human) isn’t actually superintelligent.
Also, I don’t recommend reading the entire sequence, if that was an implicit question you were asking. It was more of a “Hey, if you are interested in seeing this scenario fleshed out in significantly greater rigor, you might want to take a look at this sequence!”