I think the world where H is true is a good world, because it’s a world where we are much closer to understanding and predicting how sophisticated models generalize.
This seemed like a really surprising sentence to me. If the model is an agent, doesn’t that pull in all the classic concerns related to treacherous turns and so on? Whereas a non-agent probably won’t have an incentive to deceive you?
Even if the model is an agent, you still need to be able to understand its goals from its internal representation. Which could mean, for example, understanding what a deep neural network is doing. And that doesn’t appear to be much easier than the original task of “understand what a model, for example a deep neural network, is doing”.