Thank you so much, Beth, for your extremely insightful comment! I really appreciate your time.
I completely agree with everything you said: that “you can make many very useful predictions about humans by modeling them as (for example) having some set of goals, habits, tendencies, and knowledge,” and that these insights will be very useful for alignment research.
I also agree that “it’s difficult to identify what a human’s intentions are just by having access to their brain.” This was actually the main point I wanted to get across; I guess it wasn’t clearly communicated. Sorry about the confusion!
My assertion was that in order to predict the interaction dynamics of a computationally irreducible agent with a complex deployment environment, there are two realistic options:
1. Run the agent in an exact copy of the environment and see what happens.
2. If the deployment environment is unknown, use the available empirical data to develop a simplified model of the system based on parsimonious first principles that are likely to be valid even in the unknown deployment environment. The predictions yielded by such models have a chance of generalizing out-of-distribution, although they will necessarily be limited in scope. (A toy sketch after this list illustrates the contrast.)
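To make the contrast concrete, here is a minimal sketch using Rule 30 (Wolfram’s standard example of a computationally irreducible system) as a stand-in for the agent, and a mean-field density estimate as the “parsimonious first-principles model.” This is purely illustrative, not a model of any actual agent or deployment environment:

```python
import numpy as np

def rule30_step(row: np.ndarray) -> np.ndarray:
    """One step of the Rule 30 cellular automaton with periodic boundaries."""
    left, right = np.roll(row, 1), np.roll(row, -1)
    return left ^ (row | right)  # Rule 30: new cell = left XOR (center OR right)

rng = np.random.default_rng(0)
state = rng.integers(0, 2, size=256, dtype=np.uint8)

# Option 1: to know the exact state at step N, simulate all N steps.
# For Rule 30 no shortcut is known -- that is what irreducibility means.
for _ in range(1000):
    state = rule30_step(state)
print("exact state (first 10 cells):", state[:10])

# Option 2: a coarse-grained, first-principles prediction. Rule 30 maps
# 4 of the 8 neighborhood patterns to 1, so a mean-field argument predicts
# a density of 1s near 0.5 for random initial conditions -- a prediction
# that generalizes to initial conditions we never tested, but says
# nothing about any individual cell.
print("density of 1s:", state.mean())
```

The exact simulation only answers questions about the one environment we copied; the density estimate travels to unseen initial conditions, but is limited in scope. That is exactly the trade-off between the two options above.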
When researchers try to predict intent from internal data, their assumptions/first principles (based on the limited empirical data they have) will probably not be guaranteed to be “valid even in the unknown deployment environment.” Hence, there is little robust reason to believe that predictions based on these model assumptions will generalize out-of-distribution.
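As a minimal, hypothetical sketch of the failure mode I have in mind: fit a model whose assumption (here, linearity) is consistent with all of the available empirical data, then query it out-of-distribution.

```python
import numpy as np

# The modeling assumption (linearity) fits all the empirical data we
# have, which comes from [0, 1] -- where sin(x) is nearly linear.
# Out-of-distribution, the assumption silently fails.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=100)
y_train = np.sin(x_train)

slope, intercept = np.polyfit(x_train, y_train, deg=1)  # fits very well

x_ood = 3.0
print("model prediction:", slope * x_ood + intercept)  # ~2.7
print("ground truth:    ", np.sin(x_ood))              # ~0.14
```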
At some points in your comment you use the criterion “likely to be valid”; at other points you use the criterion “guaranteed to be valid”. These are very different! I think almost everyone agrees that we’re unlikely to get predictions which are guaranteed to be valid out-of-distribution. But that’s true of every science apart from fundamental physics: they all apply coarse-grained models, whose predictive power out-of-distribution varies very widely.

There are indeed some domains in which it’s very weak (like ecology), but also some in which it’s pretty strong (like chemistry). There are some reasons to think interpretability will be more like the former (networks are very complicated!) and some reasons to think it’ll be more like the latter (experiments with networks are very reproducible). I don’t think this is the type of thing which can be predicted very well in advance, because it’s very hard to know what kinds of fundamental breakthroughs may arise.
More generally, the notion of “computational irreducibility” doesn’t seem very useful to me, because it takes a continuous property (some systems are easier or harder to make predictions about) and turns it into a binary property (is it computationally reducible or not), which I think obscures more than it clarifies.
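To illustrate the reducible end of the spectrum with a toy example (again purely illustrative): for a linear system, the state at step N can be computed with roughly log2(N) matrix multiplications instead of N simulation steps. The interesting question for most real systems isn’t whether such an exact shortcut exists, but how good the available approximate shortcuts are.

```python
import numpy as np

# A linear system x_{n+1} = A @ x_n sits at the "reducible" extreme:
# the state at step N is A**N @ x0, and np.linalg.matrix_power computes
# A**N by repeated squaring in ~log2(N) matrix multiplications.
A = np.array([[0.9, 0.2],
              [-0.1, 0.8]])
x0 = np.array([1.0, 0.0])
N = 50

x_shortcut = np.linalg.matrix_power(A, N) @ x0  # jump straight to step N

x_sim = x0
for _ in range(N):  # brute-force simulation, step by step
    x_sim = A @ x_sim

assert np.allclose(x_shortcut, x_sim)  # same answer, exponentially cheaper
```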