While I don’t disagree with the reasoning, I disagree with the main thrust of this post. Under my inside view, I think that we should accept that there will be human models in AI systems and figure out how to deal with them. From an outside view perspective, I agree that work on avoiding human models is neglected, but it seems like this is because it only matters in a very particular set of futures. If you want to avoid human models, it seems that a better approach would be to figure out how to navigate into that set of futures.
Avoiding human models necessarily loses a lot of performance
(This point is similar to the ones made in the “Usefulness” and “Specification Competitiveness” sections, stated more strongly and more abstractly. It may be obvious; if so, feel free to skip to the next section.)
Consider the following framework. There is a very large space of behaviors (or even just goals) that an AI system could have, and a lot of bits of information are needed to select the behavior/goal that we actually want from our AI system. Each bit of information corresponds to halving the space of possible behaviors/goals that the AI system could have, if the AI started out as potentially having any possible behavior/goal. (A more formal treatment would consider the entropy of the distribution over possible behaviors/goals.)
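As a purely illustrative sketch of the halving intuition (the uniform prior and the specific numbers are my own assumptions, not from the post): if there are 2^N equally likely candidate behaviors, the distribution has N bits of entropy, and each fully informative bit of evidence cuts the candidate set in half.

```python
import math

def entropy_bits(probs):
    """Shannon entropy (in bits) of a distribution over candidate behaviors/goals."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform prior over 1024 candidate behaviors: 10 bits of uncertainty.
n = 1024
prior = [1 / n] * n
print(entropy_bits(prior))  # 10.0

# One fully informative bit of evidence rules out half the candidates,
# leaving a uniform distribution over 512 behaviors: 9 bits remain.
posterior = [2 / n] * (n // 2)
print(entropy_bits(posterior))  # 9.0
```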
Note that this is a very broad definition of “bits of information about the desired behavior/goal”: for example, I think that “ceteris paribus, we prefer low impact actions and plans” counts as a (relatively) small number of bits, and these are the bits that impact measures are working with.
It is also important that the bits of information are interpreted correctly by the AI system. I have said before that I worry that an impact measure strong enough to prevent all catastrophes would probably lead to an AI system that never does anything; in this framework, my concern is that the bits of information provided by an impact measure are being misinterpreted as definitively choosing a particular behavior/goal (i.e. as providing the maximum possible number of bits, rather than the relatively small number of bits they actually provide). I’m more excited about learning from the state of the world because there are more bits (since you can tell which impactful behaviors are good vs. bad), and the bits are interpreted more correctly (since they are treated as Bayesian evidence rather than as a definitive answer).
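To make this contrast concrete, here is a minimal hypothetical sketch (the two-behavior setup and the likelihood numbers are invented for illustration): the same piece of information can be folded in as Bayesian evidence, shifting probability toward low-impact behaviors, or misread as a definitive answer that collapses the distribution onto a single behavior.

```python
def bayes_update(prior, likelihood):
    """Posterior over candidate behaviors, given P(evidence | behavior)."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Two candidate behaviors: a low-impact plan and a high-impact plan.
prior = [0.5, 0.5]

# Interpreted as Bayesian evidence: "ceteris paribus, prefer low impact"
# makes the low-impact plan more likely but keeps the other one in play.
soft = bayes_update(prior, likelihood=[0.9, 0.3])   # -> [0.75, 0.25]

# Misinterpreted as definitive: the same information is treated as ruling
# out everything else, collapsing all probability onto one behavior.
hard = bayes_update(prior, likelihood=[1.0, 0.0])   # -> [1.0, 0.0]

print(soft, hard)
```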
In this framework, the most useful AI systems will be the ones that can get and correctly interpret the largest number of bits about what humans want, and that behave reasonably with respect to any remaining uncertainty about behaviors/goals. But even having a good idea of what the desired goal/behavior is means that you understand humans very well, which means that you are capable of modeling humans, which leads to all of the problems mentioned in this post. (ETA: Note that these could be implicit human models.) So, in order to avoid these problems, your AI systems need to have fewer bits about the desired goal/behavior. Such systems will not be nearly as useful and will have artificial upper limits on performance in particular domains. (Compare our current probably-rule-based Siri with something along the lines of Samantha from Her.)
(The Less Independent Audits point cuts against this slightly, but in my opinion not by much.)
Is it okay to sacrifice performance?
While it is probably technically feasible to create AI systems without human models, it does not seem strategically feasible to me. That said, there are some strategic views under which this seems feasible. The key property you need is that we do not build the most useful AI systems before we have solved issues with human models; i.e. we have to be able to sacrifice the competitiveness desideratum.
This could be done with very strong global coordination, but my guess is that this article is not thinking about that case, so I’ll ignore that possibility. It could also be done by having a single actor (or aligned group of actors) develop AGI through a discontinuous leap in capabilities, with the resulting AGI then quickly improving enough to execute a pivotal act. That actor can then unilaterally decide not to create the most useful AI systems from that point on, and prevent them from having human models.
How does current research on avoiding human models help in this scenario?
If the hope is to prevent human models after the pivotal act, that doesn’t seem to rely much on current technical research: the most significant challenge is in having a value-aligned actor create AGI in the first place, after which you could presumably take your time solving AI safety concerns. Of course, having some technical research on what to do after the pivotal act would be useful for convincing actors in the first place, but that’s a very different argument for the importance of this research, and I would expect to do significantly different things to achieve that goal.
That leads me to conclude that this research would be impactful by preventing human models before a pivotal act. This means that we need to create an AI that (with the assistance of humans) executes a plan that leads the humans + AI to take over the world, but the AI must do this without being able to consider how human society will respond to any action it takes (since that would require human models). This seems to limit you to plans that humans come up with, which can make use of specific narrow “superpowers” (e.g. powerful new technology). That seems to me particularly difficult to accomplish, but I don’t have a strong argument for it besides intuition.
It could be that all the other paths seem even more doomed; if that’s the main motivation for this then I think that claim should be added somewhere in this post.
Summary
It seems like technical AI safety research that avoids human models is especially impactful only in the scenario where a single actor uses the work to create an AI system that can execute a pivotal act without modeling humans (which usually also rests on assumptions of discontinuous AI progress and/or some form of fast takeoff). This seems particularly unlikely to me. If this is the main motivating scenario, it also places further constraints on technical safety research that avoids human models: the safety measures need to be loose enough that the AI system is still able to help humans execute a pivotal act.
Another plausible scenario would be strong global coordination around not building dangerous AI systems, including ones that have human models. I don’t have strong inside-view beliefs on that scenario, but my guess is that other people are pessimistic about it.
Some existing work that does not rely on human modelling includes the formulation of safely interruptible agents, the formulation of impact measures (or side effects), approaches involving building AI systems with clear formal specifications (e.g., some versions of tool AIs), some versions of oracle AIs, and boxing/containment.
I claim that all of these approaches appear not to rely on human modeling because they are only arguing for safety properties and not usefulness properties, and in order for them to be useful they will need to model humans. (The one exception might be tool AIs + formal specifications, but for the reasons in the parent comment I think that these will have an upper limit on usefulness.)