Epistemic status: no claims to novelty, just (possibly) useful terminology.
[EDIT: I increased all the class numbers by 1 in order to admit a new definition of “class I”, see child comment.]
I propose a classification of AI systems based on the size of the space of attack vectors. This classification can be applied in two ways: as referring to the attack vectors a priori relevant to the given architectural type, or as referring to the attack vectors that were not mitigated in the specific design. We can call the former the “potential” class and the latter the “effective” class of the given system. In this view, the problem of alignment is designing potential class V (or at least IV) systems that are effectively class 0 (or at least I-II).
Class II: Systems that only ever receive synthetic data that has nothing to do with the real world
Examples:
AI that is trained to learn Go by self-play
AI that is trained to prove random mathematical statements
AI that is trained to make rapid predictions of future cell states in the game of life for random initial conditions
AI that is trained to find regularities in sequences corresponding to random programs on some natural universal Turing machine with bounded runtime
Class II systems by and large don’t admit any attack vectors. [EDIT: Inaccurate, see child comment]
Such systems might have higher effective class if bugs in the implementation lead to real-world data leaking into the system, or if the ostensibly synthetic data reveals something important about the world via the choices made by its designers (for example, a video game with complex rules inspired by the real world).
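As a concrete illustration of the Game of Life example above, here is a minimal sketch (mine, in Python; not part of the original proposal) of a purely synthetic data generator: every training pair comes from random initial conditions, and nothing in the data refers to the real world except through the designers’ choice of rules.

```python
import numpy as np

def life_step(grid: np.ndarray) -> np.ndarray:
    """One step of Conway's Game of Life on a toroidal (wrap-around) grid."""
    # Count the 8 neighbors of every cell by summing shifted copies of the grid.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # A cell is alive next step iff it has exactly 3 live neighbors,
    # or it is currently alive and has exactly 2 live neighbors.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(np.uint8)

def sample_training_pair(size=32, steps=8, rng=None):
    """Return (initial_state, state_after_steps) from a random initial condition."""
    rng = rng or np.random.default_rng()
    initial = (rng.random((size, size)) < 0.5).astype(np.uint8)
    state = initial
    for _ in range(steps):
        state = life_step(state)
    return initial, state  # the learner is trained to predict `state` from `initial`
```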
Class III: Systems for which there is no distribution shift between training and deployment, and also the label space is small
Example: AI that is trained to distinguish between cat images and dog images, and the images selected for training are a perfectly representative sample of images used in deployment.
Class III systems admit attacks by non-Cartesian daemons.
If the label space is large, a Cartesian daemon can choose a low probability for randomly producing a malign label, such that there is a significant probability that this won’t happen in training but will happen in deployment. This moves the system to class IV. If there is distribution shift, a Cartesian daemon can distinguish between training and deployment and use it to perform a “treacherous turn” attack. This also moves the system to class IV.
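As a back-of-the-envelope illustration of this “rare random defection” strategy (my own sketch, with made-up numbers), suppose the daemon outputs a malign label independently with some small probability p on each query:

```python
def attack_success_prob(p: float, n_train: int, n_deploy: int) -> float:
    """Probability that a daemon defecting independently with probability p
    per query stays silent for all of training but defects at least once
    during deployment."""
    survives_training = (1 - p) ** n_train
    strikes_in_deployment = 1 - (1 - p) ** n_deploy
    return survives_training * strikes_in_deployment

# Hypothetical numbers: 10k training queries, 1M deployment queries,
# and the daemon tunes p to roughly 1 / n_train.
n_train, n_deploy = 10_000, 1_000_000
print(attack_success_prob(1 / n_train, n_train, n_deploy))  # ≈ 0.37
```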
Such systems have lower effective class if non-Cartesian daemons are mitigated, for example by well-designed applications of homomorphic cryptography. They have higher effective class if deployed in a setting which does involve distribution shift, perhaps unanticipated by the designers.
Class IV: Systems which are non-agentic but do involve distribution shift or large label space
Examples:
AI that learns to imitate humans
AI that learns to predict the stock market
Generative language models
Class IV systems admit attacks by Cartesian and non-Cartesian daemons. [EDIT: Also attack from counterfactuals. The latter requires a large label space and doesn’t require a distribution shift per se.]
Such systems have lower effective class if Cartesian daemons are mitigated, for example by carefully shaping the prior / inductive bias and applying some sort of confidence threshold / consensus algorithm. They can be effective class V if not designed to avoid self-fulfilling prophecies and/or incentives to mispredict at present to improve prediction in the future.
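As a cartoon of what a “confidence threshold / consensus algorithm” could look like operationally (my sketch only; `models` and `.predict` are hypothetical placeholders, and the actual proposals are formulated over hypotheses in the prior rather than over an ensemble of trained models):

```python
from collections import Counter

def consensus_predict(models, x, threshold=0.9):
    """Return a label only when a large fraction of independently trained
    models agree on it; otherwise abstain (return None).

    The hope is that a malign hypothesis which mispredicts on some rare
    "treacherous" input gets outvoted by the benign majority, at the cost
    of occasionally abstaining."""
    votes = Counter(model.predict(x) for model in models)
    label, count = votes.most_common(1)[0]
    if count / len(models) >= threshold:
        return label
    return None  # abstain, e.g. defer to a human or a safe fallback policy
```

This toy version of course ignores the interesting part, namely how the threshold interacts with the prior probability mass of malign hypotheses.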
Class V: Agentic systems
Examples:
AI that trades in the stock market
AI that optimizes long-term ad revenue
AI that defends the world against unaligned AIs
Class V systems admit attacks by daemons but are also dangerous by default due to divergence of their utility function from the human utility function.
Such systems can have lower effective class if the utility function is very cleverly designed, for example to reliably accomplish learning of human values.
The idea comes from this comment of Eliezer.

Class II or higher systems might admit an attack vector by daemons that infer the universe from the agent’s source code. That is, we can imagine a malign hypothesis that makes a treacherous turn after observing enough past actions to infer information about the system’s own source code, and then infers the physical universe from that. (For example, in a TRL setting it can match the actions to the output of a particular program for the envelope.) Such daemons are not as powerful as malign simulation hypotheses, since their prior probability is not especially large (compared to the true hypothesis), but it might still be non-negligible. Moreover, it is not clear whether the source code can realistically contain enough information to enable an attack, but the opposite is not entirely obvious either.
To account for this, I propose to designate as class I those systems which don’t admit this attack vector. In the potential sense, it means that either (i) the system’s design is too simple to enable inferring much about the physical universe, or (ii) there is no access to past actions (including opponent actions for self-play), or (iii) the label space is small, which means an attack requires making many distinct errors, and such errors are penalized quickly. And of course, it requires no direct access to the source code.
We can maybe imagine an attack vector even for class I systems, if most metacosmologically plausible universes are sufficiently similar, but this is not very likely. Nevertheless, we can reserve the label class 0 for systems that explicitly rule out even such attacks.