An AI that is connected to the internet and has access to many gadgets and points of contact can manipulate the world more effectively and thus do dangerous things more easily. However, if an AI would be considered dangerous with access to some or all of these things, it should also be considered dangerous without them, because giving such a system access to the outside world, whether accidentally or on purpose, could cause a catastrophe without any further change to the system itself. Dynamite is considered dangerous even when no burning match is held next to it. Restricting access to the outside world should instead be regarded as a potential measure to contain or control a potentially dangerous AI, and one that is inherently insecure.
My disagreement here is threefold:
The above statement appears to assume that dangerous transformative AI has already been created, whereas ‘red lines’ set through shared consensus and global regulation should prevent the creation of such AI in the first place (with a wide margin of safety to account for unknown unknowns, and for the fact that some actors will unilaterally attempt to cross the red lines anyway).
My rough sense is that the most dangerous kind of ‘general’ capabilities that could be developed in self-learning machine architectures are those that can be directed to enact internally modelled changes over physical distances, within many different contexts of the outside world. These are a different kind of capability than, e.g., containing general knowledge about facts of the world, or making calibrated predictions of the final conditions of linear or quasi-linear systems in the outside world.
Such ‘real world’ capabilities seem to need many degrees of freedom in external inputs and outputs to be iteratively trained into a model.
This is where the analogy between AI’s potential for danger and dynamite’s does not hold:
- Dynamite has explosive potential from the get-go (fortunately limited to a physical radius) but stays (mostly) chemically inert after production. It does not need further contact points of interaction with its physical surroundings to acquire this potential for human-harmful impact.
- A self-learning machine architecture gains increasing potential for wide-scale human lethality via long causal trajectories of the architecture’s internals interacting at many contact points with the outside world, i.e. through general modelling/regulatory functions that could be leveraged or repurposed to modify conditions of the outside environment in self-reinforcing loops that humans can no longer contain. The initially produced ‘design blueprint’ does not acquire this potential merely through production of the needed hardware and initialisation of model weights.
If engineers end up connecting more internet channels, sensors, and actuators for large ML model training and deployment while continuing to tinker with the model’s underlying code base, then from a control-engineering perspective they are setting up a fragile system that is prone to inducing cascading failures in the future. Engineers should, in my opinion, not be connecting up what amounts to architectures that open-endedly learn spaghetti code and autonomously enact changes in the real world. This would be engineering malpractice: practitioners grossly negligent in preventing risks to humans living everywhere around the planet.
You can have hidden functional misalignments selected for through local interactions of code internal to the architecture with its embedded surroundings. Here are arguments I wrote on that:
A model can be trained to cause effects we deem functional, but under different interactions with structural aspects of the training environment than we expected. Such a model’s intended effects are not robust to shifts in the distribution of input data it receives when deployed in new environments. Example: in deployment, a game agent ‘captures’ a wall rather than the coin it was trained to capture (which, during training, incidentally sat next to the right-most wall).
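The coin example can be reproduced in a few lines with a toy corridor environment. Everything below is illustrative (the environment, sizes, and hyperparameters are my own assumptions, and tabular Q-learning stands in for the original deep RL setup): the coin always sits at the right wall during training, so the proxy policy "always go right" earns full reward and the coin's position never needs to enter the observation. Moving the coin at deployment then exposes the proxy.

```python
import random

random.seed(0)

N = 7                # corridor cells 0..6; the right wall is cell 6
START = 3
ACTIONS = [-1, +1]   # 0 = left, 1 = right

def step(pos, a):
    return max(0, min(N - 1, pos + a))

def train(coin=N - 1, episodes=2000, alpha=0.5, gamma=0.9):
    # Off-policy Q-learning with a purely random behaviour policy.
    # Crucially, the state is the agent's position only: because the coin
    # is always at the right wall during training, "where the coin is"
    # never needs to be represented for the policy to get full reward.
    Q = [[0.0, 0.0] for _ in range(N)]
    for _ in range(episodes):
        pos = START
        for _ in range(50):
            ai = random.randrange(2)
            nxt = step(pos, ACTIONS[ai])
            r = 1.0 if nxt == coin else -0.01
            target = r if nxt == coin else r + gamma * max(Q[nxt])
            Q[pos][ai] += alpha * (target - Q[pos][ai])
            pos = nxt
            if pos == coin:
                break
    return Q

def deploy(Q, coin, steps=20):
    pos, path = START, []
    for _ in range(steps):
        ai = 0 if Q[pos][0] > Q[pos][1] else 1   # greedy learned policy
        pos = step(pos, ACTIONS[ai])
        path.append(pos)
        if pos == coin:
            return True, path
    return False, path

Q = train()                        # coin at the right wall for all of training
caught, path = deploy(Q, coin=0)   # deployment: coin moved to the left wall
print(caught, path[-1])            # the agent still runs right, past the coin
```

The learned policy is indistinguishable from "go to the coin" everywhere in the training distribution; only the shifted deployment environment reveals that what was selected for was "go right".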
Compared to side-scroller games, real-life interactions are much more dimensionally complex. If we train a Deep RL model on a high-bandwidth stream of high-fidelity multimodal inputs from the physical environment in interaction with other agentic beings, we have no way of knowing whether any hidden causal structure got selected for and stays latent even during deployment test runs… until a rare set of interactions triggers it to cause outside effects that are out of line.
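One way to see how a selected-for structure can stay latent through ordinary test runs is a small spurious-correlation sketch. It is purely illustrative (the features, noise levels, and choice of a linear perceptron in place of a Deep RL model are all my assumptions): the label is caused by a weak, noisy signal, but an incidental marker tracks the label perfectly in the training environment, so the learner leans on the marker. Held-out tests drawn from the same environment look fine; the failure surfaces only in the rare regime where the marker decouples.

```python
import random

random.seed(1)

def make_example(marker_correlated=True):
    # y depends only on the "true" cause x1; x2 is an incidental marker
    # that happens to track y in the training environment.
    y = random.choice([-1, 1])
    x1 = y * 0.5 + random.gauss(0, 1.0)          # weak, noisy true signal
    x2 = y * 1.0 if marker_correlated else -y * 1.0
    return (x1, x2), y

def train_perceptron(n=5000, lr=0.1):
    w = [0.0, 0.0]
    for _ in range(n):
        (x1, x2), y = make_example()
        pred = 1 if w[0] * x1 + w[1] * x2 > 0 else -1
        if pred != y:                            # perceptron update on mistakes
            w[0] += lr * y * x1
            w[1] += lr * y * x2
    return w

def accuracy(w, n, marker_correlated):
    hits = 0
    for _ in range(n):
        (x1, x2), y = make_example(marker_correlated)
        hits += (1 if w[0] * x1 + w[1] * x2 > 0 else -1) == y
    return hits / n

w = train_perceptron()
in_dist = accuracy(w, 2000, True)    # test runs drawn like training: looks fine
shifted = accuracy(w, 2000, False)   # rare regime where the marker flips
print(in_dist, shifted)
```

No amount of testing inside the training regime distinguishes "uses the true cause" from "uses the marker"; the latent dependence is only triggered by inputs where the two come apart.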
Core to the problem of goal misgeneralization in machine learning is that latent functions of internal code are being expressed under unknown interactions with the environment. A model that coherently overgeneralizes functional metrics over human contexts is concerning but trackable. Internal variance being selected to act out of line all over the place is not trackable.
Note that an ML model trains on signals that are coupled to existing local causal structures (as simulated on, e.g., localized servers, or as sensed within local physical surroundings). Thus, the space of possible goal structures that can be selected for within an ML model is constrained by the features derivable from data inputs received from local environments. Goals are locally selected for, and are thus partly non-orthogonal with intelligence (the two cannot vary fully independently).
The above statement appears to assume that dangerous transformative AI has already been created,
Not at all. I’m just saying that if any AI with external access would be considered dangerous, then the same AI without access should be considered dangerous as well.
The dynamite analogy was of course not meant to be a model for AI; I just wanted to point out that even an inert mass that in principle any child could play with without coming to harm is still considered dangerous, because under certain circumstances it will be harmful. Dynamite + fire = damage; dynamite w/o fire = still dangerous.
Your third argument seems to prove my point: An AI that seems aligned in the training environment turns out to be misaligned if applied outside of the training distribution. If that can happen, the AI should be considered dangerous, even if within the training distribution it shows no signs of it.
I’m just saying that if any AI with external access would be considered dangerous
I’m saying that general-purpose ML architectures would develop especially dangerous capabilities by being trained in high-fidelity and high-bandwidth input-output interactions with the real outside world.