Would it satisfy your objections if the physics engine were a separate system, but that system used only neural networks or other ML hypothesis classes capable of learning from feedback? So no hardcoded physics code, just an ML system that regresses from some input representation of the world to future predictions. And then we develop the representation with autoencoders, so there is no human bias there either.
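A minimal sketch of what that proposal could look like, with a learned encoder standing in for the autoencoder-derived representation and an MLP standing in for the learned dynamics. All dimensions, layer sizes, and names here are illustrative assumptions, and the weights are random where a real system would fit them from prediction error:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, hidden, out_dim):
    """Two-layer perceptron with random weights; a real system would fit
    these by gradient descent on prediction error (feedback), not hardcode them."""
    W1 = rng.normal(size=(in_dim, hidden)) * 0.1
    W2 = rng.normal(size=(hidden, out_dim)) * 0.1
    return lambda x: np.maximum(x @ W1, 0.0) @ W2

obs_dim, latent_dim = 12, 4
encode = mlp(obs_dim, 32, latent_dim)       # autoencoder encoder: representation is learned, not hand-built
decode = mlp(latent_dim, 32, obs_dim)       # decoder, trained so decode(encode(x)) ~ x
dynamics = mlp(latent_dim, 32, latent_dim)  # the "physics": pure regression latent_t -> latent_{t+1}

def predict_next(observation):
    """No physics equations anywhere: just representation -> future prediction."""
    return decode(dynamics(encode(observation)))

print(predict_next(rng.normal(size=obs_dim)).shape)  # (12,)
```

The point is structural: `predict_next` contains no physics equations, only regression machinery, which is exactly what makes the question interesting.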
I don’t think so, no—the way I understand it, any kind of separation into separate systems falls into the CAIS sphere of thought. This is because we are measuring things from the capability side, rather than the generation-of-capability side: for example, I think it is still CAIS even if we take the exact same ML architecture and train copies of it into different specializations which then use each other as services.
There are a couple of things worth distinguishing, though:
That a sufficiently integrated CAIS is indistinguishable from a single general agent to us is what tells us CAIS isn’t safe either.
The arguments that a single general agent represents a different or more severe danger profile still motivate tracking the two paths differently.
I will also say I think this type of work could easily be part of the creation of a single general agent. If we consider Gato as the anchor for a general agent: many different tasks were captured in a single transformer, but as far as I understand it, Gato kept the input-to-task associations, which is to say the language inputs for the language tasks and the physics inputs for the physics tasks. But if the language model fed to Gato-2 uses this Mind’s Eye technique, it would be possible to do physics tasks from a language input and maybe also explain the physics tasks as a language output.
So before, it could respond to sentences with other sentences, and it could respond to equations with other equations; now it can process ancient geometry books and Newton’s Principia, which certainly use words to describe equations, and maybe even compose outputs of a similar kind.
That a sufficiently integrated CAIS is indistinguishable from a single general agent to us is what tells us CAIS isn’t safe either.
Fleshing this point out, I think one can probably make conditional statistical arguments about safety here, to define what I think you are getting at with “sufficiently integrated”.
If your model is N parameters and integrates a bunch of Services, and we’ve established that a SOTA physics model requires N*100 parameters (the OP paper suggests roughly that OOM difference), then it is likely safe to say that the model has not “re-learned” physics in some way that would render it illegible. (I’ve been thinking of this configuration as “thin model, fat modules”; maybe “thin model, fat services” fits better with the CAIS terminology.) However, another model at N*1000 parameters would have the capacity to embed a physics simulator, and could therefore effectively re-implement it in an illegible/deceptive way.
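The capacity argument above can be phrased as a simple check; the ~100x figure is the OOM gap cited in the discussion, and the function itself is purely an illustrative sketch, not a real safety test:

```python
# Capacity heuristic from the argument above: a controller far smaller than a
# SOTA physics model cannot have covertly re-learned physics; one far larger can.
PHYSICS_SOTA_PARAMS_IN_N = 100  # the ~100x gap cited above, in units of N

def can_embed_physics(model_params_in_n: float) -> bool:
    """True if the model plausibly has the capacity to re-implement the
    physics service internally (and so could do it illegibly)."""
    return model_params_in_n >= PHYSICS_SOTA_PARAMS_IN_N

print(can_embed_physics(1))     # thin model, fat services: False
print(can_embed_physics(1000))  # fat model: True, the safety margin is gone
```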
Thinking about ways in which this safety margin could break: is it possible to have a thin mapping layer on top of your Physics simulator that somehow subverts or obfuscates it without having to fully re-implement it? For example, by running two simple problems in the simulator, then merging/inverting the results in the model layer? Intuitively I suspect it’s hard to achieve much with this attack, but it’s probably not possible to rule out in general.
Note that the “thin layer” is something you need regardless. Suppose your machine observes a situation, either through robotics or in some video it found online. Say the situation is an object being dropped: the physics sim predicts a fairly vigorous bounce, but the actual object splats.
You need some way to correct the physics sim’s predictions so they correspond with actual results, and the simplest way is a neural network that learns the mapping (real-world input, physics engine prediction) -> (real-world observation).
You then have to use the predicted real-world observation, or the probability distribution over predicted observations, in your machine’s reasoning about what it should do.
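As a toy illustration of that correction layer, with a deliberately biased stand-in simulator, linear least-squares in place of the neural network, and all shapes invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim = 8

def sim_predict(x):
    """Stand-in for the physics engine: systematically wrong (the 'vigorous
    bounce' where the real object splats)."""
    return 2.0 * x

# Observed reality: the true mapping is 0.5 * x (the "splat").
X = rng.normal(size=(500, in_dim))
Y_obs = 0.5 * X

# Correction-layer input is (real-world input, engine prediction), concatenated.
features = np.hstack([X, sim_predict(X)])
# Linear least-squares stands in for the neural network that learns the mapping.
W, *_ = np.linalg.lstsq(features, Y_obs, rcond=None)

def corrected_predict(x):
    """(real-world input, physics engine prediction) -> expected observation."""
    return np.hstack([x, sim_predict(x)]) @ W

x = rng.normal(size=in_dim)
print(np.allclose(corrected_predict(x), 0.5 * x, atol=1e-6))  # True
```

The corrected prediction tracks the observed outcome rather than the simulator’s biased one, which is all the “thin layer” needs to do.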
I realized the reference to the “thin layer” is ambiguous in my post; I just wanted to confirm whether you were referring to the general case (“thin model, fat services”) or to the specific safety question at the bottom (“is it possible to have a thin mapping layer on top of your Physics simulator that somehow subverts or obfuscates it”)? My child reply assumed the former, but on consideration/re-reading I suspect the latter might be more likely?
Thinking about ways in which this safety margin could break: is it possible to have a thin mapping layer on top of your Physics simulator that somehow subverts or obfuscates it
I suppose that a mapping task might fall under the heading of a mesa-optimizer, where what it is doing is optimizing for fidelity between the outputs of the language layer and the inputs of the physics layer. This would be in addition to the mesa-optimization going on just in the physics simulator. Working title:

CAIS: The Case For Maximum Mesa Optimizers
With “thin model / fat service” I’m attempting to contrast with the typical end-to-end model architecture, where there is no separate “physics simulator”, and instead the model just learns its own physics model, embedded with all the other relations that it has learned. So under that dichotomy, I think there is no “thin layer in front of the physics simulation” in an end-to-end model, as any part of the physics simulator can connect to or be connected from any other part of the model.
In such an end-to-end model, it’s really hard to figure out where the “physics simulator” (such as it is) even lives, and what kind of ontology it uses. I’m explicitly thinking of the ELK paper’s Ontology Identification problem here. How do we know the AI is being honest when it says “I predict that object X will be in position Y at time T”? If you can interpret its physics model, you can see what simulations it is running and know whether it’s being honest.
It’s possible, if we completely solve interpretability, to extract the end-to-end model’s physics simulator, I suppose. But if we have completely solved interpretability, I think we’re most of the way towards solving alignment. On the other hand, if we use Services (e.g. MuJoCo for Physics) then we already have interpretability on that service, by construction, without having to interpret whatever structures the model has learned to solve physics problems.
You need some way to correct the physics sims predictions to correspond with actual results
Sure, in some sense an end-to-end model needs to normalize inputs, extract features, and so on, before invoking its physics knowledge. But the point I’m getting at is that with Services we know for sure what the model’s representation has been normalized to (since we can just look at the data passed to MuJoCo), whereas this is (currently) opaque in an end-to-end model.
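That transparency-at-the-boundary point can be made concrete with a logging wrapper around the service. Note that `mujoco_step` below is a hypothetical stand-in, not the real MuJoCo API:

```python
from dataclasses import dataclass, field

def mujoco_step(state, action):
    """Hypothetical stand-in for a call into a physics service such as MuJoCo;
    this is NOT the real MuJoCo API."""
    return [s + a for s, a in zip(state, action)]

@dataclass
class AuditedPhysicsService:
    """Interpretability by construction: every representation the model hands
    to the physics service is captured in a human-readable log."""
    log: list = field(default_factory=list)

    def step(self, state, action):
        self.log.append({"state": list(state), "action": list(action)})
        return mujoco_step(state, action)

svc = AuditedPhysicsService()
svc.step([0.0, 1.0], [0.5, -0.5])
print(svc.log)  # exactly what the model's representation was, no probing needed
```

In an end-to-end model there is no such boundary to instrument, which is the asymmetry being argued for here.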
Note that you would probably never use such a “model learns its own physics” approach as the solution in any production AI system. Either for the reason you gave, or, more pedantically, because so long as rival architectures with a human-written physics system at the core perform more consistently in testing, you should never pick the less reliable solution.