Note that the “thin layer” is something you need to do regardless. Suppose your machine observes a situation, either through robotics or in some video it found online. Say the situation is an object being dropped: the physics sim predicts a fairly vigorous bounce, but the actual object splats.
You need some way to correct the physics sim’s predictions to correspond with actual results, and the simplest way is a neural network that learns the mapping (real-world input, physics engine prediction) -> (real-world observation).
You then have to use the predicted real-world observation, or the probability distribution over predicted observations, when your machine reasons about what it should do.
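For concreteness, here is a minimal sketch of such a correction network in PyTorch. The names, dimensions, and random stand-in data are illustrative assumptions of mine, not anything specified in the thread; the point is just the shape of the mapping (real-world input, physics engine prediction) -> (real-world observation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimCorrector(nn.Module):
    """Maps (real-world input, physics-engine prediction) -> predicted real-world observation."""
    def __init__(self, obs_dim: int, pred_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + pred_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, pred_dim),
        )

    def forward(self, real_input: torch.Tensor, sim_prediction: torch.Tensor) -> torch.Tensor:
        # Condition on both the observed situation and the simulator's guess,
        # and predict what will actually be observed.
        return self.net(torch.cat([real_input, sim_prediction], dim=-1))

# One training step on logged (input, simulator prediction, actual outcome) triples.
corrector = SimCorrector(obs_dim=32, pred_dim=16)
opt = torch.optim.Adam(corrector.parameters(), lr=1e-3)

real_input = torch.randn(64, 32)   # stand-in for the encoded camera/robot state
sim_pred = torch.randn(64, 16)     # stand-in for the physics engine's prediction
actual = torch.randn(64, 16)       # stand-in for what actually happened (e.g. the splat)

loss = F.mse_loss(corrector(real_input, sim_pred), actual)
opt.zero_grad()
loss.backward()
opt.step()
```

To get a distribution over predicted observations rather than a point estimate, the same network could output distribution parameters (e.g. a mean and variance) instead of a single corrected prediction.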
I realized the reference to a “thin layer” is ambiguous in my post; I just wanted to confirm whether you were referring to the general case (“thin model, fat services”) or to the specific safety question at the bottom (“is it possible to have a thin mapping layer on top of your Physics simulator that somehow subverts or obfuscates it?”). My child reply assumed the former, but on re-reading I suspect the latter is more likely?
Thinking about ways in which this safety margin could break; is it possible to have a thin mapping layer on top of your Physics simulator that somehow subverts or obfuscates it?
I suppose that a mapping task might fall under the heading of a mesa-optimizer, where what it is doing is optimizing for fidelity between the outputs of the language layer and the inputs of the physics layer. This would be in addition to the mesa-optimization going on just in the physics simulator. Working title:
CAIS: The Case For Maximum Mesa Optimizers
With “thin model / fat service” I’m attempting to contrast with the typical end-to-end model architecture, where there is no separate “physics simulator”; instead the model just learns its own physics model, embedded alongside all the other relations it has learned. So under that dichotomy, I think there is no “thin layer in front of the physics simulation” in an end-to-end model, as any part of the physics simulator can connect to, or be connected to from, any other part of the model.
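As a sketch of what that boundary looks like under “thin model / fat services” (all of the names below are hypothetical, invented purely to illustrate the shape of the interface): the physics call becomes an explicit, typed query to a separate service, rather than something entangled with the rest of the model’s weights.

```python
from dataclasses import dataclass
from typing import Protocol

# Hypothetical interface, for illustration only.

@dataclass
class PhysicsQuery:
    positions: list[float]
    velocities: list[float]
    horizon_s: float          # how far ahead to simulate, in seconds

@dataclass
class PhysicsPrediction:
    positions: list[float]
    velocities: list[float]

class PhysicsService(Protocol):
    def predict(self, query: PhysicsQuery) -> PhysicsPrediction: ...

def plan_step(thin_model, observation, physics: PhysicsService):
    # The thin model only translates the observation into an explicit query and
    # consumes the answer; the simulation itself lives in the separate service,
    # and both the query and the result can be logged and inspected.
    query = thin_model.encode(observation)             # hypothetical encoder
    prediction = physics.predict(query)
    return thin_model.decide(observation, prediction)  # hypothetical decision head
```

In an end-to-end model there is no analogue of `PhysicsQuery` to look at; the equivalent of that boundary is distributed across the weights.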
In such an end-to-end model, it’s really hard to figure out where the “physics simulator” (such as it is) even lives, and what kind of ontology it uses. I’m explicitly thinking of the ELK paper’s Ontology Identification problem here. How do we know the AI is being honest when it says “I predict that object X will be in position Y at time T”? If you can interpret its physics model, you can see what simulations it is running and know whether it’s being honest.
It’s possible, if we completely solve interpretability, to extract the end-to-end model’s physics simulator, I suppose. But if we have completely solved interpretability, I think we’re most of the way towards solving alignment. On the other hand, if we use Services (e.g. MuJoCo for Physics) then we already have interpretability on that service, by construction, without having to interpret whatever structures the model has learned to solve physics problems.
You need some way to correct the physics sim’s predictions to correspond with actual results
Sure, in some sense an end-to-end model needs to normalize inputs, extract features, and so on, before invoking its physics knowledge. But the point I’m getting at is that with Services we know for sure what the model’s representation has been normalized to (since we can just look at the data passed to MuJoCo), whereas this is (currently) opaque in an end-to-end model.
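To illustrate the “just look at the data passed to MuJoCo” point: with the open-source MuJoCo Python bindings, the entire state handed to the simulator is a small set of arrays of plain numbers that can be logged and inspected at the service boundary. The toy scene below is my own minimal example, not anything from the post.

```python
import mujoco  # open-source MuJoCo Python bindings

# Toy scene: a ball dropped onto a plane.
XML = """
<mujoco>
  <worldbody>
    <geom type="plane" size="5 5 0.1"/>
    <body pos="0 0 1">
      <freejoint/>
      <geom type="sphere" size="0.1" mass="1"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)

# The full state handed to the simulator is just these arrays, so the
# "representation" at the service boundary is directly inspectable.
print("qpos (generalized positions):", data.qpos)
print("qvel (generalized velocities):", data.qvel)

for _ in range(200):
    mujoco.mj_step(model, data)

print("ball position after 200 steps:", data.qpos[:3])
```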
Note that you would probably never use such a “model learns its own physics” approach as the solution in any production AI system, either for the reason you gave or, more pedantically, because as long as rival architectures with a human-written physics system at the core perform more consistently in testing, you should never pick the less reliable solution.