I’ll contribute and say: this is good news, but let’s be careful.
My points as I see them:
You are notably optimistic about formally verifying properties in extremely complex domains. This is the use case for a superhuman theorem prover, and you may well be right. It may be harder than you think, though.
If true, the natural abstraction hypothesis is completely correct, although that doesn’t remove all the risk (though mesa-optimizers can be dealt with).
I’m excited to hear your thoughts on this work, as well.
It will be at least as hard as simulating a human to prove through one, but I think you can simplify the scenarios you need to prove things about. My view is that the key proof we end up caring about will probably not be much more complicated than the ones about the optimality of diffusion models (which are not very strong statements). I expect there will be some similar thing, like diffusion, that we want to prove in order to maximize safe intelligence while proving away unsafe patterns.
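For concreteness, the flavor of statement I mean: roughly the known likelihood bound for score-based diffusion (in the spirit of Song et al. 2021, eliding regularity conditions), where $p_t$ is the forward-process marginal at time $t$, $s_\theta$ the learned score, $g$ the diffusion coefficient, and $\pi$ the prior:

```latex
\mathrm{KL}\left(p_0 \,\|\, p_0^\theta\right)
  \;\le\; \mathrm{KL}\left(p_T \,\|\, \pi\right)
  + \frac{1}{2}\int_0^T g(t)^2\,
    \mathbb{E}_{x_t \sim p_t}\!\left[
      \bigl\| \nabla_{x_t}\log p_t(x_t) - s_\theta(x_t, t) \bigr\|_2^2
    \right]\mathrm{d}t
```

It only bounds how far the learned model can drift from the data distribution in terms of a measurable training error; weak, but checkable, which is the sense of “not very strong statements” above.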
Is there an equivalent for diffusion that:
- can be stated about arbitrary physical volumes,
- acts as a generalized model of agentic co-protection and co-optionality between arbitrary physical volumes, and
- later, once it starts working more easily, admits generated adversarial margins for this diffusion++ metric, which can then be used to prove that no adversarial examples exist closer than a given distance (see the sketch after this list)?
That would then allow propagating trust reliably out through the sensors, and reaching consensus that there’s a web of sensors with justified true belief that they’re being friendly with their environments.
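To make the last two steps less hand-wavy, here is a minimal sketch, assuming only the standard Lipschitz-margin certificate (if each logit of a classifier is L-Lipschitz in l2, no perturbation smaller than margin/(2L) can flip the top class), plus a toy rule that trust spreads from directly-audited sensors to neighbors whose readings carry a certified radius above a threshold. Everything here (`certified_radius`, `propagate_trust`, the sensor graph) is hypothetical illustration, not an existing system:

```python
import numpy as np
from collections import deque

def certified_radius(logits, lipschitz_const):
    """Standard Lipschitz-margin certificate (cf. Tsuzuku et al. 2018):
    if every logit is L-Lipschitz in the input (l2 norm), then no
    perturbation smaller than margin / (2 * L) can flip the top class."""
    runner_up, top = np.sort(logits)[-2:]
    return (top - runner_up) / (2.0 * lipschitz_const)

def propagate_trust(edges, readings, lipschitz_const, min_radius, roots):
    """Toy rule: trust starts at directly-audited sensors (`roots`) and
    spreads along graph edges only to neighbors whose own readings are
    certified out to at least `min_radius`."""
    trusted = set(roots)
    queue = deque(roots)
    while queue:
        sensor = queue.popleft()
        for neighbor in edges.get(sensor, []):
            if (neighbor not in trusted and
                    certified_radius(readings[neighbor], lipschitz_const) >= min_radius):
                trusted.add(neighbor)
                queue.append(neighbor)
    return trusted

# Toy usage: a chain of three sensors, the first one audited by hand.
readings = {
    "s0": np.array([0.1, 3.0]),   # confident reading -> large margin
    "s1": np.array([0.2, 2.5]),
    "s2": np.array([1.4, 1.5]),   # near the boundary -> tiny certified radius
}
edges = {"s0": ["s1"], "s1": ["s2"]}
print(propagate_trust(edges, readings, lipschitz_const=1.0,
                      min_radius=0.5, roots=["s0"]))
# -> {'s0', 's1'}; s2's margin is too small to certify, so trust stops there.
```

The real version would need the margin to be over the diffusion++ metric rather than raw logits, which is exactly the part that doesn’t exist yet.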
I’m still trying to figure out what my thoughts are on open-source game theory and neural networks, though. I saw there are already follow-ups to this, and proving through these could start to directly impact the sort of decision theory work MIRI is always yelling at a cloud about: https://www.semanticscholar.org/paper/Off-Belief-Learning-Hu-Lerer/6f7eb6062cc4e8feecca0202f634257d1752f795