I saw this paper and wanted to get really excited about it with y’all. I want more of a chatty atmosphere here; I have lots to say and want to debate many papers. some thoughts:
seems to me that there are true shapes to the behaviors of physical reality[1]. we can in fact find ways to verify assertions about them[2]; it’s going to be hard, though. we need to be able to scale interpretability to the point that we can check for implementation bugs automatically and reliably. to get more interpretable sparsity, I think we need models 100x larger doing the same thing, so that every subnetwork performs one coherent operation on only its inputs, with no interference. then we can pass type information in from sensors and formally verify that the learned network’s components coordinate in ways that propagate energy while conserving the property at every step. that basic component would then free us from any adversarial examples against that property. we might even be able to constrain the architecture by the property, so that a broken representation can’t even pass through.
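to make the kind of automated check I mean concrete, here’s a minimal sketch (not from any paper above; the tiny network, its weights, and the “output stays nonnegative” invariant are all hypothetical stand-ins): interval bound propagation through each component gives a sound, if loose, certificate that no input within a small box can violate the property.

```python
import numpy as np

# toy 2-layer ReLU "subnetwork"; weights are hypothetical stand-ins for a learned component
W1 = np.array([[1.0, -0.5], [0.3, 0.8]]); b1 = np.array([0.1, -0.2])
W2 = np.array([[0.7, 0.6]]);              b2 = np.array([0.05])

def interval_affine(lo, hi, W, b):
    """Propagate an axis-aligned box through x -> Wx + b (sound, possibly loose)."""
    Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def certify_nonnegative(x, eps):
    """Check that every input within L-inf distance eps of x keeps the output >= 0."""
    lo, hi = x - eps, x + eps
    lo, hi = interval_affine(lo, hi, W1, b1)
    lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)  # ReLU is monotone, so bounds pass through
    lo, hi = interval_affine(lo, hi, W2, b2)
    return bool(lo[0] >= 0.0), (float(lo[0]), float(hi[0]))

print(certify_nonnegative(np.array([1.0, 1.0]), eps=0.1))  # (True, (~0.84, ~1.18))
```

the point isn’t the toy numbers; it’s that the check is per-component and compositional, which is exactly what the sparsity/interpretability step is supposed to buy us: if each subnetwork exposes a sound bound on how it transforms the property, the whole network’s guarantee falls out of composing those bounds.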
given the ability to error-check a property precisely, we can then talk about formally verifying coordination systems. this is where open problems in open source game theory come in. when models can formally verify things about each other, what happens? would the models still cooperate with models they can’t verify are being honest about their source code? how do we avoid sudden weird-as-fuck domain generalization errors that arise from the difference between agents that can be verified and agents that cannot?
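to make that question concrete, here’s the standard open-source prisoner’s dilemma toy (program equilibrium; not from the paper, and the agents are the classic CliqueBot/DefectBot caricatures): agents that can read each other’s source can condition cooperation on what they read, and agents whose source they can’t match fall into a different regime entirely.

```python
import inspect

# one-shot open-source prisoner's dilemma: each agent is handed the other's source code.
def clique_bot(opponent_source: str) -> str:
    """Cooperate only with agents whose source is literally identical to mine."""
    my_source = inspect.getsource(clique_bot)
    return "C" if opponent_source == my_source else "D"

def defect_bot(opponent_source: str) -> str:
    """Defect unconditionally."""
    return "D"

def play(agent_a, agent_b):
    src_a, src_b = inspect.getsource(agent_a), inspect.getsource(agent_b)
    return agent_a(src_b), agent_b(src_a)

print(play(clique_bot, clique_bot))  # ('C', 'C'): mutual verification -> cooperation
print(play(clique_bot, defect_bot))  # ('D', 'D'): unverifiable/alien source -> defection
```

the brittleness is the point: clique_bot defects against an agent that is semantically identical but textually different, which is exactly the sharp generalization cliff between verifiable and unverifiable agents I’m worried about. the Löbian-cooperation / FairBot line of work is the better-behaved version, where you cooperate if you can prove the other agent cooperates with you.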
so that means using the very best of formal verification to check that there aren’t bad coordination patterns in the network (if you take every neuron to be a separate module). what statement can you check that doesn’t leave the statement’s fingerprints on the agents? it seems like it’s something about providing freedom from unwanted aesthetic interference, which means every subcomponent of an ai you’re creating needs a way to calculate whether a behavior would satisfy the usefulness constraints that nearby agentic shapes in the universe want out of it. there are many types of reasoning errors that misrepresent the state of a material system, but if you can accurately simulate a cell and verify a statement about its behavior, you can formally verify whether the cell dies.
I think one key litmus test of any alignment idea is whether it’s easy to explain how it also aligns a coordination network of human cells against cancer, or a network of humans against . on a really deep level, I don’t think alignment is anything more than the process of trying to solve evolutionary game theory among [ais, humans]. historically, under high death rates, many humans have picked defect strategies[3]; when we compare the “the ai does not love you, nor does it hate you, but you are made of atoms the ai can use for something else” quote to hate between humans, it seems like with humans, hate is when one wants to use those humans for something besides their own life-aesthetic desires! love for a person or thing is a behavior in the territory, as long as the one doing the acting has the capability to accomplish the behavior.
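a minimal sketch of the evolutionary-game-theory framing, with hypothetical payoffs (this is just textbook replicator dynamics on a prisoner’s dilemma, not anything from the paper): under these payoffs defection takes over, and the alignment question becomes which mechanisms (verification, reputation, constraint) change the payoff structure so that cooperation is the stable attractor instead.

```python
import numpy as np

# hypothetical payoff matrix for the row player over strategies (Cooperate, Defect)
PAYOFF = np.array([[3.0, 0.0],
                   [5.0, 1.0]])

def replicator_step(x, dt=0.01):
    """One Euler step of replicator dynamics: a strategy's share grows in
    proportion to how much better it does than the population average."""
    fitness = PAYOFF @ x
    average = x @ fitness
    return x + dt * x * (fitness - average)

x = np.array([0.9, 0.1])      # start with 90% cooperators
for _ in range(5000):
    x = replicator_step(x)
print(x)                       # defectors end up dominating under these payoffs
```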
the question is what pruning would occur after these coming capability advancements. my hope is that the pruning is sub-organism edits as much as possible, that all edits are by consent and work by simply constraining the violators, and that we preserve all extant life-self-agents, even if we don’t give every one of them as much control over other matter. the question then is: given the possibility that some subagents will act adversarially, how do we ensure the overall network of agents can detect malice and demand the malicious agent be interacted with using malice-safety gloves until it has self-modified into a mental shape that reduces malice?
(will edit to improve citations, check back in a couple hours if you don’t want to hunt down each paper by name)
[1] many things to cite for why I think this: neural radiance fields’ 3d prior; relative representations paper linked above; quantum/quantum chemistry/fluid/multimaterial/coupled-dynamical-systems simulators of various shapes; geometry of neural behavior video; cybernetics; systems science;
[2] many more things to cite for why I think this:
[3]
I’ll contribute and say, this is good news, yet let’s be careful.
My points as I see them:
You are notably optimistic about formally verifying properties in extremely complex domains. This is the use case for a superhuman theorem prover, and you may well be right. It may be harder than you think, though.
If true, the natural abstraction hypothesis is completely correct, although that doesn’t remove all the risk (though mesa-optimizers can be dealt with).
I’m excited to hear your thoughts on this work, as well.
It will be at least as hard as simulating a human to prove something through one, but I think you can simplify the scenarios you need to prove things about. my view is that the key proof we end up caring about probably won’t be much more complicated than the ones about the optimality of diffusion models (which are not very strong statements). I expect there will be some similar diffusion-like objective that we want to prove things about in order to maximize safe intelligence while proving away unsafe patterns.
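for concreteness, the flavor of statement I mean by “optimality of diffusion models” is roughly the standard variational bound (my paraphrase of the usual DDPM-style result, not something from the linked paper): the negative log-likelihood is upper-bounded by the variational term, and the usual denoising training loss is a reweighted surrogate for that bound, so minimizing it is provably not the wrong thing to do, even though that’s a weak guarantee.

$$-\log p_\theta(x_0) \;\le\; \mathbb{E}_{q}\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}\mid x_0)}\right], \qquad L_{\text{simple}}(\theta)=\mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\lVert \epsilon-\epsilon_\theta(x_t,t)\rVert^2\right].$$

the safety analogue I’m gesturing at is a similarly shaped statement: a tractable surrogate loss whose minimization provably bounds the thing we actually care about.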
is there an equivalent of the diffusion objective that:
can be stated about arbitrary physical volumes,
acts as a generalized model of agentic coprotection and co-optionality between any arbitrary physical volumes,
later, when it starts working more easily, adversarial margins can be generated for this diffusion++ metric, and thereby used to prove there are no adversarial examples closer than a given distance (a minimal sketch of that margin-to-radius step is below)
then this would allow propagating trust reliably out through the sensors and reaching consensus that there’s a web of sensors with justified true belief that they’re being friendly with their environments.
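here’s the margin-to-radius sketch referenced above, assuming (hypothetically) that the diffusion++ metric is a scalar score with a known Lipschitz bound; every number and name here is made up for illustration:

```python
def certified_radius(score: float, threshold: float, lipschitz: float) -> float:
    """If the scoring function is L-Lipschitz and its value exceeds the pass
    threshold by a margin m, no perturbation of size < m / L can flip the verdict."""
    margin = score - threshold
    return max(margin, 0.0) / lipschitz

# hypothetical numbers: a sensor's "friendliness" score, the pass threshold,
# and a sound Lipschitz constant for the scoring network
print(certified_radius(score=0.82, threshold=0.5, lipschitz=4.0))  # 0.08
```

the hard part is of course getting a sound Lipschitz (or any sound) bound on a big learned network, which is why I think the interpretability-for-sparsity step has to come first.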
[posted to shortform due to incomplete draft]
I’m still trying to figure out what my thoughts are on open source game theory and neural networks, though. I saw there are already follow-ups to this, and proving through those could start to directly impact the sort of decision theory stuff MIRI is always yelling at a cloud about: https://www.semanticscholar.org/paper/Off-Belief-Learning-Hu-Lerer/6f7eb6062cc4e8feecca0202f634257d1752f795