a comment I decided to post out of context for now, since it’s rambling:
formal verification is a type of execution that can backtrack in response to model failures. you’re not wrong, but formally verifying a neural network is possible: the strongest forms of adversarial resistance we have are formal verification and diffusion. both can protect a margin to the decision boundary of a linear subnetwork of an NN; the formal approach can do it with zero error, but needs fairly well-trained weights to finish efficiently. the problem is that any network capable of complex behavior is likely representing a system complex enough that you can’t, even in principle, verify the whole thing, because there is no correct simulation that is friendly to your verifier: reality just has too much entropy, so you can’t check a property of a physics model of that region.
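to make "protect a margin to the decision boundary" concrete: below is a minimal sketch of the certified-margin idea using interval bound propagation on a toy ReLU net. this is my own illustration, not anything from the paper I still need to cite; IBP is the sound-but-loose end of this family, and the zero-error result I mean comes from the complete solvers, which are the ones that need well-trained weights to finish in reasonable time.

```python
import numpy as np

def interval_linear(W, b, c, r):
    """propagate a box (center c, radius r) through x -> Wx + b."""
    return W @ c + b, np.abs(W) @ r

def interval_relu(c, r):
    """propagate the box through an elementwise ReLU."""
    lo, hi = np.maximum(c - r, 0.0), np.maximum(c + r, 0.0)
    return (lo + hi) / 2.0, (hi - lo) / 2.0

def certified_margin(layers, x, eps, true_class):
    """sound lower bound on (logit of true_class minus best other logit) over the
    L-infinity ball of radius eps around x; positive means the prediction provably
    cannot be flipped anywhere inside that ball."""
    c, r = x.astype(float), np.full(x.shape, float(eps))
    for i, (W, b) in enumerate(layers):
        c, r = interval_linear(W, b, c, r)
        if i < len(layers) - 1:  # ReLU on every hidden layer
            c, r = interval_relu(c, r)
    lo, hi = c - r, c + r
    return lo[true_class] - np.delete(hi, true_class).max()

# toy usage: a random 2-layer net on a 4-d input
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 4)), rng.normal(size=8)),
          (rng.normal(size=(3, 8)), rng.normal(size=3))]
x = rng.normal(size=4)
hidden = np.maximum(layers[0][0] @ x + layers[0][1], 0.0)
pred = int(np.argmax(layers[1][0] @ hidden + layers[1][1]))
print(certified_margin(layers, x, eps=0.05, true_class=pred))
```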
but uncheckability is itself often a trait of overly chaotic regions of a latent phase space. multiscale modeling with local internal constraints might be able to scale this kind of formal adversarial-example resistance. alternatively, if we can encode a friendliness-seeking process via diffusion, that works too. [cite “gradient descent is not all you need”; cite the formal verification paper]
in both cases the question is: is there a metric on the internal function of the network that can represent some minimum desiderata necessary to get our friendliness-seeking process within range of our verifier? e.g. an l1 or soft-l0 loss, perhaps weighted by a game-theoretic analysis (I still need to look up what a shapley value actually is).
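for my own reference, this is roughly the level of concreteness I have in mind for such a metric. everything here (the two penalties, the coalition value function) is a stand-in I picked for illustration, not something taken from a specific paper:

```python
import numpy as np

def l1_penalty(weight_matrices):
    """plain l1: pushes individual weights toward exactly zero."""
    return sum(np.abs(W).sum() for W in weight_matrices)

def soft_l0_penalty(weight_matrices, beta=0.1):
    """smooth surrogate for the l0 count of nonzero weights:
    tanh(|w|/beta) is ~1 for any clearly nonzero weight and 0 at zero."""
    return sum(np.tanh(np.abs(W) / beta).sum() for W in weight_matrices)

def shapley_estimate(value_fn, n_players, n_samples=200, seed=0):
    """monte carlo shapley values: average each player's marginal contribution
    over random orderings; value_fn maps a set of player indices to a score."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(n_players)
    for _ in range(n_samples):
        coalition, prev = set(), value_fn(set())
        for p in rng.permutation(n_players):
            coalition.add(int(p))
            cur = value_fn(coalition)
            phi[p] += cur - prev
            prev = cur
    return phi / n_samples

# sanity check on an additive toy game, where each player's shapley value
# is exactly its own weight
w = np.array([0.5, 1.0, 2.0])
print(shapley_estimate(lambda S: sum(w[i] for i in S), n_players=3))

# and the sparsity metrics on a random weight matrix
W = np.random.default_rng(1).normal(size=(4, 4))
print(l1_penalty([W]), soft_l0_penalty([W]))
```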
ultimately you want your system built out of parts that can each be verified to error out if used in symbolic behavior they detect as having unfriendly dynamics. so what defines unfriendly dynamics? this is where the dynamical-systems view of safety comes in [citation needed], and it also relates well to memory safety, see rustlang. I need to look more into what the folks talking about shard safety are getting at, but I suspect it’s a similar concept: you want your network of material interactions, whether inside a neural network or outside of it, to seek efficient simplicity in order to encode the same thing; the efficiency prior, i.e. compression applied to total compute, also partially optimizes for non-interference.
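“verified to error out” is doing a lot of work in that sentence; at minimum I picture a runtime monitor wrapping each part, where the invariant it checks is whatever the verifier actually certified. a toy sketch, with the invariant (bounded state, non-increasing quadratic “energy”) entirely made up for illustration:

```python
class UnfriendlyDynamicsError(RuntimeError):
    """raised when a component detects it is being driven outside its verified envelope."""

class GuardedComponent:
    """wraps a step function with a runtime check on a (hypothetical) verified invariant:
    here, that the state stays in a bounded box and a quadratic 'energy' does not grow,
    a stand-in for whatever property was actually certified."""

    def __init__(self, step_fn, bound=10.0):
        self.step_fn = step_fn
        self.bound = bound

    def __call__(self, state):
        next_state = self.step_fn(state)
        if max(abs(x) for x in next_state) > self.bound:
            raise UnfriendlyDynamicsError("state left the verified bounded region")
        if sum(x * x for x in next_state) > sum(x * x for x in state) + 1e-9:
            raise UnfriendlyDynamicsError("energy increased: dynamics outside the certified envelope")
        return next_state

# usage: a contracting map passes, an expanding one errors out
shrink = GuardedComponent(lambda s: [0.9 * x for x in s])
grow = GuardedComponent(lambda s: [1.1 * x for x in s])
state = shrink([1.0, -2.0])
try:
    grow(state)
except UnfriendlyDynamicsError as e:
    print("refused:", e)
```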
so while it’s not a full basis for morality afaict, and it’s insufficient for full RL-strength safety, it seems to me we could at least bound the local direction, in world-model state space, of the local pareto frontier of efficiency-without-additional-moral-loss, by asserting that wasted energy burn is, obviously to us humans, a clear moral loss, and that that energy should instead be spent on efficient thought towards whatever your material values are. this both means lower internal interference (used as an auxiliary objective) and means the model is closer to a basic standard of morality. it also means any energy burn that clearly interferes with memory, such as fire, disease, or death, is a clear error state in our neural multiscale quantum chemistry model, unless outweighed by a better use of that energy to preserve self-process. yudkowsky seems to think this isn’t verifiable; I don’t see any reason why not, other than “it’s hard” or “I have no idea how to write a proof that everything protects itself and others”. but we should be able to verify that really big world models don’t contain self-interference behavior, and as far as I can tell, improved structured model compression should both inherently make that verification easier and inherently optimize a basic moral good of non-waste, just by making the system more efficient.
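as a placeholder for “lower internal interference used as an auxiliary objective”, something like the following is what I have in mind; the specific terms (an activation-energy penalty and a decorrelation penalty on hidden units) are stand-ins I chose for illustration, not a claim that they capture the moral content:

```python
import numpy as np

def energy_penalty(acts):
    """stand-in for 'wasted energy burn': mean squared hidden activation over the batch."""
    return float(np.mean(acts ** 2))

def interference_penalty(acts):
    """stand-in for 'internal interference': squared off-diagonal covariance between
    hidden units, i.e. units redundantly encoding the same thing."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / len(acts)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2))

def total_loss(task_loss, acts, lam=1e-3, mu=1e-3):
    """hypothetical combined objective: task loss plus the two efficiency terms."""
    return task_loss + lam * energy_penalty(acts) + mu * interference_penalty(acts)

# toy usage on random hidden activations (batch of 32, 16 hidden units)
acts = np.random.default_rng(0).normal(size=(32, 16))
print(total_loss(task_loss=0.7, acts=acts))
```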
so then the question becomes how to encode memory of other life, and rights over which life gets to continue adding memory to the global log of time. assuming we’ve verified that our physics simplifier doesn’t ever forget anything below noise (a more precise definition of information usability is needed), we still need to verify what makes a control feedback system morally significant. I suspect all control feedback systems have some degree of moral significance; the question is how much, and which processes should continue vs. simply be remembered in frozen state.