my objection to this objection is that for the most part, we don’t have an option not to pick the best feedback signal we have available at any given time. from a systems perspective, systems alignment only generalizes strongly if it improves capability enough for the relevant system to survive in competition with other systems. this is true at many scales of systems, but it’s always for the same reason: competition between systems means that the most adaptive approach wins. a common mistake is to assume “adaptive” means “growth/accumulation/capture rate”, but what it really means is “durability per unit efficiency”: the instrumental drive to capture resources as fast as possible is fundamentally a decision theory error made by local optimizers.
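to make the “capture rate vs. durability” distinction concrete, here’s a toy sketch (entirely my own illustration; the logistic-growth model and every number in it are assumptions, not derived from any real system): a local optimizer that harvests a regrowing resource as fast as possible crashes the stock, while a slower harvester ends up with far more total yield over a long horizon.

```python
def simulate(harvest_rate, steps=200, r=0.3, k=1.0, stock=1.0):
    """harvest a fixed fraction of a logistically regrowing resource
    each step. returns (total_yield, final_stock). toy model only."""
    total = 0.0
    for _ in range(steps):
        growth = r * stock * (1 - stock / k)  # logistic regrowth
        catch = harvest_rate * stock          # fraction captured this step
        total += catch
        stock = max(stock + growth - catch, 0.0)
    return total, stock

# the greedy strategy wins early and then starves; the patient one
# settles at a sustainable stock level and out-accumulates it
greedy_yield, greedy_stock = simulate(harvest_rate=0.9)
patient_yield, patient_stock = simulate(harvest_rate=0.1)
```

in this sketch the patient harvester’s stock converges to a stable interior equilibrium, while the greedy harvester’s stock decays geometrically toward zero — which is the sense in which maximal capture rate is a decision-theory error rather than the adaptive optimum.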
to consider a specific example of systems making this decision theory error: a limitation when gene-driving mosquitos is that if the genes you add don’t make the modified mosquitos sufficiently more adaptive, they’ll just die out. you’d need to perform some sort of trade where you offer the modified mosquitos a modified non-animal food source that only they can eat, and that somehow can’t be separated from the gene drive; in other words, you need to offer them a genetic update rule that reliably produces cooperation between species. if you can offer this, then modified mosquitos will be durably more competitive, because they have access to food sources that would poison unmodified mosquitos, and they can incrementally stop threatening humans, so humans would no longer be seeking a way to destroy the species entirely. but it only works if you can get the mosquitos to coordinate en masse: any mutation that makes a mosquito a defector against mosquito-veganism needs to be stopped in its tracks. the mosquito swarm has to reproduce away the interspecies defection strategy and then not allow it to return, while simultaneously preserving the species.
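the defection dynamic here can be sketched as a toy replicator-style model (every fitness value, the mutation rate, and the `defector_penalty` knob are made-up assumptions for illustration): cooperators beat unmodified mosquitos thanks to the exclusive food source, but unless defection carries an active fitness penalty, defector mutants inevitably take over.

```python
def evolve(defector_penalty, generations=300, mutation=1e-3):
    """toy discrete-generation selection over three mosquito types:
    unmodified, modified cooperators (exclusive food, no biting), and
    defectors (modified but bite humans anyway). all numbers invented."""
    freq = {"unmodified": 0.9, "cooperator": 0.1, "defector": 0.0}
    fitness = {
        "unmodified": 1.00,
        "cooperator": 1.10,                   # benefit of exclusive food
        "defector": 1.15 - defector_penalty,  # biting pays, unless suppressed
    }
    for _ in range(generations):
        new = {t: freq[t] * fitness[t] for t in freq}  # selection step
        leak = new["cooperator"] * mutation  # cooperators mutate to defectors
        new["cooperator"] -= leak
        new["defector"] += leak
        total = sum(new.values())
        freq = {t: v / total for t, v in new.items()}  # renormalize
    return freq

unstable = evolve(defector_penalty=0.0)  # defectors end up dominating
stable = evolve(defector_penalty=0.20)   # drive keeps defection rare
```

with no penalty, defectors strictly out-reproduce cooperators and fix in the population; with a penalty large enough to push defector fitness below the cooperator baseline, defection is held at a low mutation-selection balance while the cooperative drive sweeps to near-fixation — which is the “reproduce away the defection strategy and not allow it to return” condition in miniature.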
similarly, in most forms of ai safety, there are at least three major labs you need to convince: deepmind, openai, and <whatever is going on over in china>. there are also others that will replicate experiments, and some that will perform high-quality experiments with somewhat less compute funding. between all of them, you have to come up with a mechanism of alignment that improves capability and that is also convergent on alignment: if your alignment system doesn’t get better alignment-durability/watt as a result of capability improvement, you haven’t actually aligned anything, just papered over a problem. to some degree you can hope that one of these labs gets there first; but because capability growth is incremental, it’s looking less and less likely that there will be a single watershed moment where one lab pulls so far ahead that no competition can be mounted. and during that incremental period, defense of friendly systems needs to become stronger than inter-system attack.
(by system, again, I mean any organism or meta-organism or neuron or cell or anything in between.)
one example of a goal we need an aligned planetary system of beings to accomplish is taking enough control of the ecosystem to solve global warming. but in order to do that without screwing everything up, we need a clear picture of what forms of interference with what parts of the universe are acceptable: some clear notion of multi-tenant ownership that lets multiple subsystems interface and express what requirements they have of their adjacent systems.
I find it notable and interesting that anthropic’s recent interpretability research (the SoLU paper) focuses on isolating individual neurons’ concept ownership, so that the privileged basis keeps them from interfering with each other. I’m intentionally stretching how far I can generalize this, but I really think this direction of reasoning has something interesting to say about ownership of matter as well. local internal coherence of matter ownership is a core property of a human body that should not be violated; while it’s hard to precisely identify whether it’s been violated subtly, sudden death is an easy-to-identify example of a state transition where the local informational process of a human existing has suddenly ceased and the low-entropy complexity has been lost. at the same time, anthropic’s paper is related to previous work on compressibility: as discussed in that paper, attempting to improve interpretability ultimately boils down to improving the representation quality until it reaches a coherent, distilled structure that can be understood.
I’d argue that improvements to interpretability focused on coherent binding to physical variables are inherently connected to improving the formalizability of the functions a neural network represents. and that kind of improvement has the potential to let you bind the optimality of your main loss function more accurately to the functions you intended to optimize in the first place.
So then my question becomes—what competitive rules do we want to apply to all scales (within bacteria, within a neural network, within an insect, within a mammal, within a species, within a planet, between planets), in order to get representations at every scale that coherently describe what dynamics are acceptable interference and what are not?
again, I’m pulling together tenuous threads that I can’t quite tie properly, and some of the links might be invalid. I’m a software engineer first, research ideas generator second—and I might be seeing ghosts. but I suspect that somewhere in game theory/game dynamics, there’s an insight about how to structure competition in constructed processes that allows describing how to teach the universe to remember everything anyone ever considered beautiful, or something along those lines.
If this thread is of interest, I’d like to discuss it with more people. I’ve got some links in other posts as well.
I’m interested in this line of reasoning. I can’t really say much in response right now, but I just read the paper you linked (they write such clear and, heh, easily interpretable papers, don’t they?), and I have strong opinions about “the correct value system” being rooted in maximizing some weighted sum of the “autonomy” of all living / agentic / intelligent systems, which it seems like you’re gesturing towards as well. I’m interested in trying to figure out how to formalize this.