I’ve been thinking along similar lines recently. One possible path to AI safety I’ve been considering extends this:
A promising concrete endgame story along these lines is Ought’s plan to avoid the dangerous attractor state of AI systems that are optimized end-to-end
Technological Attractor: Off-the-shelf subsystems
One possible tech-tree path is that we start building custom silicon to implement certain subsystems in an AI agent. These components would be analogous to functional neural regions of the human brain such as the motor cortex, visual system, etc. -- the key hypothesis being that once we reach a certain level of model complexity, the benefits of training a model end-to-end are no longer worth the cost of re-learning all of these fundamental structures, and furthermore that you can get much better performance-per-cost by baking these modular, reusable components into an ASIC. This could be a more feasible way of achieving something like Microscope AI.
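To make the shape of this concrete, here is a minimal software-only sketch (all module names, dimensions, and the brain-region analogy are illustrative assumptions, not a real component standard): a composite agent built from fixed, off-the-shelf components with a thin trainable top layer.

```python
import torch
import torch.nn as nn

class OffTheShelfComponent(nn.Module):
    """Stand-in for a fixed, vendor-supplied subsystem (e.g. an ASIC 'visual cortex').
    Its weights ship with the component and are never trained by the end user."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))
        for p in self.parameters():
            p.requires_grad = False  # frozen: behaves like a fixed function, as hardware would

    def forward(self, x):
        return self.net(x)

class Agent(nn.Module):
    """A thin, task-specific top layer stitched over fixed components."""
    def __init__(self):
        super().__init__()
        self.visual_cortex = OffTheShelfComponent(in_dim=3 * 64 * 64, out_dim=128)
        self.working_memory = OffTheShelfComponent(in_dim=128, out_dim=64)
        self.top_layer = nn.Linear(64, 10)  # the only part trained per use case

    def forward(self, pixels):
        x = self.visual_cortex(pixels.flatten(1))
        x = self.working_memory(x)
        return self.top_layer(x)
```

The hardware version would replace the frozen modules with ASIC calls, but the safety-relevant structure is the same: most of the capacity lives in standardized, inspectable components.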
Given a few such low-level components, we could enter a technological attractor where reaching SOTA through another approach requires either i) throwing multiple OOMs more compute than the previous SOTA at re-training your own copy of these components inside an end-to-end model, or ii) building your own chip fab to implement your custom component design. Both could be high enough barriers that in practice market participants simply use the off-the-shelf components. In this attractor, more R&D goes into building better-performing, higher-abstraction components that can be combined arbitrarily for different tasks.
Given such off-the-shelf components, say, a standard visual cortex for image recognition, or a 3D spatial reasoning system which can be connected atop a visual cortex, or a generic working memory, etc., it becomes much more tractable to engage in interpretability work that can convince us of the safety of these components. That then just leaves the problem of proving the highest-level network is safe (i.e. whatever is trained to use these components). This is by no means trivial, but for example the ELK paper spent a lot of time worrying about whether you can even trust the basic ontology of an agent, and I think this path would make Ontology Identification tractable. (Of course, this by construction excludes a number of possible agent-configurations).
Now, it’s possible one could plug in these components and still accidentally have the trained agent learn its own replacement visual cortex that is deceptive and not interpretable, but I think we can get an intuition for when this might happen. If the best off-the-shelf visual cortex module has complexity on the order of N parameters, and our new model that stitches together these lower-level modules has N parameters or fewer, then it seems implausible that it has somehow learned a new visual cortex network that is much more compact/efficient than the previous state of the art—increasingly implausible the more resources are devoted to optimizing these components. (It would be like someone coming along today and claiming a 10x improvement in CPU clock speed; given the huge research budgets and historically gradual capability increases, we don’t expect this to happen.) On the other hand, we should be suspicious of an N×10² parameter model where we are unable to interpret what 10% of the network is doing; that’s easily enough complexity to hide its own unsafe re-implementation of our components. (I’m aware that there’s a lot of ground between these two points; I’m merely trying to illustrate that there is “likely safe” and “likely unsafe” ground, rather than claim exactly how big each region is.)
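As a back-of-the-envelope version of that heuristic (the numbers are purely illustrative, not measurements of any real system):

```python
# "Room to hide a replacement" heuristic from the paragraph above.
N = 1_000_000_000          # parameters in the best off-the-shelf visual cortex

def suspicious(top_layer_params, uninterpreted_fraction):
    """Flag models whose uninterpreted capacity is large enough to conceal
    an unsafe re-implementation of a component of size N."""
    opaque = top_layer_params * uninterpreted_fraction
    return opaque >= N

print(suspicious(top_layer_params=N, uninterpreted_fraction=0.10))        # False: likely-safe regime
print(suspicious(top_layer_params=100 * N, uninterpreted_fraction=0.10))  # True: enough room to hide a new cortex
```

The threshold itself is a judgment call; the point is only that parameter budgets give a crude but legible signal for where to be suspicious.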
The final step here is the shakiest. It’s not clear to me that we can keep the “top layer” (the actual network stitching together the low-level components; perhaps the neocortex, by analogy to human neural architecture?) thin enough that it is obviously not learning its own unsafe component-replacements. However, I think this framework at least paints a picture of a “known safe”, or at least “likely safe”, path to AGI: if we see that practical engineering and economic decisions produce thin top-layer models using thick component layers, then we can devote energy to proving the components are safe/interpretable by construction, and to interpreting the top-level networks that consume the lower-level components. AGI “neurobiology” will be much more tractable if the “neural architecture” is relatively standardized. So this could be a good place to provide an early nudge to tip the system into this attractor; heavy investment in research on componentized NN architectures could be viewed as “gain of function” research, but it could also have a much safer endpoint.
Another way of thinking about this is that by crystallizing at least some parts of the AGI’s network into slowly-changing structures, we allow time to thoroughly test those parts. It seems very hard to thoroughly test models for safety in a paradigm where the whole model is potentially retrained regularly.
Interesting; I haven’t seen anyone write about hardware-enabled attractor states, but they do seem very promising because of just how decisive hardware is in determining which algorithms are competitive. An extreme version of this would be specialized hardware letting CAIS outcompete monolithic AGI. But even weaker versions would lead to major interpretability and safety benefits.
One other thought after considering this a bit more—we could test this now using software submodules. It’s unlikely to perform better (since no hardware speedup) but it could shed light on the tradeoffs with the general approach. And as these submodules got more complex, it may eventually be beneficial to use this approach even in a pure-software (no hardware) paradigm, if it lets you skip retraining a bunch of common functionality.
I.e. if you train a sub-network for one task, then incorporate that in two distinct top-layer networks trained on different high-level goals, do you get savings by not having to train two “visual cortexes”?
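A minimal sketch of that experiment (toy dimensions, made-up tasks; the point is just the shape of the reuse):

```python
import torch
import torch.nn as nn

# One shared sub-network ("visual cortex"), pretrained once, then frozen.
shared_cortex = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 128))
for p in shared_cortex.parameters():
    p.requires_grad = False

# Two distinct top-layer heads trained on different high-level goals,
# both consuming the same frozen sub-network.
head_a = nn.Linear(128, 10)   # e.g. object classification
head_b = nn.Linear(128, 4)    # e.g. choosing among four actions

opt_a = torch.optim.Adam(head_a.parameters(), lr=1e-3)
opt_b = torch.optim.Adam(head_b.parameters(), lr=1e-3)

def train_step(head, opt, x, y):
    with torch.no_grad():                    # no gradients through the shared component
        features = shared_cortex(x)
    loss = nn.functional.cross_entropy(head(features), y)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Example usage with random data standing in for two different tasks:
train_step(head_a, opt_a, torch.randn(32, 784), torch.randint(0, 10, (32,)))
train_step(head_b, opt_b, torch.randn(32, 784), torch.randint(0, 4, (32,)))
```

If the savings are real, they should show up as the second task needing only the (cheap) head training, and the shared features could even be precomputed and cached.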
This is in a similar vein to Google’s foundation models, where they train one jumbo model that then gets specialized for each use case. Can that foundation model be modularized? (Maybe for relatively narrow use cases like “text comprehension” it’s actually reasonable to think of a foundation model as a single submodule, but I think they are quite broad right now.) The big difference, I think, is that all the weights are mutable in the “refine the foundation model” step?
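To spell out that difference with a sketch (using a generic transformer stand-in, not any particular lab’s model or API):

```python
import torch.nn as nn

# A stand-in "foundation model" plus a small task-specific head.
foundation = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6)
task_head = nn.Linear(512, 2)

# "Refine the foundation model": every weight is mutable, so each refined copy
# can drift arbitrarily far from the artifact that was originally interpreted.
full_finetune_params = list(foundation.parameters()) + list(task_head.parameters())

# Modularized alternative: the foundation is frozen (a slow-moving, shared
# component) and only the thin head is trained per use case.
for p in foundation.parameters():
    p.requires_grad = False
frozen_component_params = list(task_head.parameters())

print(sum(p.numel() for p in full_finetune_params), "trainable weights if everything is mutable")
print(sum(p.numel() for p in frozen_component_params), "trainable weights in the modular setup")
```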
Perhaps another concrete proposal for a technological attractor would be to build a SOTA foundation model and make it so good that the community uses it instead of training their own; that would also give a slower-moving architecture/target to interpret.
Another way of thinking about this is that by crystallizing at least some parts of the AGI’s network into slowly-changing structures, we allow time to thoroughly test those parts. It seems very hard to thoroughly test models for safety in a paradigm where the whole model is potentially retrained regularly.
We need to test designs, and specifically alignment designs, but giving up retraining (i.e. lifetime learning) and burning circuits into silicon is unlikely to be competitive; that’s throwing out the baby with the bathwater.
Also, whether a design is aligned or not isn’t the type of question one can answer by analyzing the agent’s visual cortex; it’s almost purely a function of what is steering the planning system.
Would you care to flesh this assertion out a bit more?
To be clear, I’m not suggesting that this is optimal now. I’m merely speculating that there might be a point between now and AGI where the work to train these subcomponents becomes so substantial that it becomes economical to modularize.
whether a design is aligned or not isn’t the type of question one can answer by analyzing the agent’s visual cortex
As I mentioned earlier in my post, I was alluding to the ELK paper with that reference, specifically Ontology Identification. Obviously you’d need higher-order components too. Like I said, I am imagining here that the majority of the model is “off the shelf”, and just a thin layer is use-case-specific.
To make this more explicit: if you had not only an off-the-shelf visual cortex, but also spatio-temporal reasoning modules built atop it (as the human brain does), then you could point your debugger at the contents of that module and understand what entities in space were being perceived at what time. The mapping of “high-level strategies” to “low-level entities” would be a per-model bit of interpretability work, but it should become more tractable to the extent that those low-level entities are already mapped and understood.
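Concretely, “point your debugger at the contents of that module” could be as simple as a forward hook on the standardized component. Reusing the toy `Agent` from the earlier sketch, with `working_memory` standing in for the spatio-temporal module (again, all names and shapes are hypothetical):

```python
import torch

agent = Agent()
observation = torch.randn(1, 3 * 64 * 64)

captured = {}

def save_output(module, inputs, output):
    # With standardized components this output format would be documented and stable,
    # e.g. entity embeddings with positions and timestamps.
    captured["entities"] = output.detach()

hook = agent.working_memory.register_forward_hook(save_output)
_ = agent(observation)   # run the model as usual
hook.remove()

print(captured["entities"].shape)  # inspect what the component "perceived" on this input
```

The per-model work is then mapping the top layer’s strategies onto these already-understood entity representations, rather than reverse-engineering the perception stack from scratch each time.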
So for the explicit problem that the ELK paper was trying to solve, if you are confident you know what underlying representation SmartVault is using, it’s much easier to interpret its higher-level actions/strategies.