Another way of thinking about this is that by crystallizing at least some parts of the AGI’s network into slowly-changing structures, we allow time to thoroughly test those parts. It seems very hard to thoroughly test models for safety in a paradigm where the whole model is potentially retrained regularly.
We need to test designs, and most especially alignment designs, but giving up retraining (i.e. lifetime learning) and burning circuits into silicon is unlikely to be competitive; that would be throwing out the baby with the bathwater.
Also, whether a design is aligned or not isn’t the type of question one can answer by analyzing the agent’s visual cortex; it’s almost purely a function of what is steering the planning system.
Would you care to flesh this assertion out a bit more?
To be clear, I’m not suggesting that this is optimal now. I’m merely speculating that there might be a point between now and AGI where the work to train these subcomponents becomes so substantial that it becomes economical to modularize.
whether a design is aligned or not isn’t the type of question one can answer by analyzing the agent’s visual cortex
As I mentioned earlier in my post, I was alluding to the ELK paper with that reference, specifically Ontology Identification. Obviously you’d need higher-order components too. As I said, I am imagining here that the majority of the model is “off the shelf”, and just a thin layer is use-case-specific.
To make this more explicit, if you had not only an off-the-shelf visual cortex but also spatio-temporal reasoning modules built atop it (as the human brain does), then you could point your debugger at the contents of that module and understand what entities in space were being perceived at what time. The mapping of “high-level strategies” to “low-level entities” would be a per-model bit of interpretability work, but it should become more tractable to the extent that those low-level entities are already mapped and understood.
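To make the shape of this a bit more concrete, here is a minimal Python sketch of the kind of inspectable intermediate representation I’m gesturing at. Everything in it (PerceivedEntity, SpatioTemporalModule, the inspect hook) is a hypothetical illustration of the idea, not any existing system’s API:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PerceivedEntity:
    """One item in the module's spatio-temporal scratchpad (hypothetical)."""
    label: str                            # e.g. "diamond", "intruder"
    position: Tuple[float, float, float]  # (x, y, z) in the module's spatial frame
    timestamp: float                      # when the entity was last perceived

class SpatioTemporalModule:
    """An 'off-the-shelf' reasoning module sitting atop a shared visual cortex.

    Because its representation is fixed and well characterized, an external
    tool can read its contents directly rather than reverse-engineering
    learned activations from scratch.
    """
    def __init__(self) -> None:
        self._entities: List[PerceivedEntity] = []

    def update(self, entities: List[PerceivedEntity]) -> None:
        # In a real system this would be fed by the perception stack;
        # here it simply stores whatever the caller hands it.
        self._entities = list(entities)

    def inspect(self) -> List[PerceivedEntity]:
        # The "debugger" hook: what does the agent currently believe is
        # where, and when was it last seen?
        return list(self._entities)

# A thin, use-case-specific policy layer would map high-level strategies onto
# these low-level entities; the per-model interpretability work then reduces
# to auditing that mapping rather than the whole network.
module = SpatioTemporalModule()
module.update([PerceivedEntity("diamond", (0.0, 1.0, 0.5), timestamp=12.3)])
for entity in module.inspect():
    print(entity.label, entity.position, entity.timestamp)
```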
So for the explicit problem that the ELK paper was trying to solve, if you are confident you know what underlying representation SmartVault is using, it’s much easier to interpret its higher-level actions/strategies.