Would you care to flesh this assertion out a bit more?
To be clear, I’m not suggesting that this is optimal now; I’m merely speculating that there might be a point between now and AGI where the work to train these subcomponents becomes so substantial that it becomes economical to modularize.
whether a design is aligned or not isn’t the type of question one can answer by analyzing the agent’s visual cortex
As I mentioned earlier in my post, I was alluding to the ELK paper with that reference, specifically Ontology Identification. Obviously you’d need higher-order components too. Like I said, I’m imagining here that the majority of the model is “off the shelf”, and only a thin layer is use-case-specific.
To make this more explicit: if you had not only an off-the-shelf visual cortex but also spatio-temporal reasoning modules built atop it (as the human brain does), then you could point your debugger at the contents of that module and understand which entities in space were being perceived at what time. The mapping of “high-level strategies” to “low-level entities” would still be a per-model bit of interpretability work, but it should become more tractable to the extent that those low-level entities are already mapped and understood.
So for the explicit problem that the ELK paper was trying to solve, if you are confident you know what underlying representation SmartVault is using, it’s much easier to interpret its higher-level actions/strategies.
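To sketch what I have in mind in code (a toy illustration only; `SpatioTemporalModule`, `ThinUseCaseLayer`, and `Entity` are hypothetical names of mine, and the diamond/robber framing is just borrowed loosely from the SmartVault story, not from any actual implementation):

```python
# Hypothetical sketch: a modular agent whose off-the-shelf spatio-temporal
# module exposes a typed, human-readable intermediate representation.
# "Pointing your debugger at the module" then just means reading this
# structure, rather than probing raw activations.

from dataclasses import dataclass


@dataclass
class Entity:
    """One entity perceived by the (hypothetical) spatio-temporal module."""
    label: str                      # e.g. "diamond", "robber"
    position: tuple[float, float]   # where it is, in room coordinates
    timestamp: float                # when it was perceived


class SpatioTemporalModule:
    """Stand-in for an off-the-shelf, already-interpreted perception stack."""

    def perceive(self, raw_observation: list[dict]) -> list[Entity]:
        # In a real system this would sit on top of a pretrained visual cortex;
        # here we just convert pre-parsed observations into Entity records.
        return [
            Entity(o["label"], tuple(o["position"]), o["timestamp"])
            for o in raw_observation
        ]


class ThinUseCaseLayer:
    """The small per-application layer whose strategies we want to interpret."""

    def act(self, entities: list[Entity]) -> str:
        # Toy high-level strategy: close the vault if anything other than
        # the diamond shows up.
        intruders = [e for e in entities if e.label != "diamond"]
        return "close_vault" if intruders else "do_nothing"


if __name__ == "__main__":
    observation = [
        {"label": "diamond", "position": (0.0, 0.0), "timestamp": 0.0},
        {"label": "robber", "position": (3.2, 1.1), "timestamp": 0.5},
    ]
    perception = SpatioTemporalModule()
    policy = ThinUseCaseLayer()

    entities = perception.perceive(observation)
    # The "debugger view": the intermediate representation is already legible,
    # so interpreting the thin layer's action reduces to mapping its decision
    # back onto these known entities.
    for e in entities:
        print(f"t={e.timestamp}: {e.label} at {e.position}")
    print("action:", policy.act(entities))
```

The point of the sketch is just that the per-model interpretability work shrinks to the thin layer at the top, because the representation it consumes is shared, already mapped, and inspectable.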