I think you can loosen (b) quite a bit if you task a separate model with “delineating” the concept in the new network. The procedure does effectively give you access to infinite data, so the boundary for the old concept in the new model can be as complicated as your compute budget allows, up to and including identifying high-level concepts in low-level physics simulations.
We currently have no criteria by which to judge the performance of such a separate model. What do we train it to do, exactly? We could make up some ad-hoc criterion, but that suffers from the usual problem of ad-hoc criteria: we won’t have a reliable way to know in advance whether it will or will not work on any particular problem or in any particular case.
The way I was envisioning it is that if you had some easily identifiable concept in one model, e.g. a latent dimension/feature that corresponds to the log odds of something being in a picture, you would train the separate model to match the behaviour of that feature when given data from the original generative model. Theoretically any loss function will do, as long as its optimum corresponds to the situation where your “classifier” behaves exactly like the original feature in the old model when both of them are looking at the same data.
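To make that concrete, here is a minimal sketch of the matching step, under some loud assumptions: `sample_world`, `old_feature` and `new_representation` are hypothetical stand-ins for the generative model, the old model's log-odds feature, and the new model's activations, and the "separate model" is just a linear probe fit by least squares. Squared error is only one of the many losses whose optimum is exact agreement with the old feature.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_world(n):
    # Hypothetical generative model: n world states with 4 latent variables each.
    return rng.normal(size=(n, 4))

def old_feature(x):
    # Hypothetical old-model feature: log odds that the concept is present.
    return 2.0 * x[:, 0] - 1.0 * x[:, 1]

# Hypothetical new model: the same world states seen through a different,
# 8-dimensional internal representation (a fixed random linear map here).
M = rng.normal(size=(4, 8))

def new_representation(x):
    return x @ M

def fit_probe(n=5000):
    # Sample fresh data from the generative model and regress the old
    # feature's log odds on the new model's activations (plus a bias term).
    x = sample_world(n)
    h = np.hstack([new_representation(x), np.ones((n, 1))])
    w, *_ = np.linalg.lstsq(h, old_feature(x), rcond=None)
    return w

def probe(w, x):
    # The "classifier": predicted log odds read off the new representation.
    return new_representation(x) @ w[:-1] + w[-1]

w = fit_probe()
x_test = sample_world(1000)
max_err = np.max(np.abs(probe(w, x_test) - old_feature(x_test)))
```

In this toy setup the old feature is exactly linear in the new representation, so `max_err` comes out at numerical noise. With a real network the probe would need more capacity and the residual would not vanish, which is where the questions about the objective start to matter.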
In practice, though, we’re compute-bound and nothing is perfect, so you need to answer other questions to determine the objective. Most of them will be related to why you need to be able to point at the original concept of interest in the first place. The acceptability of misclassifying any given input or world-state as being or not being an example of the category of interest is going to depend heavily on things like the cost of false positives/negatives and exactly which situations get misclassified by the model.
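One way to picture the false positive/negative trade-off is a cost-weighted choice of decision threshold. This is only a sketch: the function names, the cost values, and the convention that the old feature's log odds being above zero counts as "concept present" are all assumptions made for illustration.

```python
import numpy as np

def expected_cost(pred_logodds, old_logodds, threshold, cost_fp=1.0, cost_fn=5.0):
    # Treat the old feature (log odds > 0) as ground truth for the concept.
    pred = pred_logodds > threshold
    truth = old_logodds > 0.0
    false_pos = np.sum(pred & ~truth)
    false_neg = np.sum(~pred & truth)
    # Average cost per example; cost_fp and cost_fn are made-up numbers.
    return (cost_fp * false_pos + cost_fn * false_neg) / len(truth)

def best_threshold(pred_logodds, old_logodds, candidates, **costs):
    # Pick the candidate threshold with the lowest average cost.
    per_threshold = [
        expected_cost(pred_logodds, old_logodds, t, **costs) for t in candidates
    ]
    return candidates[int(np.argmin(per_threshold))]

old = np.array([-2.0, -1.0, 1.0, 2.0])    # old feature's log odds
pred = np.array([-2.0, 0.5, -0.5, 2.0])   # imperfect probe's log odds
chosen = best_threshold(pred, old, [-1.0, 0.0, 1.0])
```

With false negatives five times as costly as false positives, the lower threshold wins here, because it trades the one missed positive for nothing worse than the false positive it already had. This only addresses the aggregate cost, not the question of exactly which situations get misclassified.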
The point about knowing whether it works is a good one, though: confirming that we’ve successfully mapped a concept would require a degree of testing, and possibly human judgement. You could do this by looking for situations where the new and old concepts don’t line up and seeing what inputs/world-states those correspond to, possibly interpreted through the old model with its more human-understandable concepts.
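The disagreement search could be sketched roughly like this. The function `find_disagreements` and the two toy concepts are hypothetical; the idea is just to sample world states, keep the ones where the old and new concepts give different labels, and surface the largest gaps for a human to inspect.

```python
import numpy as np

def find_disagreements(old_logodds, new_logodds, inputs, top_k=5):
    # Indices where the two concepts assign different labels.
    idx = np.flatnonzero((old_logodds > 0) != (new_logodds > 0))
    # Rank those cases by how far apart the two log-odds values are.
    gap = np.abs(old_logodds[idx] - new_logodds[idx])
    order = idx[np.argsort(-gap)][:top_k]
    return inputs[order], old_logodds[order], new_logodds[order]

rng = np.random.default_rng(1)
x = rng.normal(size=(2000, 2))        # sampled world states (toy stand-in)
old = 3.0 * x[:, 0]                   # old concept's log odds
new = 3.0 * x[:, 0] + 1.5 * x[:, 1]   # new concept leaks in a second variable

worst_x, worst_old, worst_new = find_disagreements(old, new, x)
```

In the toy case the worst disagreements all have large values of the second variable, which is the sort of pattern you would hope a human could spot when looking at the corresponding inputs.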
I will admit upon further reflection that the process I’m describing is hacky, but I’m relatively confident that the general idea would be a good approach to cross-model ontology identification.