Recently I’ve been thinking about ML systems that generalize poorly (copying human errors) because of either re-using predictive models of humans or using human inference procedures to map between world models.
My initial focus was on preventing re-using predictive models of humans. But I’m feeling increasingly like there is going to be a single solution to the two problems, and that the world-model mismatch problem is a good domain to develop the kind of algorithm we need. I want to say a bit about why.
I’m currently thinking about dealing with world model mismatches by learning a correspondence between models using something other than a simplicity prior / training a neural network to answer questions. Intuitively we want to do something more like “lining up” the two models and seeing which parts correspond to which others. We have a lot of conditions/criteria for such alignments, so we don’t necessarily have to just stick with simplicity. This comment fleshes out one possible approach a little bit.
If this approach succeeds, then it is also directly applicable to avoiding re-using human models: we want to be lining up the internal computation of our model with concepts like “There is a cat in the room” rather than just asking the model to predict whether there is a cat however it wants (which it may do by copying a human labeler). And on the flip side, I think that the “re-using human models” problem is a good constraint to have in mind when thinking about ways to do this correspondence. (Roughly speaking, because something like computational speed or “locality” seems like a really central constraint for matching up world models, and taking that approach naively can greatly exacerbate the problems with copying the training process.)
So for now I think it makes sense for me to focus on whether learning this correspondence is actually plausible. If that succeeds, then I can step back and see how that changes my overall view of the landscape (I think it might be quite a significant change), and if it fails, then I hope to at least know a bit more about the world model mismatch problem.
I think the best analogy in existing practice is probably interpretability work: matching up the AI’s model to my model is kind of like looking at neurons and trying to make sense of what they are computing (or looking for neurons that compute something). And giving up on a “simplicity prior” is very natural when doing interpretability, where other considerations are used instead to determine whether a correspondence is good. It still seems kind of plausible that in retrospect my current work will look like it was trying to get a solid theoretical picture of what interpretability should do (including in the regime where the correspondence is quite complex, and when the goal is a much more complete level of understanding). I swing back and forth on how strong the analogy to interpretability seems / whether or not this is how it will look in retrospect. (But at any rate, my research methodology feels like a very different approach to similar questions.)