I continue to think that agent foundations research is kind of underrated. Like, we’re supposed to do mechinterp to understand the algorithms models implement—but how do we know what algorithms are good?
It additionally seems likely to me that we are presently missing major parts of a decent language for talking about minds/models, and that developing such a language requires (and would constitute) significant philosophical progress. There are ways to ‘understand the algorithm a model is’ that are highly insufficient/inadequate for doing what we want to do in alignment — for instance, even if one gets from where interpretability is currently to being able to replace a neural net by a somewhat smaller boolean (or whatever) circuit, and is thus able to translate various NNs to such circuits and proceed to stare at them, one probably won’t thereby be more than 1/10 of the way to the kind of strong understanding that would let one modify a NN-based AGI to be aligned, or build another aligned AI (in case alignment doesn’t happen by default) (much like how knowing the weights doesn’t deliver that kind of understanding).

To even get to the point where we can usefully understand the ‘algorithms’ models implement, I feel like we might need to have answered something like: (1) what kind of syntax should we see thinking as having — for example, should we think of a model/mind as a library of small programs/concepts that are combined, updated, and created according to certain rules (Minsky’s frames?), or as having a certain kind of probabilistic world model that supports planning in a certain way, or as reasoning in a certain internal logical language, or in terms of having certain propositional attitudes; and (2) what kind of semantics should we see thinking as having — what kind of correspondence between the internals of the model/mind and the external world should we see a model as maintaining (also: wtf are values?).

I think that trying to find answers to these questions by ‘just looking’ at models in some ML-brained, non-philosophical way is unlikely to be competitive with trying to answer them with an attitude of taking philosophy (agent foundations) seriously, because one will only have any hope of seeing the cognitive/computational structure in a mind/model by staring at it if one already has some right ideas about what kind of structure to look for. For example, it’d be very tough to discover [first-order logic]/ZFC/[type theory] by staring at the weights/activations/whatever of the brain of a human mathematician doing mathematical reasoning, from a standpoint where one hasn’t already invented [first-order logic]/ZFC/[type theory] via some other route — if one starts from the low-level structure of a brain, then first-order logic will only appear as being implemented in the brain in some ‘highly encrypted’ way.
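To make the circuit point a bit more concrete, here is a minimal toy sketch (a hypothetical illustration I’m adding, not anyone’s actual interpretability pipeline): even a fully transparent gate-level description of a tiny ‘network’ doesn’t announce the higher-level algorithm it implements unless you already have the right concept to look for.

```python
# Toy illustration (hypothetical, not a real interpretability method):
# a tiny "network" computes 3-bit parity out of simple gates. Dumping it
# as a gate list / truth table gives a complete low-level description,
# but the summary "this is parity" only appears if you already possess
# 'parity' as a candidate concept to match against.

from itertools import product

def tiny_net(x1, x2, x3):
    # stand-in for a trained net: a fixed composition of simple gates
    h1 = x1 ^ x2          # "neuron" 1: XOR of the first two inputs
    h2 = h1 & ~x3 & 1     # "neuron" 2: h1 AND NOT x3
    h3 = ~h1 & x3 & 1     # "neuron" 3: NOT h1 AND x3
    return h2 | h3        # output gate: h1 XOR x3 overall

# "Interpretability" as circuit extraction: enumerate the full truth table.
truth_table = {bits: tiny_net(*bits) for bits in product((0, 1), repeat=3)}
for bits, out in truth_table.items():
    print(bits, "->", out)

# The table is fully transparent, yet recognizing it as parity requires
# already having the concept -- analogous to needing first-order logic
# before you can find it inside a mathematician's brain.
assert all(out == sum(bits) % 2 for bits, out in truth_table.items())
```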
There’s really a spectrum of claims here that would, to varying degrees, support the claim that agent foundations is good for understanding the ‘algorithm’ a model/mind is. A stronger claim than what I’ve been arguing for is that once one has these ideas, one needn’t stare at models at all, and that staring at models is unlikely to help one get the right ideas (e.g. because it’s better to stare at one’s own thinking instead, and to think about how one could/should think, sort of like how [first-order logic]/ZFC/[type theory] was invented), so that one’s best strategy does not involve staring at models; a weaker claim than what I’ve been arguing for is that having more and better ideas about the structure of minds would be helpful when staring at models. I like TsviBT’s koan on this topic.
Not only “good”, but also “obedient”, “non-deceptive”, “minimal impact”, and “behaviorist” (and don’t even talk about “mindcrime”).
In this sense, agent foundations research seems similar to research on normative ethics.