The idea that alignment research constitutes some kind of “search for a solution” within the discipline of cognitive science is wrong.
Let’s stick with the Open Agency Architecture (OAA), not because I particularly endorse it (actually, I have nothing to say about it), but because it suits my purposes.
We need to predict the characteristics that civilisational intelligence architectures like OAA will have: robustness of alignment with humans; robustness/resilience in a more general sense (for example, in the face of external shocks, such as a supernova explosion close to the Solar system); scalability beyond Earth (and, perhaps, beyond the Solar system); flexibility of values (i.e., the characteristic opposite to value lock-in, Riedel (2021)); the presence and characteristics of global consciousness; and, perhaps, other ethical desiderata. And we need to do this not just for OAA, but for many different proposed architectures.
Analysing all these candidate architectures requires a lot of multi-disciplinary work, most of which could be done right now by applying the current state-of-the-art theories in the respective disciplines (the disciplines I enumerated in the comment above, and more). Furthermore, we shouldn’t apply only a single theory of, for example, cognitive science or ethics or robust control in our analysis: we absolutely have to apply different “competing” theories in the respective disciplines, e.g., competing theories of intelligence/agency, to see what predictions these different theories yield and to increase our down-the-line chances of survival. (There are at least five promising general theories of intelligence/agency[1] apart from Infra-Bayesianism and other frameworks developed by AI alignment researchers. Of course, there are even more theories of ethics, and there are multiple theories of robust control.)
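To make the shape of this proposed analysis concrete, here is a minimal sketch (in Python, purely illustrative: the `Prediction` record, the function names, and every entry are my own hypothetical placeholders, not anything specified by OAA or by the theories mentioned) of the architecture × theory × desideratum grid that such multi-disciplinary work would gradually fill in:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    """What one theory, applied to one architecture, says about one desideratum."""
    verdict: str          # e.g. "robust", "fragile", "unknown"
    confidence: float     # subjective credence in the verdict, 0..1
    notes: str = ""       # caveats, assumptions, open questions

# Candidate civilisational intelligence architectures to analyse.
architectures = ["OAA", "some-other-proposed-architecture"]

# Competing theories of intelligence/agency (the five academic ones from
# footnote [1], plus theories developed within the alignment community).
theories = ["Active Inference", "MCR^2", "Thermodynamic ML",
            "Bengio-style", "LeCun-style", "Infra-Bayesianism"]

# Desiderata to predict for each architecture.
desiderata = ["alignment robustness", "resilience to external shocks",
              "scalability beyond Earth", "flexibility of values (no lock-in)",
              "global consciousness"]

# The grid itself: (architecture, theory, desideratum) -> Prediction.
grid: dict[tuple[str, str, str], Prediction] = {}

# Filling in even one cell is a research project; the placeholder below only
# shows the shape of an entry, not an actual result.
grid[("OAA", "Active Inference", "alignment robustness")] = Prediction(
    verdict="unknown", confidence=0.3, notes="placeholder, not an actual result")

def predictions_for(arch: str, desideratum: str) -> list[tuple[str, Prediction]]:
    """Collect what each theory predicts for one architecture/desideratum pair,
    so cross-theory disagreements become visible."""
    return [(t, p) for (a, t, d), p in grid.items()
            if a == arch and d == desideratum]
```

The point is not the code, of course, but the observation that the work factors into a large cross-product of cells, that disagreements between theories about the same cell are exactly the predictions we want to surface, and that many cells can be attacked right now with existing theories.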
It doesn’t matter that these theories are currently all “half-baked” and sometimes “off”: all naturalistic theories are “wrong” and will remain wrong (Deutsch even suggested calling them “misconceptions” instead of “theories”). Ultimately, building aligned AGI is an engineering, and hence naturalistic, endeavour, so creating any single mathematical theory, no matter how self-consistent, could be just a small part of the story. You also need a naturalistic theory (or, realistically, a patchwork of many theories from many disciplines; again, see above) of how the mathematical construct is implemented in real life, whether in computers, people, their groups and interactions, etc. (Yes, I oppose Vanessa Kosoy’s “cryptography” analogy here; I think it’s methodologically confused.)
So, how much of alignment researchers’ effort should go into building a multi-disciplinary (and, within some particularly “important” disciplines such as cognitive science, cross-theoretic!) understanding of various alignment proposals/paradigms (a.k.a. civilisational intelligence architectures; I maintain that any alignment proposal that aims lower than that is confused about what we are really trying to do), and how much should go towards developing new theories of intelligence/cognition/agency, in addition to the at least five academic ones (and I count only serious, well-developed ones) and maybe five more already proposed within the alignment community[2]? What is the ROI of each type of research? Which type of research attacks (buys down) larger chunks of risk? Which type of research is more likely to be superseded later, and which will likely remain useful regardless?
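As a toy illustration of how one might even begin to compare these, here is a back-of-envelope sketch; the formula and every number below are placeholder assumptions of mine, not estimates made in this comment:

```python
# Toy framing: value of a research type ~ expected risk bought down per unit of
# effort, discounted by the chance the work is later superseded and then
# contributes nothing (a deliberately crude assumption).

def research_value(risk_bought_down: float, p_superseded: float, effort: float) -> float:
    """Expected risk reduction per unit of effort under the crude model above."""
    return risk_bought_down * (1.0 - p_superseded) / effort

# Hypothetical placeholder inputs -- they only show how the comparison is set
# up, not what the actual numbers are.
multidisciplinary_analysis = research_value(risk_bought_down=0.10, p_superseded=0.2, effort=1.0)
new_cognitive_theories = research_value(risk_bought_down=0.10, p_superseded=0.6, effort=1.0)

print(multidisciplinary_analysis, new_cognitive_theories)
```

Whatever numbers one plugs in, making the comparison explicit at least forces the “superseded later vs. useful regardless” question into the open.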
If you consider all these questions, I think you will arrive at the conclusion that much more effort should go into multi-disciplinary R&D on alignment proposals/paradigms/plans/architectures than into attempts at creating new cognitive science theories. This is especially so because academics already do the second type of work (after all, academics came up with those “five theories”[1], and there is a lot of work behind each of them) but don’t do the first kind of research. It’s up to alignment researchers alone to do it.
To sum up: it’s not me who suggests “narrowing the search”. The kind of search you were hinting at (within the disciplines of cognitive science and rationality) is already narrow by design. I rather suggest widening the perspective and the range of disciplines that AI alignment researchers seriously engage with.
Note that in this question, I don’t consider how the total effort that should go into either of these research types compares with the effort that goes into mechanistic interpretability research. I don’t know.
[1] These five theories, for reference, are Active Inference (Fields et al. (2022), Friston et al. (2022)), MCR^2 (Ma et al. (2022)), thermodynamic ML (Boyd et al. (2022)), “Bengio’s views on intelligence/agency” (see, for example, Goyal & Bengio (2022)), and “LeCun’s views on intelligence/agency” (LeCun (2022)). Maybe I’m still missing important theories; please let me know.