If the interactions of AGI necessarily converge on total human extinction, beyond theoretical limits of controllability, and mechanistic interpretability methods cannot prevent that convergence, then these (or other “inspect internals”) methods cannot contribute to long-term AGI safety. This is not idle speculation, nor is it based on prima facie arguments; it is based on 15 years of research by a polymath working outside this community.
In that sense, it would not really matter that mechanistic interpretability can do an okay job at detecting that a power-seeking AI was explicitly plotting to overthrow humanity.
That is, except in the extremely unlikely case you pointed to, where such intentions are detected in time and humans all coordinate at once to impose an effective moratorium on scaling or training larger models. But that scenario is speculation, whereas it is not speculation that OpenAI promoted Olah’s fascinating Microscope-generated images as evidence that they were making progress on understanding and aligning scalable ML models.
Overall, my sense is that mechanistic interpretability is used to align-wash capability progress towards AGI, while not contributing to safety where it predominantly matters.
> Removing intentional deception or harm greatly increases the capability of AIs that can be worked with without getting killed, to further improve safety measures.
Exactly this kind of thinking is what I am concerned about. It implicitly assumes that you have a (sufficiently) comprehensive and sound understanding of the ways humans would get killed at a given level of capability, and therefore can rely on that understanding to conclude that capabilities of AIs can be greatly increased without humans getting killed.
How do you think capability developers would respond to that statement? Would they just stay on the safe side, saying, “Well, those alignment researchers say that mechanistic interpretability helps remove intentional deception or harm, but I’m going to play it safe and not scale any further”? No, they are going to use your statement to promote the potential safety of their scalable models, and to remove whatever safety margin they feel justified in taking for themselves.
Not considering unknown unknowns is going to get us killed. Not considering what safety problems may be unsolvable is going to get us killed.
Age-old saying: “It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.”
> What is your background? How is it relevant to the work you are planning to do?
Years ago, we started with a strong focus on civilization design and mitigating x-risk. These are topics that require generalist capabilities across many fields, not just specialist capability in any one single field of study or application.

Hence, as generalists, we are not career mathematicians, nor career physicists, chemists, biologists, anthropologists, or even career philosophers. Yet when considering the needs of civ-design and/or x-risk, it is abundantly clear that some real skill and expertise is actually needed in all of these fields.
Understanding anything about x-risk and/or civilization means understanding key topics regarding large-scale institutional process, i.e., things like governments, businesses, universities, constitutional law, social contract theory, representative process, legal and trade agreements, etc.

Yet people who study markets, economics, and politics (theory of groups, firms, etc.) without some real grounding in actual sociology and anthropology are not going to understand why things happen in the real world the way they tend to.
And those people are going to need to understand things like psychology, developmental psych, theory of education, interpersonal relationships, attachment, social communication dynamics, health of family and community, trauma, etc.
And understanding *those* topics means having a real grounding in evolutionary theory, bio-systems, ecology, biology, neurochemistry and neurology, ecosystem design, permaculture, evolutionary psychology, theory of bias, etc.

It is hard to see how we could assess things like ‘sociological bias’ as impacting possible x-risk mitigation strategies if we do not also have some real, deep, informed, and realistic accounting of the practical real-world implications of *all* of these categories of ideas.
And yet, unfortunately, that is not all, since understanding *those* topics themselves means an even deeper grounding in things like organic and inorganic chemistry, cell process, and the underlying *physics* of all of that. This in turn requires a fairly general understanding of multiple diverse areas of physics (mechanical, thermal, electromagnetic, QM, etc.), and thus also of technology, since technology is directly connected to business, social systems, world-systems infrastructure, the internet, the electrical grid and energy management, transport (of fuel, materials, etc.), and even more politics, advertising and marketing, rhetorical process and argumentation, etc.

Oh, and of course, a deep and applied practical knowledge of ‘computer science’, since nearly everything in the above is in one way or another “done with computers”. That would, of course, also be relevant when considering the specific category of x-risk which happens to involve computational concepts: artificial superintelligence.
I *have* been a successful practicing engineer in both large-scale US-gov-deployed software and in product design shipped to millions. I have personally written more than 900,000 lines of code (mostly ANSI C, ASM, JavaScript) and have been ‘the principal architect’ on a team. I have developed my own computing environments, languages, procedural methodologies, and system management tactics, over multiple process technologies in multiple applied contexts. I have a reasonably thorough knowledge of CS, including the modeling math, control theory, etc. I.e., I am legitimately “full stack” in engineering terms, from the physics of transistors up through CPU design, firmware and embedded systems, OS-level work, application development, networking, user interface design, and the social-process implications of systems. I have similarly extensive accomplishments in some of the other listed disciplines.

As such, as a proven “career” generalist, I am also (though not just) a master craftsman, which includes things like practical knowledge of how to negotiate contracts, write all manner of documents, make all manner of things, *and* understand the implications of *all* of this in the real world.

For the broad category of valid and reasonable x-risk assessment, nothing less than at least some true depth in nearly *all* of these topics will do.
The above is from Math Expectations, a depersonalised post Forrest wrote of his impressions of a conversation with a grant investigator, who kept looping back on the expectation that a “proof” based on formal reasoning must be written in mathematical notation. We did end up receiving the $170K grant.
I usually do not mention Forrest Landry’s name immediately for two reasons:
If you google his name, he comes across like a spiritual hippie. Geeks who don’t understand his use of language take that as a cue that he must not know anything about computational science, mathematics, or physics (wrong – Forrest has deep insights into programming methods and, e.g., why Bell’s Theorem is a thing).
Forrest prefers to work on the frontiers of research, rather than repeat himself in long conversations with tech people who cannot let go of their own mental models and quickly jump to motivated counterarguments he has heard and addressed many times before. So I act as a bridge-builder, trying to translate between Forrest-speak and Alignment Forum-speak.
Both of us prefer to work behind the scenes. I’ve only recently started to touch on the arguments in public.
You can find those arguments elaborated on here. Warning: large inferential distance; do message me clarifying questions – I’m game!
Sorry if I missed it earlier in the thread, but who is this “polymath”?
Forrest Landry.