John wrote an explosive postmortem on the alignment field, boldly proclaiming that almost all alignment research is trash. John held up the ILIAD conference [which I helped organize] as one of the few examples of places where research is going in the right direction. While I share some of his concerns about the field's trajectory, and I am flattered that ILIAD was appreciated, I feel ambivalent about ILIAD being pulled into what I can only describe as an alignment culture war.
There's plenty to criticise about mainstream alignment research but blanket dismissals feel silly to me? Sparse auto-encoders are exciting! Research on delegated oversight & safety-by-debate is vitally important. Scary demos aren't as exciting as Deep Science, but their influence on policy is probably much greater than that of a long-form essay on conceptual alignment. AI psychology doesn't align with a physicist's aesthetic, but as alignment is ultimately about the attitudes of artificial intelligences, maybe just talking with Claude about his feelings will prove valuable. There's lots of experimental work in mainstream ML on deep learning that will be key to constructing a grounded theory of deep learning. And I'm sure there is a ton more I am not familiar with.
Beyond being an unfair and uninformed dismissal of a lot of solid work, it risks unnecessarily antagonizing people—making it even harder to advocate for theoretical research like agent foundations.
Humility is no sin. I sincerely believe mathematical and theory-grounded research programmes in alignment are neglected, tractable, and important, potentially even crucial. Yet I'll be the first to acknowledge that there are many worlds in which it is too late or fundamentally unable to deliver on its promise while prosaic alignment ideas do. And in worlds in which theory does bear fruit, that will ultimately be through engaging with pretty mundane, prosaic things.
What's concerning is watching a certain strain of dismissiveness towards mainstream ideas calcify within parts of the rationalist ecosystem. As Vanessa notes in her comment, this attitude of isolation and attendant self-satisfied sense of superiority certainly isn't new. It has existed for a while around MIRI & the rationalist community. Yet it appears to be intensifying as AI safety becomes more mainstream and the rationalist community's relative influence decreases.[1]
I liked this comment by Adam Shai (shared with permission):
If one disagrees with the mainstream approach then it's on you (talking to myself!) to _show it_, or better yet to do the thing _better_. Being convincing to others often requires operationalizing your ideas in a tangible situation/experiment/model, and that actually isn't just a politically useful tool, it's one of the main mechanisms by which you can reality-check yourself. It's very easy to get caught up in philosophically beautiful ideas and to trick oneself. The test is what you can do with the ideas. The mainstream approach is great because it actually does stuff! It finds latents in actually existing networks, it shows by example situations that feel concerning, etc. etc.
I disagree with many aspects of the mainstream approach, but I also have a more global belief that the mainstream is a mainstream for a good reason! And those of us who disagree with it, or think too many people are going that route, should be careful not to box ourselves into a predetermined and permanent social role of "outsider who makes no real progress even if they talk about cool stuff".
[1] See also the confident pronouncements of certain doom in these quarters - surely just as silly as complete confidence in the impossibility of doom.
I think that there are two key questions we should be asking:
Where is the value of an additional researcher higher on the margin?
What should the field look like in order to make us feel good about the future?
I agree that "prosaic" AI safety research is valuable. However, at this point it's far less neglected than foundational/theoretical research, and the marginal benefits there are much smaller. Moreover, without significant progress on the foundational front, our prospects are going to be poor, ~no matter how much mech-interp and talking to Claude about feelings we do.
John has a valid concern that, as the field becomes dominated by the prosaic paradigm, it might become increasingly difficult to get talent and resources to the foundational side, or maintain memetically healthy coherent discourse. As to the tone, I have mixed feelings. Antagonizing people is bad, but there’s also value in speaking harsh truths the way you see them. (That said, there is room in John’s post for softening the tone without losing much substance.)
Scary demos aren't as exciting as Deep Science, but their influence on policy
There maybe should be a standardly used name for the field of generally reducing AI x-risk, which would include governance, policy, evals, lobbying, control, alignment, etc., so that “AI alignment” can be a more narrow thing. I feel (coarsely speaking) grateful toward people working on governance, policy, evals_policy, lobbying; I think control is pointless or possibly bad (makes things look safer than they are, doesn’t address real problem); and frustrated with alignment.
What’s concerning is watching a certain strain of dismissiveness towards mainstream ideas calcify within parts of the rationalist ecosystem. As Vanessa notes in her comment, this attitude of isolation and attendant self-satisfied sense of superiority certainly isn’t new. It has existed for a while around MIRI & the rationalist community. Yet it appears to be intensifying as AI safety becomes more mainstream and the rationalist community’s relative influence decreases
What should one do, who:
thinks that there are various specific major defeaters to the narrow project of understanding how to align AGI;
finds partial consensus with some other researchers about those defeaters;
upon explaining these defeaters to tens or hundreds of newcomers, finds that, one way or another, they apparently-permanently fail to avoid being defeated by those defeaters?
It sounds like in this paragraph your main implied recommendation is “be less snooty”. Is that right?
What is a defeater and can you give some examples?
A thing that makes alignment hard / would defeat various alignment plans or alignment research plans.
E.g.s: https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities#Section_B_
E.g. the things you're studying aren't stable under reflection.
E.g. the things you're studying are at the wrong level of abstraction (SLT, interp, neuro) https://www.lesswrong.com/posts/unCG3rhyMJpGJpoLd/koan-divining-alien-datastructures-from-ram-activations
E.g. https://tsvibt.blogspot.com/2023/03/the-fraught-voyage-of-aligned-novelty.html
This just in: Alignment researchers fail to notice skulls from famous blog post "Yes, we have noticed the skulls".
"E.g. the things you're studying are at the wrong level of abstraction (SLT, interp, neuro)"
Let's hear it. What do you mean here exactly?
From the linked post:
The first moral that I'd draw is simple but crucial: If you're trying to understand some phenomenon by interpreting some data, the kind of data you're interpreting is key. It's not enough for the data to be tightly related to the phenomenon, or to be downstream of the phenomenon, or enough to pin it down in the eyes of Solomonoff induction, or only predictable by understanding it. If you want to understand how a computer operating system works by interacting with one, it's far far better to interact with the operating system at or near the conceptual/structural regime at which the operating system is constituted.
What's operating-system-y about an operating system is that it manages memory and caching, it manages CPU sharing between processes, it manages access to hardware devices, and so on. If you can read and interact with the code that talks about those things, that's much better than trying to understand operating systems by watching capacitors in RAM flickering, even if the sum of RAM+CPU+buses+storage gives you a reflection, an image, a projection of the operating system, which in some sense "doesn't leave anything out". What's mind-ish about a human mind is reflected in neural firing and rewiring, in that a difference in mental state implies a difference in neurons. But if you want to come to understand minds, you should look at the operations of the mind in descriptive and manipulative terms that center around, and fan out from, the distinctions that the mind makes internally for its own benefit. In trying to interpret a mind, you're trying to get the theory of the program.
You’ll have to be a little more direct to get your point across I fear. I am sensing you think mechinterp, SLT, and neuroscience aren’t at a high enough level of abstraction. I am curious why you think so and would benefit from understanding more clearly what you are proposing instead.
They aren't close to the right kind of abstraction. You can tell because they use a low-level ontology, such that mental content, to be represented there, would have to be homogenized, stripped of mental meaning, and encoded. Compare trying to learn about arithmetic, and doing so by explaining a calculator in terms of transistors vs. in terms of arithmetic. The latter is the right level of abstraction; the former is wrong (it would be right if you were trying to understand transistors or trying to understand some further implementational aspects of arithmetic beyond the core structure of arithmetic).
What I'm proposing instead, is theory.
I think I disagree, or need some clarification. As an example, the phenomenon in question is that the physical features of children look more or less like combinations of the parents' features. Is the right kind of abstraction a taxonomy and theory of physical features at the level of nose-shapes and eyebrow thickness? Or is it at the low-level ontology of molecules and genes, or is it in the understanding of how those levels relate to each other?
Or is that not a good analogy?
I’m unsure whether it’s a good analogy. Let me make a remark, and then you could reask or rephrase.
The discovery that the phenome is largely a result of the genome, is of course super important for understanding and also useful. The discovery of mechanically how (transcribe, splice, translate, enhance/promote/silence, trans-regulation, …) the phenome is a result of the genome is separately important, and still ongoing. The understanding of “structurally how” characters are made, both in ontogeny and phylogeny, is a blob of open problems (evodevo, niches, …). Likewise, more simply, “structurally what”—how to even think of characters. Cf. Günter Wagner, Rupert Riedl.
I would say the "structurally how" and "structurally what" is most analogous. The questions we want to answer about minds aren't like "what is a sufficient set of physical conditions to determine—however opaquely—a mind's effects", but rather "what smallish, accessible-ish, designable-ish structures in a mind can [understandably to us, after learning how] determine a mind's effects, specifically as we think of those effects". That is more like organology and developmental biology and telic/partial-niche evodevo (<-made up term but hopefully you see what I mean).
https://tsvibt.blogspot.com/2023/04/fundamental-question-what-determines.html
I suppose it depends on what one wants to do with their "understanding" of the system? Here's one AI safety case I worry about: if we (humans) don't understand the lower-level ontology that gives rise to the phenomenon that we are more directly interested in (in this case I think that's something like an AI system's behavior/internal "mental" states—your "structurally what", if I'm understanding correctly, which to be honest I'm not very confident I am), then a sufficiently intelligent AI system that does understand that relationship will be able to exploit the extra degrees of freedom in the lower level ontology to our disadvantage, and we won't be able to see it coming.
I very much agree that structurally what matters a lot, but that seems like half the battle to me.
I very much agree that structurally what matters a lot, but that seems like half the battle to me.
But somehow this topic is not afforded much care or interest. Some people will pay lip service to caring, others will deny that mental states exist, but either way the field of alignment doesn’t put much force (money, smart young/new people, social support) toward these questions. This is understandable, as we have much less legible traction on this topic, but that’s… undignified, I guess is the expression.
a sufficiently intelligent AI system that does understand that relationship will be able to exploit the extra degrees of freedom in the lower level ontology to our disadvantage, and we won’t be able to see it coming.
Even if you do understand the lower level, you couldn’t stop such an adversarial AI from exploiting it, or exploiting something else, and taking control. If you understand the mental states (yeah, the structure), then maybe you can figure out how to make an AI that wants to not do that. In other words, it’s not sufficient, and probably not necessary / not a priority.
This really clicked for me. I don’t blame you for making up the term because, although I can see the theory and examples of papers in that topic, I can’t think of a unifying term that isn’t horrendously broad (e.g. molecular ecology).
Ok. What would this theory look like and how would it cash out into real-world consequences?
This is a derail. I can know that something won't work without knowing what would work. I don't claim to know something that would work. If you want my partial thoughts, some of them are here: https://tsvibt.blogspot.com/2023/09/a-hermeneutic-net-for-agency.html
In general, there's more feedback available at the level of "philosophy of mind" than is appreciated.
I think I am asking a very fair question. What is the theory of change of your philosophy of mind cashing out into something with real-world consequences? I.e. a training technique? Design principles? A piece of math? Etc.
All of those, sure? First you understand, then you know what to do. This is a bad way to do peacetime science, but seems more hopeful for:
cruel deadline,
requires understanding as-yet-unconceived aspects of Mind.
No, you're derailing from the topic, which is the fact that the field of alignment keeps failing to even try to avoid / address major partial-consensus defeaters to alignment.
I'm confused why you are so confident in these "defeaters", by which I gather you mean objections/counterarguments to certain lines of attack on the alignment problem.
E.g. I doubt it would be good if the alignment community would outlaw mechinterp/SLT/neuroscience just because of some vague intuition that they don't operate at the right abstraction.
Certainly, the right level of abstraction is a crucial concern, but I don't think progress on this question will be made by blanket dismissals. People in these fields understand very well the problem you are pointing towards. Many people are thinking deeply about how to resolve this issue.
More than any one defeater, I’m confident that most people in the alignment field don’t understand the defeaters. Why? I mean, from talking to many of them, and from their choices of research.
People in these fields understand very well the problem you are pointing towards.
I don’t believe you.
if the alignment community would outlaw mechinterp/SLT/neuroscience
This is an insane strawman. Why are you strawmanning what I’m saying?
I don't think progress on this question will be made by blanket dismissals
Progress could only be made by understanding the problems, which can only be done by stating the problems, which you're calling "blanket dismissals".
Okay seems like the commentariat agrees I am too combative. I apologize if you feel strawmanned.
Feels like we got a bit stuck. When you say "defeater" what I hear is a very confident blanket dismissal. Maybe that's not what you have in mind.
Defeater, in my mind, is a failure mode which if you don’t address you will not succeed at aligning sufficiently powerful systems.[1] It does not mean work outside of that focused on them is useless, but at some point you have to deal with the defeaters, and if the vast majority of people working towards alignment don’t get them clearly, and the people who do get them claim we’re nowhere near on track to find a way to beat the defeaters, then that is a scary situation.
This is true even if some of the work being done by people unaware of the defeaters is not useless, e.g. maybe it is successfully averting earlier forms of doom than the ones that require routing around the defeaters.
Not best considered as an argument against specific lines of attack, but as a problem which if unsolved leads inevitably to doom. People with a strong grok of a bunch of these often think that way more timelines are lost to “we didn’t solve these defeaters” than the problems being even plausibly addressed by the class of work being done by most of the field. This does unfortunately make it get used as (and feel like) an argument against those approaches by people who don’t and don’t claim to understand those approaches, but that’s not the generator or important nature of it.
I say "AI x-safety" and "AI x-safety technical research". I potentially cut the "x-" to just "AI safety" or "AI safety technical research".
Alternative: "AI x-derisking"
"AI x-safety" seems ok. The "x-" is a bit opaque, and "safety" is vague, but I'll try this as my default.
(Including "technical" to me would exclude things like public advocacy.)
Yeah, I meant that I use "AI x-safety" to refer to the field overall and "AI x-safety technical research" to specifically refer to technical research in that field (e.g. alignment research).
(Sorry about not making this clear.)
I’ve often preferred a frame of ‘catastrophe avoidance’ over a frame of x-risk. This has a possible downside of people underfeeling the magnitude of risk, but also an upside of IMO feeling way more plausible. I think it’s useful to not need to win specific arguments about extinction, and also to not have some of the existential/extinction conflation happening in ‘x-’.
FWIW this seems overall highly obfuscatory to me. Catastrophic clearly includes things like “A bank loses $500M” and that’s not remotely the same as an existential catastrophe.
It’s much more the same than a lot of prosaic safety though, right?
Let me put it this way: If an AI can’t achieve catastrophe on that order of magnitude, it also probably cannot do something truly existential.
One of the issues this runs into is if a misaligned AI is playing possum, and so doesn't attempt lesser catastrophes until it can pull off a true takeover. I nonetheless think this framing points generally at the right type of work (understood that others may disagree, of course).
Not confident, but I think that "AIs that cause your civilization problems" and "AIs that overthrow your civilization" may be qualitatively different kinds of AIs. Regardless, existential threats are the most important thing here, and we just have a short term ('x-risk') that refers to that work.
And anyway I think the 'catastrophic' term is already being used to obfuscate, as Anthropic uses it exclusively on their website / in their papers, literally never talking about extinction or disempowerment[1], and we shouldn't let them get away with that by also adopting their worse terminology.
[1] And they use the term 'existential' 3 times in oblique ways that barely count.
Yes—the word ‘global’ is a minimum necessary qualification for referring to catastrophes of the type we plausibly care about—and even then, it is not always clear that something like COVID-19 was too small an event to qualify.
How about "AI outcomes"
Insufficiently catchy
perhaps. but my reasoning is something like -
better than “alignment”: what’s being aligned? outcomes should be (citation needed)
better than “ethics”: how does one act ethically? by producing good outcomes (citation needed).
better than “notkilleveryoneism”: I actually would prefer everyone dying now to everyone being tortured for a million years and then dying, for example, and I can come up with many other counterexamples—not dying is not the problem, achieving good things is the problem.
might not work for deontologists. that seems fine to me, I float somewhere between virtue ethics and utilitarianism anyway.
perhaps there are more catchy words that could be used, though. hope to see someone suggest one someday.
[After I wrote down the thing, I became more uncertain about how much weight to give to it. Still, I think it’s a valid consideration to have on your list of considerations.]
“AI alignment”, “AI safety”, “AI (X-)risk”, “AInotkilleveryoneism”, “AI ethics” came to be associated with somewhat specific categories of issues. When somebody says “we should work (or invest more or spend more) on AI {alignment,safety,X-risk,notkilleveryoneism,ethics}”, they communicate that they are concerned about those issues and think that deliberate work on addressing those issues is required or otherwise those issues are probably not going to be addressed (to a sufficient extent, within relevant time, &c.).
“AI outcomes” is even broader/[more inclusive] than any of the above (the only step left to broaden it even further would be perhaps to say “work on AI being good” or, in the other direction, work on “technology/innovation outcomes”) and/but also waters down the issue even more. Now you’re saying “AI is not going to be (sufficiently) good by default (with various AI outcomes people having very different ideas about what makes AI likely not (sufficiently) good by default)”.
It feels like we’re moving in the direction of broadening our scope of consideration to (1) ensure we’re not missing anything, and (2) facilitate coalition building (moral trade?). While this is valid, it risks (1) failing to operate on the/an appropriate level of abstraction, and (2) diluting our stated concerns so much that coalition building becomes too difficult because different people/groups endorsing stated concerns have their own interpretations/beliefs/value systems. (Something something find an optimum (but also be ready and willing to update where you think the optimum lies when situation changes)?)
but how would we do high intensity, highly focused research on something intentionally restructured to be an “AI outcomes” research question? I don’t think this is pointless—agency research might naturally talk about outcomes in a way that is general across a variety of people’s concerns. In particular, ethics and alignment seem like they’re an unnatural split, and outcomes seems like a refactor that could select important problems from both AI autonomy risks and human agency risks. I have more specific threads I could talk about.
Why do you think it’s uninformed? John specifically says that he’s taking “this work is trash” as background and not trying to convince anyone who disagrees. It seems like because he doesn’t try, you assume he doesn’t have an argument?
it risks unnecessarily antagonizing people
I kinda think it was necessary. (In that, the thing ~needed to be written and “you should have written this with a lot less antagonism” is not a reasonable ask.)
1) “there are many worlds in which it is too late or fundamentally unable to deliver on its promise while prosaic alignment ideas do. And in worlds in which theory does bear fruit”—Yudkowsky had a post somewhere about you only getting to do one instance of deciding to act as if the world was like X. Otherwise you’re no longer affecting our actual reality. I’m not describing this well at all, but I found the initial point quite persuasive.
2) Highly relevant LW post & concept: The Tale of Alice Almost: Strategies for Dealing With Pretty Good People. People like Yudkowsky and johnswentworth think that vanishingly few people are doing something that’s genuinely helpful for reducing x-risk, and most people are doing things that are useless at best or actively harmful (by increasing capabilities) at worst. So how should they act towards those people? Well, as per the post, that depends on the specific goal:
Suppose you value some virtue V and you want to encourage people to be better at it. Suppose also you are something of a “thought leader” or “public intellectual” — you have some ability to influence the culture around you through speech or writing.
Suppose Alice Almost is much more V-virtuous than the average person — say, she’s in the top one percent of the population at the practice of V. But she’s still exhibited some clear-cut failures of V. She’s almost V-virtuous, but not quite.
How should you engage with Alice in discourse, and how should you talk about Alice, if your goal is to get people to be more V-virtuous?
Well, it depends on what your specific goal is.
...
What if Alice is Diluting Community Values?
Now, what if Alice Almost is the one trying to expand community membership to include people lower in V-virtue … and you don’t agree with that?
Now, Alice is your opponent.
In all the previous cases, the worst Alice did was drag down the community’s median V level, either directly or by being a role model for others. But we had no reason to suppose she was optimizing for lowering the median V level of the community. Once Alice is trying to “popularize” or “expand” the community, that changes. She’s actively trying to lower median V in your community — that is, she’s optimizing for the opposite of what you want.
The mainstream wins the war of ideas by default. So if you think everyone dies if the mainstream wins, then you must argue against the mainstream, right?
There’s plenty to criticise about mainstream alignment research
I’m curious what you think John’s valid criticisms are. His piece is so hyperbolic that I have to consider all arguments presented there somewhat suspect by default.
Edit: Clearly people disagree with this sentiment. I invite (and will strongly upvote) strong rebuttals.