Alignment is not all you need. But that doesn’t mean you don’t need alignment.
One of the fairytales I remember reading in my childhood is “The Three Sillies”. The story is about a farmer who encounters three episodes of human silliness, but it is framed by one more story of silliness: his wife is despondent because there is an axe hanging in their cottage, and she thinks that if they have a son, he will walk underneath the axe and it will fall on his head.
The frame story was much more memorable to me than any of the “body” stories, and I randomly remember this story much more often than any other fairytale I read at the age I read fairytales. I think the reason for this is that the “hanging axe” worry is a vibe very familiar from my family and friend circle, and more generally a particular kind of intellectual neuroticism that I encounter all the time, that is terrified of incomplete control or understanding.
I really like the rationalist/EA ecosphere because of its emphasis on the solvability of problems like this: noticing situations where you can just approach the problem, taking down the axe. However, a baseline of intellectual neuroticism persists (after all you wouldn’t expect otherwise from a group of people who pull smoke alarms on pandemics and existential threats that others don’t notice). Sometimes it’s harmless or even beneficial. But a kind of neuroticism in the community that bothers me, and seems counterproductive, is a certain “do it perfectly or you’re screwed” perfectionism that pervades a lot of discussions. (This is also familiar to me from my time as a mathematician: I’ve had discussions with very intelligent and pragmatic friends who rejected even the most basic experimentally confirmed facts of physics because “they aren’t rigorously proven”.)
A particular train of discussion that annoyed me in this vein was the series of responses to Raemon’s “preplanning and baba is you” post. The initial post I think makes a nice point—it suggests as an experiment trying to solve levels of a move-based logic game by pre-planning every step in advance, and points out that this is hard. Various people tried this experiment and found that it’s hard. This was presented as an issue in solving alignment, in worlds where “we get one shot”. But what annoyed me was the takeaway.
I think one of the great things about the intellectual vibe in the (extended) LW and EA communities is the idea that “you have more ways to solve problems than you think”. However, there is a particular virtue-signaling class of problems where trying to find shortcuts or alternatives is frowned upon and the only accepted approach is “trying harder” (another generalized intellectual current in the LW-osphere that I strongly dislike).
Back to the “Baba is you” experiment. The best takeaway, I think, is that solving complex problems in one shot, at least without superhuman assistance, is basically impossible. So we should avoid situations where we need to do this and work towards making sure such situations don’t arise (and we should just give up on trying to make progress in worlds where we get absolutely no new insights before the do-or-die step of making AGI). Attempts at one-shotting tend to be not only silly but counterproductive: the “graveyard” of failed idealistic movements is chock-full of wannabe Hari Seldons who believe that they have found the “perfect solution”, and are willing to sacrifice everything to realize their grand vision.
This doesn’t mean we need to give up, or only work on unambitious, practical applications. But it does mean that we have to admit that things can be useful to work on in expectation before we have a “complete story for how they save the world”.
Note that what is being advocated here is not an “anything goes” mentality. I certainly think that AI safety research can be too abstract, too removed from any realistic application in any world. But there is a large spectrum of possibilities between “fully plan how you will solve a complex logic game before trying anything” and “make random jerky moves because they ‘feel right’”.
I’m writing this in response to Adam Jones’ article on AI safety content. I like a lot of the suggestions. But I think the section on alignment plans suffers from the “axe” fallacy that I claim is somewhat endemic here. Here’s the relevant quote:
For the last few weeks, I’ve been working on trying to find plans for AI safety. They should cover the whole problem, including the major hurdles after intent alignment. Unfortunately, this has not gone well—my rough conclusion is that there aren’t any very clear and well publicised plans (or even very plausible stories) for making this go well. (More context on some of this work can be found in BlueDot Impact’s AI safety strategist job posting).
(emphasis mine).
I strongly disagree with this being a good thing to do!
We’re not going to have a good, end-to-end plan about how to save the world from AGI. Even now, with ever more impressive and scary AIs becoming commonplace, we have very little idea what AGI will look like, what kinds of misalignment it will have, or where the hard bits of checking it for intent and value alignment will be. Trying to make extensive end-to-end plans can be useful, but it can also lead to a strong streetlight effect: we’ll be overcommitting to current understanding and current frames of thought (in an alignment community that is growing and integrating new ideas at an exponential rate measured in months, not years).
Don’t get me wrong. I think it’s valuable to try to plan things where our current understanding is likely to at least partially persist: how AI will interface with government, general questions of scaling and rough models of future development. But we should also understand that our map has lots of blanks, especially when we get down to thinking about what we will understand in the future. What kinds of worrying behaviors will turn out to be relevant and which ones will be silly in retrospect? What kinds of guarantees and theoretical foundations will our understanding of AI encompass? We really don’t know, and trying to chart a course through only the parts of the map that are currently filled out is an extremely limited way of looking at things.
So instead of trying to solve the alignment problem end to end, what I think we should be doing is:
getting a variety of good, rough frames on how the future of AI might go
thinking about how these will integrate with human systems like government, industry, etc.
understanding more things, to build better models in the future.
I think the last point is crucial, and should be what modern alignment and interpretability are focused on. We really do understand a lot more about AI than we did a few years ago (I’m planning a post on this). And we’ll understand more still. But we don’t know what this understanding will be. We don’t know how it will integrate with existing and emergent actors and incentives. So instead of trying to one-shot the game and write an ab initio plan for how work on quantifying creativity in generative vision models will lead to the world being saved, I think there is a lot of room to just do good research. Fill in the blank patches on that map before charting a definitive course through it. Sure, maybe don’t waste time on the patches in the far corners which are too abstract or speculative or involve too much backchaining. But also don’t try to predict all the axes that will be on the wall in the future before looking more carefully at a specific, potentially interesting, axe.
FYI, I think by the time I wrote Optimistic Assumptions, Longterm Planning, and “Cope”, I had updated on the things you criticize about it here (but I had started writing it a while ago from a different frame and there is something disjointed about it)
But, like, I did mean both halves of this seriously:
I think you should be scared about this, if you’re the sort of theoretical researcher who’s trying to cut at the hardest parts of the alignment problem (whose feedback loops are weak or nonexistent)
I think you should be scared about this, if you’re the sort of Prosaic ML researcher who does have a bunch of tempting feedback loops for current generation ML, but a) it’s really not clear whether or how those apply to aligning superintelligent agents, b) many of those feedback loops also basically translate into enhancing AI capabilities and moving us toward a more dangerous world.
...
Re:
For the last few weeks, I’ve been working on trying to find plans for AI safety. They should cover the whole problem, including the major hurdles after intent alignment.
I strongly disagree with this being a good thing to do! We’re not going to have a good, end-to-end plan about how to save the world from AGI.
I think in some sense I agree with you – the actual real plans won’t be end-to-end. And I think I agree with you about some kind of neuroticism that unhelpfully bleeds through a lot of rationalist work. (Maybe in particular: actual real solutions to things tend to be a lot messier than the beautiful code/math/coordination-frameworks an autistic idealist dreams up)
But, there’s still something like “plans are worthless, but planning is essential.” I think you should aim for the standard of “you have a clear story for how your plan fits into something that solves the hard parts of the problem.” (or, we need way more people doing that sort of thing, since most people aren’t really doing it at all)
Some ways that I think about End to End Planning (and metastrategy more generally)
Because there are multiple failure modes, I treat myself as having multiple constraints I have to satisfy:
My plans should backchain from solving the key problems I think we ultimately need to solve
My plans should forward chain through tractable near-term goals with at least okay-ish feedback loops. (If the okay-ish feedback loops don’t exist yet, try to be inventing them. Although don’t follow that off a cliff either – I was intrigued by Wentworth’s recent note that overly focusing on feedback loops led him to predictably waste some time)
Ship something to external people, fairly regularly
Be Wholesome (that is to say, when I look at the whole of what I’m doing it feels healthy, not like I’ve accidentally min-maxed my way into some brittle extreme corner of optimization space)
And for end to end planning, have a plan for...
up through the end of my current OODA loop
(maybe up through a second OODA loop if I have a strong guess for how the first OODA loop goes)
as concrete a plan as I can, assuming no major updates from the first OODA loop, up through the end of the agenda.
as concrete a visualization of the followup steps after my plan ends, for how it goes on to positively impact the world.
End to End plans don’t mean you don’t need to find better feedback loops or pivot. You should plan that into the plan (and also expect to be surprised about it anyway). But, I think if you don’t concretely visualize how it fits together you’re likely to go down some predictably wasteful paths.
Thanks for the context! I didn’t follow this discourse very closely, but I think your “optimistic assumptions” post wasn’t the main offender—it’s reasonable to say that “it’s suspicious when people are bad at backchaining but think they’re good at backchaining, or when their job depends on backchaining more than they are able to”. I seem to remember reading some responses/related posts that I had more issues with, where the takeaway was explicitly that “alignment researchers should try harder at backchaining and one-shotting baba-is-you-like problems because that’s the most important thing”, instead of the more obvious but less rationalism-vibed takeaway of “you must (if at all possible) avoid situations where you have to one-shot complicated games”.
I think if I’m reading you correctly, we’re largely in agreement. All plan-making and game-playing depends on some amount of backchaining/one-shot prediction. And there is a part of doing science that looks a bit like this. But there are ways of getting around having to brute-force this by noticing regularities and developing intuitions, taking “explore” directions in explore-exploit tradeoffs, etc. -- this is sort of the whole point of RL, for example.
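(Purely as an illustration of that explore-exploit point, and not something taken from any of the posts above, here is a minimal epsilon-greedy bandit sketch; all names and parameters are made up. The occasional random “explore” pulls are what let the agent discover which arm is actually good, rather than committing one-shot to its initial guess.)

```python
import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=1000, seed=0):
    """Toy epsilon-greedy agent on a multi-armed bandit (illustrative only)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how many times each arm was pulled
    estimates = [0.0] * n_arms   # running empirical mean reward per arm

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: try a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)  # noisy payoff
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates

if __name__ == "__main__":
    # Three hypothetical arms; the agent starts with no idea which is best.
    print(epsilon_greedy_bandit([0.2, 0.5, 0.9]))
```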
I also very much like the points you made about plans. I’d love to understand more about your OODA loop points, but I haven’t yet been able to find a good “layperson” operationalization of OODA that’s not competence porn (in general, I find “sequential problem-solving” stuff coming from pilot training useful as inspiration, but not directly applicable because the context is so different—and I’d love a good reference here that addresses this carefully).
A vaguely related picture I had in my mind when thinking about the Baba is you discourse (and writing this shortform) comes from being a competitive chess player in middle school. Namely, in middle school competitions and in friendly training games in chess club, people make a big deal out of the “touch move” rule: you’re not allowed to play around with pieces when planning, and you need to form a plan entirely in your head. But when you watch a friendly game between two high-level chess players, they will constantly play around with each other’s pieces to show each other positions several moves into the game that would result from various choices. To someone on a high level (higher than I ever got to), there is very little difference between playing out a game on the board and playing it out in your head, but it’s helpful to move pieces around to communicate your ideas to your partner.

I think that (even with a scratchpad), there’s a component of this here: there is a kind of qualitative difference between “learning to track hypothetical positions well” / “learning results” / “being good at memorization and flashcards” vs. having better intuitions and ideas. A lot of learning a field / being a novice in anything consists of being good at the former. But I think “science”, as it were, progresses by people getting good at the latter.

Here I actually don’t think that the “do better” vibe corresponds to not being good at generating new ideas: rather, I think that rationalists (correctly) cultivate a novice mentality, where they constantly learn new skills and approach new areas, and where “train an area of your brain to track sequential behaviors well” (analogous to “mentally chain several moves forward in Baba is you”) is the core skill. And then when rationalists do develop this area and start running “have and test your ideas and intuitions in this environment” loops, these are harder to communicate/analyze, and so their importance sort of falls off in the discourse (while on an individual level people are often quite good at these—in fact, the very skill of “communicating well about sequential thinking” is, I think, something that many rationalists have developed deep competence in).
I think the reason for this is that the “hanging axe” worry is a vibe very familiar from my family and friend circle, and more generally a particular kind of intellectual neuroticism that I encounter all the time, that is terrified of incomplete control or understanding.
Something like this is a big reason why I’m not a fan of MIRI, because I think this sort of neuroticism is at least somewhat encouraged by that group.
Also, remember that the current LW community is selected for scrupulosity and neuroticism, which IMO is not that good for solving a lot of problems:
Richard Ngo and John Maxwell illustrate it here:
https://www.lesswrong.com/posts/uM6mENiJi2pNPpdnC/takeaways-from-one-year-of-lockdown#g4BJEqLdvzgsjngX2
https://www.lesswrong.com/posts/uM6mENiJi2pNPpdnC/takeaways-from-one-year-of-lockdown#3GqvMFTNdCqgfRNKZ
Cool! I haven’t seen these; good to have them to point to (and I’m glad that Richard Ngo has thought about this).