FYI, I think by the time I wrote Optimistic Assumptions, Longterm Planning, and “Cope”, I had updated on the things you criticize about it here (but I had started writing it a while ago from a different frame, and there is something disjointed about it).
But, like, I did mean both halves of this seriously:
I think you should be scared about this if you’re the sort of theoretical researcher who’s trying to cut at the hardest parts of the alignment problem (whose feedback loops are weak or nonexistent).
I think you should be scared about this if you’re the sort of prosaic ML researcher who does have a bunch of tempting feedback loops for current-generation ML, but a) it’s really not clear whether or how those apply to aligning superintelligent agents, and b) many of those feedback loops also basically translate into enhancing AI capabilities and moving us toward a more dangerous world.
...
Re:
For the last few weeks, I’ve been working on trying to find plans for AI safety. They should cover the whole problem, including the major hurdles after intent alignment.
I strongly disagree with this being a good thing to do! We’re not going to have a good, end-to-end plan about how to save the world from AGI.
I think in some sense I agree with you – the actual real plans won’t be end-to-end. And I think I agree with you about some kind of neuroticism that unhelpfully bleeds through a lot of rationalist work. (Maybe in particular: actual real solutions to things tend to be a lot messier than the beautiful code/math/coordination-frameworks an autistic idealist dreams up)
But there’s still something like “plans are worthless, but planning is essential.” I think you should aim for the standard of “you have a clear story for how your plan fits into something that solves the hard parts of the problem.” (Or, at least, we need way more people doing that sort of thing, since most people aren’t really doing it at all.)
Some ways that I think about End-to-End Planning (and metastrategy more generally):
Because there are multiple failure modes, I treat myself as having multiple constraints I have to satisfy:
My plans should backchain from solving the key problems I think we ultimately need to solve
My plans should forward-chain through tractable near-term goals with at least okay-ish feedback loops. (If the okay-ish feedback loops don’t exist yet, try to invent them. Although don’t follow that off a cliff either – I was intrigued by Wentworth’s recent note that overly focusing on feedback loops led him to predictably waste some time.)
Ship something to external people, fairly regularly
Be Wholesome (that is to say, when I look at the whole of what I’m doing it feels healthy, not like I’ve accidentally min-maxed my way into some brittle extreme corner of optimization space)
And for end-to-end planning, have a plan for...
up through the end of my current OODA loop
(maybe up through a second OODA loop if I have a strong guess for how the first OODA loop goes)
as concrete a plan as I can, assuming no major updates from the first OODA loop, up through the end of the agenda.
as concrete a visualization as I can of the follow-up steps after my plan ends, for how it goes on to positively impact the world.
End-to-end plans don’t mean you don’t need to find better feedback loops or pivot. You should plan that into the plan (and also expect to be surprised about it anyway). But I think if you don’t concretely visualize how it fits together, you’re likely to go down some predictably wasteful paths.
Thanks for the context! I didn’t follow this discourse very closely, but I think your “optimistic assumptions” post wasn’t the main offender—it’s reasonable to say that “it’s suspicious when people are bad at backchaining but think they’re good at backchaining, or when their job depends on more backchaining than they are able to do”. I seem to remember reading some responses/related posts that I had more issues with, where the takeaway was explicitly that “alignment researchers should try harder at backchaining and one-shotting Baba-Is-You-like problems because that’s the most important thing”, instead of the more obvious but less rationalism-vibed takeaway of “you must (if at all possible) avoid situations where you have to one-shot complicated games”.
I think if I’m reading you correctly, we’re largely in agreement. All plan-making and game-playing depends on some amount of backchaining/one-shot prediction. And there is a part of doing science that looks a bit like this. But there are ways of getting around having to brute-force this by noticing regularities and developing intuitions, taking “explore” directions in explore-exploit tradeoffs, etc. – this is sort of the whole point of RL, for example.
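(Not from the original discussion, just a toy illustration of that explore-exploit point: a minimal epsilon-greedy agent on a made-up multi-armed bandit. The function name, arm values, and parameters are all hypothetical; the point is just that reserving a small fraction of moves for exploration lets the agent learn which option is good instead of having to one-shot the answer.)

```python
import random

def epsilon_greedy_bandit(true_means, n_steps=10_000, epsilon=0.1, seed=0):
    """Toy epsilon-greedy bandit: explore with probability epsilon, else exploit."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms          # pulls per arm
    estimates = [0.0] * n_arms     # running mean reward per arm
    total_reward = 0.0

    for _ in range(n_steps):
        if rng.random() < epsilon:                      # explore: random arm
            arm = rng.randrange(n_arms)
        else:                                           # exploit: current best estimate
            arm = max(range(n_arms), key=lambda a: estimates[a])

        reward = rng.gauss(true_means[arm], 1.0)        # noisy payoff (made-up model)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
        total_reward += reward

    return estimates, total_reward

if __name__ == "__main__":
    est, total = epsilon_greedy_bandit([0.2, 0.5, 0.9])
    print("estimated arm values:", [round(e, 2) for e in est])
```

Here epsilon is the knob for how much you explore vs. exploit; set it to 0 and you’re back to betting everything on your initial guess.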
I also very much like the points you made about plans. I’d love to understand more about your OODA loop points, but I haven’t yet been able to find a good “layperson” operationalization of OODA that’s not competence porn (in general, I find “sequential problem-solving” stuff coming from pilot training useful as inspiration, but not directly applicable because the context is so different—and I’d love a good reference here that addresses this carefully).
A vaguely related picture I had in my mind when thinking about the Baba Is You discourse (and writing this shortform) comes from being a competitive chess player in middle school. In middle school competitions and in friendly training games in chess club, people make a big deal out of the “touch-move” rule: you’re not allowed to play around with pieces while planning, and you need to form a plan entirely in your head. But when you watch a friendly game between two high-level chess players, they will constantly move each other’s pieces around to show each other positions several moves into the game that would result from various choices. To someone at a high level (higher than I ever got to), there is very little difference between playing out a game on the board and playing it out in your head, but it’s helpful to move pieces around to communicate your ideas to your partner.

I think that (even with a scratchpad) there’s a component of this here: there is a kind of qualitative difference between “learning to track hypothetical positions well” / “learning results” / “being good at memorization and flashcards” vs. having better intuitions and ideas. A lot of learning a field / being a novice in anything consists of being good at the former. But I think “science”, as it were, progresses by people getting good at the latter. Here I actually don’t think that the “do better” vibe corresponds to not being good at generating new ideas: rather, I think that rationalists (correctly) cultivate a novice mentality, where they constantly learn new skills and approach new areas, and where “train an area of your brain to track sequential behaviors well” (analogous to “mentally chain several moves forward in Baba Is You”) is the core skill. And then when rationalists do develop this area and start running “have and test your ideas and intuitions in this environment” loops, these are harder to communicate and analyze, and so their importance sort of falls off in the discourse (while on an individual level people are often quite good at these—in fact, the very skill of “communicating well about sequential thinking” is something that I think many rationalists have developed deep competence in).