Rob Bensinger comments on Alignment Research = Conceptual Alignment Research + Applied Alignment Research

Rob Bensinger 31 Aug 2021 1:03 UTC
LW: 14 AF: 6
AF
Nowadays, the vast majority of the field disagree that there’s any hope of formalizing all of philosophy and then just implementing that to get an aligned AGI.
What do you mean by “formalizing all of philosophy”? I don’t see ‘From Philosophy to Math to Engineering’ as arguing that we should turn all of philosophy into math (and I don’t even see the relevance of this to Friendly AI). It’s just claiming that FAI research begins with fuzzy informal ideas/puzzles/goals (like the sort you might see philosophers debate), then tries to move in more formal directions.
I imagine part of Luke’s point in writing the post was to push back against the temptation to see formal and informal approaches as opposed (‘MIRI does informal stuff, so it must not like formalisms’), and to push back against the idea that analytic philosophers ‘own’ whatever topics they happen to have historically discussed.
Conceptual alignment research isn’t just turning philosophy into mathematics. This is a failure mode I warned against recently: what matters is deconfusion, not formalization.
Pearl’s causality (the main example of “turning philosophy into mathematics” Luke uses) was an example of achieving deconfusion about causality, not an example of ‘merely formalizing’ something. I agree that calling this deconfusion is a clearer way of pointing at the thing, though!
- dxu 31 Aug 2021 4:08 UTC
  4 points
  Parent
  Perhaps deconfusion and formalization are not identical, but I’m partial to the notion that if you’ve truly deconfused something (meaning you, personally, are no longer confused about that thing), it should not take much further effort to formalize the thing in question. (Examples of this include TurnTrout’s sequence on power-seeking, John Wentworth’s sequence on abstraction, and Scott Garrabrant’s sequence on Cartesian frames.)
  
  So, although the path to deconfusing some concept may not involve formalizing that concept, being able to formalize the concept is necessary: if you find yourself (for some reason or other) thinking that you’ve deconfused something, but are nonetheless unable to produce a formalization of it, that’s a warning sign that you may not actually be less confused.
  - adamShimi 31 Aug 2021 9:46 UTC
    2 points
    Parent
    Perhaps deconfusion and formalization are not identical, but I’m partial to the notion that if you’ve truly deconfused something (meaning you, personally, are no longer confused about that thing), it should not take much further effort to formalize the thing in question.
    So I have the perspective that deconfusion requires an application. And this application in return constrains what count as a successful deconfusion. There are definitely applications for which successful deconfusion requires a formalization: e.g if you want to implement the concept/use it in a program. But I think it’s important to point out that some applications (maybe most?) don’t require that level of formalization.
    (Examples of this include TurnTrout’s sequence on power-seeking, John Wentworth’s sequence on abstraction, and Scott Garrabrant’s sequence on Cartesian frames.)
    What’s interesting about these examples is that only the abstraction case is considered to have completely deconfused the concept it looked at. I personally think that Alex’s power-seeking formalization is very promising and captures important aspects of the process, but it’s not yet at the point where we can unambiguously apply it to all convergent subgoals for example. Similarly, Cartesian frames sound promising but it’s not clear that they actually completely deconfuse the concept they focus on.
    And that’s exactly what I want to point out: although from the outside these work look more relevant/important/serious because they are formal, the actual value comes from the back-and-forth/grounding in the weird informal intuitions, a fact that I think John, Alex and Scott would agree with.
    So, although the path to deconfusing some concept may not involve formalizing that concept, being able to formalize the concept is necessary: if you find yourself (for some reason or other) thinking that you’ve deconfused something, but are nonetheless unable to produce a formalization of it, that’s a warning sign that you may not actually be less confused.
    But that doesn’t apply to so many of the valuable concepts and ideas we’ve come up with in alignment, like deception, myopia, HCH, universality, orthogonality thesis, convergent subgoals,… Once again the necessary condition isn’t one for value or usefulness, and that seems so overlooked around here. If you can formalize without losing the grounding, by all means do so because that’s a step towards more concrete concepts. Yet not being able to ground everything you think about formally doesn’t mean you’re not making progress in deconfusing them, nor that they can’t be useful at that less than perfectly formal stage.
    - dxu 31 Aug 2021 18:46 UTC
      8 points
      Parent
      I think it’s becoming less clear to me what you mean by deconfusion. In particular, I don’t know what to make of the following claims:
      
      So I have the perspective that deconfusion requires an application. And this application in return constrains what count as a successful deconfusion. [...]
      
      What’s interesting about these examples is that only the abstraction case is considered to have completely deconfused the concept it looked at. [...]
      
      Similarly, Cartesian frames sound promising but it’s not clear that they actually completely deconfuse the concept they focus on. [...]
      
      Yet not being able to ground everything you think about formally doesn’t mean you’re not making progress in deconfusing them, nor that they can’t be useful at that less than perfectly formal stage.
      
      I don’t [presently] think these claims dovetail with my understanding of deconfusion. My [present] understanding of deconfusion is that (loosely speaking) it’s a process for taking ideas from the [fuzzy, intuitive, possibly ill-defined] sub-cluster and moving them to the [concrete, grounded, well-specified] sub-cluster.
      
      I don’t think this process, as I described it, entails having an application in mind. (Perhaps I’m also misunderstanding what you mean by application!) It seems to me that, although many attempts at deconfusion-style alignment research (such as the three examples I gave in my previous comment) might be ultimately said to have been motivated by the “application” of aligning superhuman agents, in practice they happened more because somebody noticed that whenever some word/phrase/cluster-of-related-words-and-phrases came up in conversation, people would talk about them in conflicting ways, use/abuse contradictory intuitions while talking about them, and just in general (to borrow Nate’s words) “continuously accidentally spout nonsense”.
      
      But perhaps from your perspective, that kind of thing also counts as an application, e.g. the application of “making us able to talk about the thing we actually care about”. If so, then:
      
      I agree that it’s possible to make progress towards this goal without performing steps that look like formalization. (I would characterize this as the “philosophy” part of Luke’s post about going from philosophy to math to engineering.)
      
      Conversely, I also agree that it’s possible to perform formalization in a way that doesn’t perfectly capture the essence of “the thing we want to talk about”, or perhaps doesn’t usefully capture it in any sense at all; if one wanted to use unkind words, one could describe the former category as “premature [formalization]”, and the latter category as “unnecessary [formalization]”. (Separately, I also see you as claiming that TurnTrout’s work on power-seeking and Scott’s work on Cartesian frames fall somewhere in the “premature” category, but this may simply be me putting words in your mouth.)
      
      And perhaps your contention is that there’s too much research being done currently that falls under the second bullet point; or, alternatively, that too many people are pursuing research that falls (or may fall) under the second bullet point, in a way that they (counterfactually) wouldn’t if there were less (implicit) prestige attached to formal research.
      
      If this (or something like it) is your claim, then I don’t think I necessarily disagree; in fact, it’s probably fair to say you’re in a better position to judge than I am, being “closer to the ground”. But I also don’t think this precludes my initial position from being valid, where—having laid the groundwork in the previous two bullet points—I can now characterize my initial position as [establishing the existence of] a bullet point number 3:
      
      A successful, complete deconfusion of a concept will, almost by definition, admit to a natural formalization; if one then goes to the further step of producing such a formalism, it will be evident that the essence of the original concept is present in said formalism.
      
      (Or, to borrow Eliezer’s words this time, “Do you understand [the concept/property/attribute] well enough to write a computer program that has it?”)
      
      And yes, in a certain sense perhaps there might be no point to writing a computer program with the [concept/property/attribute] in question, because such a computer program wouldn’t do anything useful. But in another sense, there is a point: the point isn’t to produce a useful computer program, but to check whether your understanding has actually reached the level you think it has. If one further takes the position (as I do) that such checks are useful and necessary, then [replacing “writing a computer program” with “producing a formalism”] I claim that many productive lines of deconfusion research will in fact produce formalisms that look “premature” or even “unnecessary”, as a part of the process of checking the researchers’ understanding.
      
      I think that about sums up [the part of] the disagreement [that I currently know how to verbalize]. I’m curious to see whether you agree this is a valid summary; let me know if (abstractly) you think I’ve been using the term “deconfusion” differently from you, or (concretely) if you disagree with anything I said about “my” version of deconfusion.
      - adamShimi 31 Aug 2021 22:41 UTC
        2 points
        Parent
        I think this is an excellent summary, and I agree with almost all of it. My only claim is that it’s easy to think that deconfusion is only useful when it results in formalization (when it is complete by your sense) but that isn’t actually true, especially at the point where we are in the field. And I’m simply pointing out that for some concrete applications (reasons for which we want to be able to use a concept without spouting nonsense), going to this complete formalization isn’t necessary.
        But yeah, if you have totally deconfused a concept, you should be able to write it down as a mathematical model/program.
- adamShimi 31 Aug 2021 8:06 UTC
  LW: 2 AF: 1
  AF Parent
  Thanks for the comment!
  What do you mean by “formalizing all of philosophy”? I don’t see ‘From Philosophy to Math to Engineering’ as arguing that we should turn all of philosophy into math (and I don’t even see the relevance of this to Friendly AI). It’s just claiming that FAI research begins with fuzzy informal ideas/puzzles/goals (like the sort you might see philosophers debate), then tries to move in a more formal directions.
  I abused the hyperbole in that case. What I was pointing out is the impression that old-school MIRI (a lot of the HRAD work) thinks that solving the alignment problem requires deconfusing every related philosophical problem in terms of maths, and then implementing that. Such a view doesn’t seem shared by many in the community for a couple of reasons:
  - Some doubt that the level of mathematical formalization required is even possible
  - If timelines are quite short, we probably don’t have the time to do all that.
  - If AGI turns out to be prosaic AGI (which sounds like one of the best bet to make now), then what matters is aligning neural nets, not finding a way of write down a perfectly aligned AGI from scratch (related to the previous point because it seems improbable that the formalization will be finished before neural nets reach AGI, in such a prosaic setting).
  I imagine part of Luke’s point in writing the post was to push back against the temptation to see formal and informal approaches as opposed (‘MIRI does informal stuff, so it must not like formalisms’), and to push back against the idea that analytic philosophers ‘own’ whatever topics they happen to have historically discussed.
  Thanks for that clarification, it makes sense to me. That being said, multiple people (both me a couple of years ago and people I mentor/talk too) seem to have been pushed by MIRI’s work in general to think that they need extremely high-level of maths and formalism to even contribute to alignment, which I disagree with, and apparently Luke and you do too.
  Reading the linked post, what jumps to me is the focus that friendly AI is about turning philosophy into maths, and I think that’s the culprit. That is part of the process, important one and great if we manage it. But expressing and thinking through problems of alignment at a less formal level is still very useful and important; that’s how we have most of the big insights and arguments in the field.
  Pearl’s causality (the main example of “turning philosophy into mathematics” Luke uses) was an example of achieving deconfusion about causality, not an example of ‘merely formalizing’ something. I agree that calling this deconfusion is a clearer way of pointing at the thing, though!
  Funnily, it sounds like MIRI itself (specifically Scott) has call that into doubt with Finite Factored Sets. This work isn’t throwing away all of Pearl’s work, but it argues that some part where missing/some assumptions unwarranted. Even a case of deconfusion as grounded than Pearl’s isn’t necessary the right abstraction/deconfusion.
  The subtlety I’m trying to point out: actually formally deconfusing is really hard, in part because the formalization we come up with seem so much more serious and research-like than the fuzzy intuition underlying it all. And so I found it really useful to always emphasize that what we actually care about is the intuition/weird philosophical thinking, and the mathematical model are just tools to get clearer about the former. Which I expect is obvious for you and Luke, but isn’t for so many others (me from a couple of years ago included).
  - Rob Bensinger 31 Aug 2021 19:19 UTC
    LW: 6 AF: 3
    AF Parent
    Cool, that makes sense!
    I abused the hyperbole in that case. What I was pointing out is the impression that old-school MIRI (a lot of the HRAD work) thinks that solving the alignment problem requires deconfusing every related philosophical problem in terms of maths, and then implementing that. Such a view doesn’t seem shared by many in the community for a couple of reasons:
    I’m still not totally clear here about which parts were “hyperbole” vs. endorsed. You say that people’s “impression” was that MIRI wanted to deconfuse “every related philosophical problem”, which suggests to me that you think there’s some gap between the impression and reality. But then you say “such a view doesn’t seem shared by many in the community” (as though the “impression” is an actual past-MIRI-view others rejected, rather than a misunderstanding).
    HRAD has always been about deconfusion (though I agree we did a terrible job of articulating this), not about trying to solve all of philosophy or “write down a perfectly aligned AGI from scratch”. The spirit wasn’t ‘we should dutifully work on these problems because they’re Important-sounding and Philosophical’; from my perspective, it was more like ‘we tried to write down a sketch of how to align an AGI, and immediately these dumb issues with self-reference and counterfactuals and stuff cropped up, so we tried to get those out of the way fast so we could go back to sketching how to aim an AGI at intended targets’. As Eliezer put it,
    It was a dumb kind of obstacle to run into—or at least it seemed that way at that time. It seemed like if you could get a textbook from 200 years later, there would be one line of the textbook telling you how to get past that.
    From my perspective, the biggest reason MIRI started diversifying approaches away from our traditional focus was shortening timelines, where we still felt that “conceptual” progress was crucial, and still felt that marginal progress on the Agent Foundations directions would be useful; but we now assigned more probability to ‘there may not be enough time to finish the core AF stuff’, enough to want to put a lot of time into other problems too.
    Actually, I’m not sure how to categorize MIRI’s work using your conceptual vs. applied division. I’d normally assume “conceptual”, because our work is so far away from prosaic alignment; but you also characterize applied alignment research as being about “experimentally testing these ideas [from conceptual alignment]”, which sounds like the 2017-initiated lines of research we described in our 2018 update. If someone is running software experiments to test ideas about “Seeking entirely new low-level foundations for optimization” outside the current ML paradigm, where does that fall?
    If AGI turns out to be prosaic AGI (which sounds like one of the best bet to make now), then what matters is aligning neural nets, not finding a way of write down a perfectly aligned AGI from scratch
    Prosaic AGI alignment and “write down a perfectly aligned AGI from scratch” both seem super doomed to me, compared to approaches that are neither prosaic nor perfectly-neat-and-tidy. Where does research like that fall?
    - adamShimi 31 Aug 2021 22:33 UTC
      LW: 4 AF: 2
      AF Parent
      HRAD has always been about deconfusion (though I agree we did a terrible job of articulating this), not about trying to solve all of philosophy or “write down a perfectly aligned AGI from scratch”. The spirit wasn’t ‘we should dutifully work on these problems because they’re Important-sounding and Philosophical’; from my perspective, it was more like ‘we tried to write down a sketch of how to align an AGI, and immediately these dumb issues with self-reference and counterfactuals and stuff cropped up, so we tried to get those out of the way fast so we could go back to sketching how to aim an AGI at intended targets’.
      I think that the issue is that I have a mental model of this process you describe that summarize it as “you need to solve a lot of philosophical issues for it to work”, and so that’s what I get by default when I query for that agenda. Still, I always had the impression that this line of work focused more on how to build a perfectly rational AGI than on building an aligned one. Can you explain me why that’s inaccurate?
      From my perspective, the biggest reason MIRI started diversifying approaches away from our traditional focus was shortening timelines, where we still felt that “conceptual” progress was crucial, and still felt that marginal progress on the Agent Foundations directions would be useful; but we now assigned more probability to ‘there may not be enough time to finish the core AF stuff’, enough to want to put a lot of time into other problems too.
      Yeah, I think this is a pretty common perspective on that work from outside MIRI. That’s my take (that there isn’t enough time to solve all of the necessary components) and the one I’ve seen people use in discussing MIRI multiple time.
      Actually, I’m not sure how to categorize MIRI’s work using your conceptual vs. applied division. I’d normally assume “conceptual”, because our work is so far away from prosaic alignment; but you also characterize applied alignment research as being about “experimentally testing these ideas [from conceptual alignment]”, which sounds like the 2017-initiated lines of research we described in our 2018 update. If someone is running software experiments to test ideas about “Seeking entirely new low-level foundations for optimization” outside the current ML paradigm, where does that fall?
      A really important point is that the division isn’t meant to split researchers themselves but research. So the experiment part would be applied alignment research and the rest conceptual alignment research. What’s interesting is that this is a good example of applied alignment research that doesn’t have the benefits I mention of more prosaic applied alignment research: being publishable at big ML/AI conferences, being within an accepted paradigm of modern AI...
      Prosaic AGI alignment and “write down a perfectly aligned AGI from scratch” both seem super doomed to me, compared to approaches that are neither prosaic nor perfectly-neat-and-tidy. Where does research like that fall?
      I would say that the non-prosaic approaches require at least some conceptual alignment research (because the research can’t be done fully inside current paradigms of ML and AI), but probably encompass some applied research. Maybe Steve’s work is a good example, with a proposal split of two of his posts in this comment.
      - Rob Bensinger 1 Sep 2021 0:38 UTC
        LW: 6 AF: 3
        AF Parent
        OK, thanks for the clarifications!
        Still, I always had the impression that this line of work focused more on how to build a perfectly rational AGI than on building an aligned one. Can you explain me why that’s inaccurate?
        I don’t know what you mean by “perfectly rational AGI”. (Perfect rationality isn’t achievable, rationality-in-general is convergently instrumental, and rationality is insufficient for getting good outcomes. So why would that be the goal?)
        I think of the basic case for HRAD this way:
        We seem to be pretty confused about a lot of aspects of optimization, reasoning, decision-making, etc. (Embedded Agency is talking about more or less the same set of questions as HRAD, just with subsystem alignment added to the mix.)
        If we were less confused, it might be easier to steer toward approaches to AGI that make it easier to do alignment work like ‘understand what cognitive work the system is doing internally’, ‘ensure that none of the system’s compute is being used to solve problems we don’t understand / didn’t intend’, ‘ensure that the amount of quality-adjusted thinking the system is putting into the task at hand is staying within some bound’, etc.
        
        These approaches won’t look like decision theory, but being confused about basic ground-floor things like decision theory is a sign that you’re likely not in an epistemic position to efficiently find such approaches, much like being confused about how/whether chess is computable is a sign that you’re not in a position to efficiently steer toward good chess AI designs.
        What links here?
        Thoughts on AGI safety from the top by jylin04 (2 Feb 2022 20:06 UTC; 36 points)
        Vladimir_Nesov's comment on Chantiel’s Shortform by Chantiel (1 Sep 2021 10:36 UTC; 2 points)
    - Rob Bensinger 31 Aug 2021 20:22 UTC
      LW: 4 AF: 2
      AF Parent
      Maybe what I want is a two-dimensional “prosaic AI vs. novel AI” and “whiteboards vs. code”. Then I can more clearly say that I’m pretty far toward ‘novel AI’ on one dimension (though not as far as I was in 2015), separate from whether I currently think the bigger bottlenecks (now or in the future) are more whiteboard-ish problems vs. more code-ish problems.
      - adamShimi 31 Aug 2021 22:36 UTC
        LW: 2 AF: 1
        AF Parent
        What you propose seems valuable, although not an alternative to my distinction IMO. This 2-D grid is more about what people consider as the most promising way of getting aligned AGI and how to get there, whereas my distinction focuses on separating two different types of research which have very different methods, epistemic standards and needs in terms of field-building.