I think a productive way forward (when working on alignment or on other research problems) is to try to identify the hardest concrete difficulties we can understand, and then try to make progress on them. This involves acknowledging that we can’t anticipate all possible problems, but expecting that solving the concrete problems is a useful way to make steps forward and learn general lessons. It involves solving individual challenges, even if none of them will address the whole problem, and even if we have a vague sense that further difficulties will arise.
I would uncharitably summarize this as “let’s just assume that finding a faithful concrete operationalization of the problem is not itself the hard part”. And then, any time finding a faithful concrete operationalization of the problem is itself the hard part, you basically just automatically fail.
Is that… wrong? Am I missing something here? Is there some reason to expect that always working on the legible parts of a problem will somehow induce progress on the illegible parts, even when making-the-illegible-parts-legible is itself “the hard part”? (I mean, just intuitively, I’d expect hacking away at the legible parts to induce some progress on the illegible, but it sounds extremely slow, to the point where it would very plausibly just not converge to solving the illegible parts at all.)
If I had to guess at your model here, I’d guess your intuition is something like “well, trying to make progress without concrete operationalizations is just really hard, it’s too easy to become decoupled from mathematical/physical reality”. To which my response would be “just because it’s hard does not mean we can ignore it and still expect to solve the problem, especially in a reasonable timeframe”. Yes, staying grounded is hard when finding faithful concrete operationalizations is itself the hard part of the problem, but we can’t actually avoid that.
It means not becoming too pessimistic about a direction until we see fairly concretely where it’s stuck, partially because we hope that zooming in on a very concrete case where you get stuck is the main way to eventually make progress.
This is great early on in the process when we don’t yet know what the hard parts are. But pretty quickly, we usually see intuitively-similar bottlenecks coming up again and again. Just ignoring those bottlenecks because we don’t know how to operationalize them yet does not sound like an optimal search-strategy. What we want to do is focus on those intuitions, and figure out generalizable operationalizations of the bottlenecks. That itself is often where the hard work is (especially in alignment). Getting hyper-focused on a single concrete failure mode with a single strategy just results in an operationalization which is too narrow and potentially not relevant to most other strategies; a better approach is to look at intuitively-similar failure modes in a bunch of strategies and try to find an operationalization which unifies them and captures the intuitive pattern.
Similarly, once we have some intuition for where the bottlenecks are, it does seem completely correct to mostly dismiss strategies which are not obviously tackling them in some way, even before the bottlenecks are fully formalized. I mean, maybe spot-check one once in a while, but mostly just ignore such strategies. Otherwise, we just waste a ton of time on strategies which are in fact very likely hopeless.
Uncharitably summarizing again (and hopefully you will correct me if this is inaccurate): it sounds like you want to just not update very much on evidence which we don’t know how to formalize yet. And I’d say this is basically the same mistake as e.g. someone who says we have no idea whether an updated version of a covid vaccine works until there’s been a phase-3 clinical trial with a statistically significant result.
It means not becoming too pessimistic about a direction until we see fairly concretely where it’s stuck, partially because we hope that zooming in on a very concrete case where you get stuck is the main way to eventually make progress.
Also, a separate issue with this: it sounds like this will systematically generate strategies which ignore unknown unknowns. It’s like the exact opposite of security mindset.
I don’t think those are great summaries. I think this is probably some misunderstanding about what ARC is trying to do and about what I mean by “concrete.” In particular, “concrete” doesn’t mean “formalized,” it means more like: you are able to discuss a bunch of concrete examples of the difficulty and why they lead to failures of particular concrete approaches; you are able to point out where the problem will appear in a particular decomposition of the problem, and would revise your picture if that turned out to be wrong; etc.
You write:
But pretty quickly, we usually see intuitively-similar bottlenecks coming up again and again.
I don’t yet have this sense about a “sharp left turn” bottleneck.
I think I would agree with you if we’d looked at a bunch of plausible approaches, convinced ourselves that they would fail, and then introduced the sharp left turn to capture the unifying theme of those failures and to start exploring what’s really going on. At a high level that’s very similar to what ARC is doing day to day: looking at a bunch of approaches to a problem, seeing why they fail, and then trying to understand the nature of the problem so that we can succeed.
But for the sharp left turn I think we basically don’t have examples. Existing alignment strategies fail in much more basic ways, which I’d call “concrete.” We don’t have examples of strategies that avoid concrete difficulties yet still fail for a vague and hard-to-understand reason that we’d summarize as a “sharp left turn.” So I don’t really believe that this difficulty is being abstracted from a pattern of failures.
There can be other ways to learn about problems, and I didn’t think Nate was even saying that this problem is derived from examples of obstructions to potential alignment approaches. I think Nate’s perspective is that he has some pretty good arguments and intuitions about why a sharp left turn will cause novel problems. And so a lot of what I’m saying is that I’m not yet buying it, that I think Nate’s argument has fatal holes in it which are hidden by its vagueness, and that if the arguments are really very important then we should be trying hard to make them more concrete and to address those holes.
Is there some reason to expect that always working on the legible parts of a problem will somehow induce progress on the illegible parts, even when making-the-illegible-parts-legible is itself “the hard part”?
ARC does theoretical work guided by concrete stories about how a proposed AI system could fail; we are focused on the “legible part” insofar as we try to fix failures for which we can tell concrete stories. I’m not quite sure what you mean by “illegible” and so this might just be a miscommunication, but I think this is the relevant sense of “illegible” so I’ll respond briefly to it.
I think we can tell concrete stories about deceptive alignment; about ontology mismatches making it hard or meaningless to “elicit latent knowledge;” about exploitability of humans making debate impossible; and so on. And I think those stories we can tell seem to do a great job of capturing the reasons why we would expect existing alignment approaches to fail. So if we addressed these concrete stories I would feel like we’ve made real progress. That’s a huge part of my optimism about concrete stories.
It feels to me like either we are miscommunicating about what ARC is doing, or you are saying that those concrete difficulties aren’t the really important failures. That even if an alignment approach addressed all of them, it still wouldn’t represent meaningful progress because the true risk is the risk that cannot be named.
One thing you might mean is that “these concrete difficulties are just shadows of a deeper core.” But I think that’s not actually a challenge to ARC’s approach at all, and it’s not that different from my own view. I think that if you have an intuitive sense of a deep problem, then it’s really great to attack specific instantiations of the problem as a way to learn about the deep core. I feel pretty good about this approach, and I think it’s pretty standard in most disciplines that face problems like this (e.g. if you are deeply confused about physics, it’s good to think a lot about the simplest concrete confusing phenomenon and understand it well; if you are confused about how to design algorithms that overcome a conceptual barrier, it’s good to think about the simplest concrete task that requires crossing that barrier; etc.).
Another thing you might mean is that “these concrete difficulties are distractions from a bigger difficulty that emerges at a later step of the plan.” It’s worth noting that ARC really does try to look at the whole plan and pick the step that is most likely to fail. But I do think it would be a problem for our methodology if there is a good argument about why plans will fail, which won’t let us tell a concrete story about what the failure looks like. My position right now is that I don’t see such an argument; I think we have some vague intuitions, and we have a bunch of examples which do correspond to concrete failure stories. I don’t think there are any examples from which to infer the existence of a difficulty that can’t be captured in concrete stories, and I’m not yet aware of arguments that I find persuasive without any examples. But I’m really quite strongly in the market for such arguments.
Also, a separate issue with this: it sounds like this will systematically generate strategies which ignore unknown unknowns. It’s like the exact opposite of security mindset.
Here’s how the situation feels to me. I know this isn’t remotely fair as a summary of your view, it’s just intended to illustrate where ARC is coming from. (It’s also possible this is a research methodology disagreement, in which case I do just disagree strongly.)
Cryptographer: It seems like our existing proposals for “secure” communication are still vulnerable to man-in-the-middle attacks. Better infrastructure for key distribution is one way to overcome this particular attack, so let’s try to improve that. We can also see how this might fit in with the rest of our security infrastructure to help build toward a secure internet, though no doubt the details will change.
Cryptography skeptic: The real difficulty isn’t man in the middle attacks, it’s that security is really hard. By focusing on concrete stuff like man-in-the-middle you are overlooking the real nature of the problem, focusing on the known risks rather than the unknown unknowns. Someone with a true security mindset wouldn’t be fiddling around the edges like this.
I’m not saying that infrastructure for key distribution solves security (and indeed we have huge security problems). I’m saying that working on concrete problems is the right way to make progress in situations like this. I don’t think this is in tension with security mindset. In fact I think effective people with security mindset spend most of their time thinking about concrete risks and how to address them.
It’s great to generalize once you have a bunch of concrete risks and you think there is a deeper underlying pattern. But I think you basically need the examples to learn from, and if there is a real pattern then you should be able to instantiate it in any particular case rather than making reference to the pattern.

This was a good reply, I basically buy it. Thanks.
This comment made me notice a kind of duality:
- Paul wants to focus on finding concrete problems, and claims that Nate/Eliezer aren’t being very concrete with their proposed problems.
- Nate/Eliezer want to focus on finding concrete solutions, and claim that Paul/other alignment researchers aren’t being very concrete with their proposed solutions.
It seems like “how well do we understand the problem” is a crux here. I disagree with John’s comment because it feels like he’s assuming too much about our understanding of the problem. If you follow his strategy, then you can spend arbitrarily long trying to find a faithful concrete operationalization of a part of the problem that doesn’t exist.
I don’t feel like this is right (though I think this duality feels like a real thing that is important sometimes and is interesting to think about, so I appreciated the comment).
ARC is spending its time right now (i) trying to write down concrete algorithms that solve ELK using heuristic arguments, and then trying to produce concrete examples in which they do the wrong thing, and (ii) trying to write down concrete formalizations of heuristic arguments that have the desiderata needed for those algorithms to work, and then trying to identify cases in which our algorithms don’t yet meet those desiderata or in which the desiderata may be unachievable. The output is just actual code which is purported to solve major difficulties in alignment.
And on the flip side, I spend a significant amount of my time looking at the algorithms we are proposing (and the bigger plans into which they would fit if successful) and trying to find the best arguments I can that these plans will fail.
I think that the disagreement is more about what kind of concreteness is possible or desirable in this domain.
Put differently: I’m not saying that Nate and Eliezer are vague about problems but concrete about solutions; I’m saying they are vague about everything. And I don’t think they are saying that I’m concrete about problems but vague about solutions; they would say that I’m concrete about parts of the solution/problem that don’t matter while systematically pushing all the difficulty into the parts I’m still vague about.
I do think “how well do we understand the problem” seems like a pretty big crux; that leads Nate and Eliezer to think that I’m avoiding the predictably-important difficulty, and it leads me to think that Nate and Eliezer need to get more concrete in order to have an accurate picture of what’s going on.
Yeah, my comment was sloppily phrased; I agree with “I think that the disagreement is more about what kind of concreteness is possible or desirable in this domain.”
If you follow his strategy, then you can spend arbitrarily long trying to find a faithful concrete operationalization of a part of the problem that doesn’t exist.
I don’t think that’s how this works? The strategy I’m recommending explicitly contains two parts where we gain evidence about whether a part of the problem actually exists:
- noticing an intuitive pattern in the failure-modes of some strategies
- attempting to formalize (which presumably includes backpropagating our mathematics into our intuitions)
… so if a part of the problem doesn’t exist, then (a) we probably don’t notice a pattern in the first place, but even if our notoriously unreliable human pattern-matchers over-match, then (b) while we’re attempting to formalize we have plenty of opportunity to notice that maybe the pattern doesn’t actually exist the way we thought it did.
It feels like you’re looking for a duality which does not exist. I mean, the duality between “look for concrete solutions” and “look for concrete problems” I buy (and that would indeed cause one side to be over-optimistic and the other over-pessimistic in exactly the pattern we actually see between Paul and Nate/Eliezer). But it feels like you’re also looking for a duality between how-Paul’s-recommended-search-order-just-fails and how-mine-just-fails. And the reason that duality does not exist is that my recommended search order uses strictly more evidence; Paul is basically advocating ignoring a whole class of very useful evidence, and that makes his strategy straightforwardly suboptimal. If we were both picking different points on a Pareto frontier, then yeah, there’d be a trade-off. But Paul just isn’t on the Pareto frontier.
I feel confused about the difference between your “attempt to formalize” step and Paul’s “attempt to concretize” step. It feels like you can view either as a step towards the other: if you successfully formalize, then presumably you’ll be able to concretize; but also one valuable step towards formalizing is finding concrete examples and then generalizing from them. I think everyone agrees that it’d be great to end up with a formalism for the problem, but disagrees on how much that process should involve “finding concrete examples of the problem.” My own view is that since it’s so incredibly easy for people to get lost in abstractions, people should try to concretize much more when talking about highly abstract domains. (Even when people are confident that they’re not lost in abstractions, as Eliezer and Nate are, concretizing is still really useful for conveying ideas to other people.)