I don’t think those are great summaries. I think this probably reflects some misunderstanding about what ARC is trying to do and about what I mean by “concrete.” In particular, “concrete” doesn’t mean “formalized”; it means more like: you are able to discuss a bunch of concrete examples of the difficulty and why it leads to the failure of particular concrete approaches; you are able to point out where the problem will appear in a particular decomposition of the problem, and would revise your picture if that turned out to be wrong; etc.
You write:
But pretty quickly, we usually see intuitively-similar bottlenecks coming up again and again.
I don’t yet have this sense about a “sharp left turn” bottleneck.
I think I would agree with you if we’d looked at a bunch of plausible approaches, and then convinced ourselves that they would fail. And then we tried to introduce the sharp left turn to capture the unifying theme of those failures and to start exploring what’s really going on. At a high level that’s very similar to what ARC is doing day to day, looking at a bunch of approaches to a problem, seeing why they fail, and then trying to understand the nature of the problem so that we can succeed.
But for the sharp left turn I think we basically don’t have examples. Existing alignment strategies fail in much more basic ways, which I’d call “concrete.” We don’t have examples of strategies that avoid those concrete difficulties yet still fail for some vague and hard-to-understand reason that we’d summarize as a “sharp left turn.” So I don’t really believe that this difficulty is being abstracted from a pattern of failures.
There can be other ways to learn about problems, and I didn’t think Nate was even saying that this problem is derived from examples of obstructions to potential alignment approaches. I think Nate’s perspective is that he has some pretty good arguments and intuitions about why a sharp left turn will cause novel problems. And so a lot of what I’m saying is that I’m not yet buying it, that I think Nate’s argument has fatal holes in it which are hidden by its vagueness, and that if the arguments are really very important then we should be trying hard to make them more concrete and to address those holes.
You also write:

Is there some reason to expect that always working on the legible parts of a problem will somehow induce progress on the illegible parts, even when making-the-illegible-parts-legible is itself “the hard part”?
ARC does theoretical work guided by concrete stories about how a proposed AI system could fail; we are focused on the “legible part” insofar as we try to fix failures for which we can tell concrete stories. I’m not quite sure what you mean by “illegible,” so this might just be a miscommunication, but I think “failures for which we can’t tell concrete stories” is the relevant sense of “illegible,” so I’ll respond briefly to it.
I think we can tell concrete stories about deceptive alignment; about ontology mismatches making it hard or meaningless to “elicit latent knowledge;” about exploitability of humans making debate impossible; and so on. And those stories seem to do a great job of capturing the reasons why we would expect existing alignment approaches to fail. So if we addressed these concrete stories I would feel like we’d made real progress. That’s a huge part of my optimism about concrete stories.
It feels to me like either we are miscommunicating about what ARC is doing, or you are saying that those concrete difficulties aren’t the really important failures. That even if an alignment approach addressed all of them, it still wouldn’t represent meaningful progress because the true risk is the risk that cannot be named.
One thing you might mean is that “these concrete difficulties are just shadows of a deeper core.” But I think that’s not actually a challenge to ARC’s approach at all, and it’s not that different from my own view. I think that if you have an intuitive sense of a deep problem, then it’s really great to attack specific instantiations of the problem as a way to learn about the deep core. I feel pretty good about this approach, and I think it’s pretty standard in most disciplines that face problems like this (e.g. if you are deeply confused about physics, it’s good to think a lot about the simplest concrete confusing phenomenon and understand it well; if you are confused about how to design algorithms that overcome a conceptual barrier, it’s good to think about the simplest concrete task that requires crossing that barrier; etc.).
Another thing you might mean is that “these concrete difficulties are distractions from a bigger difficulty that emerges at a later step of the plan.” It’s worth noting that ARC really does try to look at the whole plan and pick the step that is most likely to fail. But I do think it would be a problem for our methodology if there were a good argument for why plans will fail that doesn’t let us tell a concrete story about what the failure looks like. My position right now is that I don’t see such an argument; I think we have some vague intuitions, and we have a bunch of examples which do correspond to concrete failure stories. I don’t think there are any examples from which to infer the existence of a difficulty that can’t be captured in concrete stories, and I’m not yet aware of arguments that I find persuasive without any examples. But I’m really quite strongly in the market for such arguments.
You also write:

Also, a separate issue with this: it sounds like this will systematically generate strategies which ignore unknown unknowns. It’s like the exact opposite of security mindset.
Here’s how the situation feels to me. I know this isn’t remotely fair as a summary of your view; it’s just intended to illustrate where ARC is coming from. (It’s also possible this is a research methodology disagreement, in which case I do just disagree strongly.)
Cryptographer: It seems like our existing proposals for “secure” communication are still vulnerable to man-in-the-middle attacks. Better infrastructure for key distribution is one way to overcome this particular attack, so let’s try to improve that. We can also see how this might fit in with the rest of our security infrastructure to help build toward a secure internet, though no doubt the details will change.
Cryptography skeptic: The real difficulty isn’t man in the middle attacks, it’s that security is really hard. By focusing on concrete stuff like man-in-the-middle you are overlooking the real nature of the problem, focusing on the known risks rather than the unknown unknowns. Someone with a true security mindset wouldn’t be fiddling around the edges like this.
I’m not saying that infrastructure for key distribution solves security (and indeed we have huge security problems). I’m saying that working on concrete problems is the right way to make progress in situations like this. I don’t think this is in tension with security mindset. In fact I think effective people with security mindset spend most of their time thinking about concrete risks and how to address them.
It’s great to generalize once you have a bunch of concrete risks and you think there is a deeper underlying pattern. But I think you basically need the examples to learn from, and if there is a real pattern then you should be able to instantiate it in any particular case rather than making reference to the pattern.
This was a good reply, I basically buy it. Thanks.