I originally conceived of it as such, but in hindsight, it doesn’t seem right.
In contrast, the generalization-focused approach puts less emphasis on the assumption that the worst catastrophes are intentional.
I don’t think this is actually a con of the generalization-focused approach.
By no means did I intend it to be a con. I’ll try to edit to clarify. I think it is a real pro of the generalization-focused approach that it does not rely on models having mesa-objectives (putting it in Evan’s terms, there is a real possibility of addressing objective robustness without directly addressing inner alignment). So, focusing on objective robustness seems like a potential advantage—it opens up more avenues of attack. Plus, the generalization-focused approach requires a much weaker notion of “outer alignment”, which may be easier to achieve as well.
But, of course, it may also turn out that the only way to achieve objective robustness is to directly tackle inner alignment. And it may turn out that the weaker notion of outer alignment is insufficient in reality.
Are you the historical origin of the robustness-centric approach? I noticed that Evan’s post has the modified robustness-centric diagram in it, but I don’t know if it was edited to include that. The “Objective Robustness and Inner Alignment Terminology” post attributes it to you (at least, attributes a version of it to you). (I didn’t look at the references there yet.)
Are you the historical origin of the robustness-centric approach?
Idk, probably? It’s always hard for me to tell; so much of what I do is just read what other people say and make the ideas sound sane to me. But stuff I’ve done that’s relevant:
Talk at CHAI saying something like “daemons are just distributional shift” in August 2018, I think. (I remember Scott attending it.)
Talk at FHI in February 2020 that emphasized a risk model where objectives generalize but capabilities don’t.
Talk at SERI conference a few months ago that explicitly argued for a focus on generalization over objectives.
Especially relevant stuff other people have done that has influenced me:
Two guarantees (arguably this should be thought of as the origin)
2-D Robustness
(My views were pretty set by the time Evan wrote the clarifying inner alignment terminology post; it’s possible that his version that’s closer to generalization-focused was inspired by things I said, you’d have to ask him.)
I’ve watched your talk at SERI now.
One question I have is how you hope to define a good notion of “acceptable” without a notion of intent. In your talk, you mention looking at why the model does what it does, in addition to just looking at what it does. This makes sense to me (I talk about similar things), but it seems just about as fraught as the notion of mesa-objective:
It requires approximately the same “magic transparency tech” as we need to extract mesa-objectives.
Even with magical transparency tech, it requires additional insight as to which reasoning is acceptable vs unacceptable.
If you are pessimistic about extracting mesa-objectives, why are you optimistic about providing feedback about how to reason? More generally, what do you think “acceptability” might look like?
(By no means do I mean to say your view is crazy; I am just looking for your explanation.)
One question I have is how you hope to define a good notion of “acceptable” without a notion of intent.
I don’t hope this; I expect to use a version of “acceptable” that uses intent. I’m happy with “acceptable” = “trying to do what we want”.
If you are pessimistic about extracting mesa-objectives, why are you optimistic about providing feedback about how to reason?
I’m pessimistic about mesa-objectives existing in actual systems, based on how people normally seem to use the term “mesa-objective”. If you instead just say that a “mesa objective” is “whatever the system is trying to do”, without attempting to cash it out as some simple utility function that is being maximized, or the output of a particular neuron in the neural net, etc, then that seems fine to me.
One other way in which “acceptability” is better is that rather than require it of all inputs, you can require it of all inputs that are reasonably likely to occur in practice, or something along those lines. (And this is what I expect we’ll have to do in practice given that I don’t expect to fully mechanistically understand a large neural network; the “all inputs” should really be thought of as a goal we’re striving towards.) Whereas I don’t see how you do this with a mesa-objective (as the term is normally used); it seems like a mesa-objective must apply on any input, or else it isn’t a mesa-objective.
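One rough way to write down the contrast (this is my own notation, just to make the quantifier difference explicit; it isn’t something from the talk or the original posts):

```latex
\begin{align*}
% acc(M, x): ``M behaves acceptably on input x''
% D: the distribution of inputs actually encountered in practice
\text{strict acceptability:}  \quad & \forall x.\; \mathrm{acc}(M, x) \\
\text{relaxed acceptability:} \quad & \forall x.\; \Pr_{D}(x) \ge \epsilon \;\Rightarrow\; \mathrm{acc}(M, x) \\
\text{mesa-objective (as usually used):} \quad & \exists U.\; \forall x.\; M(x) \in \arg\max_{a} U(x, a)
\end{align*}
```

The relaxed version has a knob (ε, or whatever stands in for “reasonably likely”) that the mesa-objective framing doesn’t obviously have, since the existential over U already quantifies over all inputs.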
I’m mostly not trying to make claims about which one is easier to do; rather I’m saying “we’re using the wrong concepts; these concepts won’t apply to the systems we actually build; here are some other concepts that will work”.
All of that made perfect sense once I thought through it, and I tend to agree with most of it. I think my biggest disagreement with you is that (in your talk) you said you don’t expect formal learning theory work to be relevant. I agree with your points about classical learning theory, but the alignment community has been developing basically-classical-learning-theory tools which go beyond those limitations. I’m optimistic that stuff like Vanessa’s InfraBayes could help here.
Granted, there’s a big question of whether that kind of thing can be competitive. (Although there could potentially be a hybrid approach.)
My central complaint about existing theoretical work is that it doesn’t seem to be trying to explain why neural nets learn good programs that generalize well, even when they have enough parameters to overfit and can fit a randomly labeled dataset. It seems like you need to make some assumption about the real world (i.e. an assumption about your dataset, or the training process that generated it), which people seem loath to do.
I don’t currently see how any of the alignment community’s tools address that complaint; for example I don’t think the InfraBayes work so far is making an interesting assumption about reality. Perhaps future work will address this though?
InfraBayes doesn’t look for the regularity in reality that NNs are taking advantage of, agreed. But InfraBayes is exactly about “what kind of regularity assumptions can we realistically make about reality?” You can think of it as a reaction to the unrealistic nature of the regularity assumptions which Solomonoff induction makes. So it offers an answer to the question “what useful+realistic regularity assumptions could we make?”
The InfraBayesian answer is “partial models”. IE, the idea that even if reality cannot be completely described by usable models, perhaps we can aim to partially describe it. This is an assumption about the world—not all worlds can be usefully described by partial models. However, it’s a weaker assumption about the world than usual. So it may not have presented itself as an assumption about the world in your mind, since perhaps you were thinking more of stronger assumptions.
If it’s a good answer, it’s at least plausible that NNs work well for related reasons.
But I think it also makes sense to try to get at the useful+realistic regularity assumptions from scratch, rather than necessarily making it all about NNs.
This is an assumption about the world—not all worlds can be usefully described by partial models.
They can’t? Why not?
Maybe the “usefully” part is doing a lot of work here—can all worlds be described (perhaps not usefully) by partial models? If so, I think I have the same objection, since it doesn’t seem like any of the technical results in InfraBayes depend on some notion of “usefulness”.
(I think it’s pretty likely I’m just flat out wrong about something here, given how little I’ve thought about InfraBayesianism, but if so I’d like to know how I’m wrong.)
Answer 1
I meant to invoke a no-free-lunch type intuition; we can always construct worlds where some particular tool isn’t useful.
My go-to would be “a world that checks what an InfraBayesian would expect, and does the opposite”. This is enough for the narrow point I was trying to make (that InfraBayes does express some kind of regularity assumption about the world), but it’s not very illustrative or compelling for my broader point (that InfraBayes plausibly addresses your concerns about learning theory). So I’ll try to tell a better story.
Answer 2
I might be describing logically impossible (or at least uncomputable) worlds here, but here is my story:
Solomonoff Induction captures something important about the regularities we see in the universe, but it doesn’t explain NN learning (or “ordinary human learning”) very well, because NNs and humans mostly use very fast models which are clearly much smaller (in time-complexity and space-complexity) than the universe. (Solomonoff induction is closer to describing human science, which does use these very simple but time/space-complex models.)
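(For reference, the standard form of the Solomonoff prior makes this explicit: a program is penalized only for its description length, never for its runtime or memory use, so the highest-weight explanations can be very short programs that are arbitrarily expensive to run.)

```latex
% U: a universal prefix machine;  |p|: length of program p in bits
% ``U(p) = x*'': running p on U produces an output that begins with x
\[
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-|p|}
\]
```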
So there’s this remaining question of induction: why can we do induction in practice? (IE, with NNs and with nonscientific reasoning)
InfraBayes answers this question by observing that although we can’t easily use Solomonoff-like models of the whole universe, there are many patterns we can take advantage of which can be articulated with partial models.
This didn’t need to be the case. We could be in a universe in which you need to fully model the low-level dynamics in order to predict things well at all.
So, a regularity which InfraBayes takes advantage of is the fact that we see multi-scale phenomena—that simple low-level rules often give rise to simple high-level behavior as well.
I say “maybe I’m describing logically impossible worlds” here because it is hard to imagine a world where you can construct a computer but where you don’t see this kind of multi-level phenomena. Mathematics is full of partial-model-type regularities; so, this has to be a world where mathematics isn’t relevant (or, where mathematics itself is different).
But Solomonoff induction alone doesn’t give a reason to expect this sort of regularity. So, if you imagine a world being drawn from the Solomonoff prior vs a world being drawn from a similar InfraBayes prior, I think the InfraBayes prior might actually generate worlds more like the one we find ourselves in (ie, InfraBayes contains more information about the world).
(Although actually, I don’t know how to “sample from an infrabayes prior”...)
“Usefully Describe”
Maybe the “usefully” part is doing a lot of work here—can all worlds be described (perhaps not usefully) by partial models? If so, I think I have the same objection, since it doesn’t seem like any of the technical results in InfraBayes depend on some notion of “usefulness”.
Part of what I meant by “usefully describe” was to contrast runnable models with non-runnable models. EG, even if Solomonoff induction turned out to be the more accurate prior for dealing with our world, it’s not very useful because it endorses hypotheses which we can’t efficiently run.
I mentioned that I think InfraBayes might fit the world better than Solomonoff. But what I actually predict more strongly is that if we compare time-bounded versions of both priors, time-bounded InfraBayes would do better thanks to its ability to articulate partial models.
I think it’s also worth pointing out that the technical results of InfraBayes do in fact address a notion of usefulness: part of the point of InfraBayes is that it translates to decision-making learning guarantees (eg, guarantees about the performance of RL agents) better than Bayesian theories do. Namely, if there is a partial model such that the agent would achieve nontrivial reward if it believed it, then the agent will eventually do at least that well. So, to succeed, InfraBayes relies on an assumption about the world—that there is a useful partial model. (This is the analog of the Solomonoff induction assumption that there exists a best computable model of the world.)
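To spell out the shape of that guarantee a bit more (this is my loose paraphrase in made-up notation, not the actual theorem statement from the infra-Bayesianism sequence): if the true environment is consistent with some partial model that the prior takes seriously, the agent eventually does at least as well as it could have guaranteed for itself by trusting that partial model.

```latex
% \mathcal{H}: partial models with positive prior weight
% \mu \models h: the true environment \mu is consistent with partial model h
% V^{*}(h): the best reward an agent could guarantee if it assumed h
\[
\liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right]
  \;\ge\; \sup_{h \in \mathcal{H} \,:\, \mu \models h} V^{*}(h)
\]
```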
So although it wasn’t what I was originally thinking, it would also be reasonable to interpret “usefully describe” as “describe in a way which gives nontrivial reward bounds”. I would be happy to stand by this interpretation as well: as an assumption about the real world, I’m happy to assert that there are usually going to be partial models which (are accurate and) give good reward bounds.
What I Think You Should Think
I think you should think that it’s plausible we will have learning-theoretic ideas which apply directly to the objects of concern, in the sense that, under some plausible assumptions about the world, we can argue for a learning-theoretic guarantee about some system we can describe, a guarantee which theoretically addresses some alignment concern.
I don’t want to strongly argue that you should think this will be competitive with NNs or anything like that. Obviously I prefer worlds where that’s true, but I am not trying to argue that. Even if in some sense InfraBayes (or some other theory) turns out to explain the success of NNs, that does not actually imply it’ll give rise to something competitive with NNs.
I’m wondering if that’s a crux for your interest. Honestly, I don’t really understand what’s going on behind this remark:
My central complaint about existing theoretical work is that it doesn’t seem to be trying to explain why neural nets learn good programs that generalize well, even when they have enough parameters to overfit and can fit a randomly labeled dataset. It seems like you need to make some assumption about the real world (i.e. an assumption about your dataset, or the training process that generated it), which people seem loath to do.
Why is this your central complaint about existing theoretical work? My central complaint is that pre-existing learning theory didn’t give us what we need to slot into a working alignment argument. In your presentation you listed some of those complaints, too. This seems more important to me than whether we can fully explain the success of large NNs.
My original interpretation of your remark was that you wanted to argue “learning theory makes bad assumptions about the world. To make strong arguments for alignment, we need to make more realistic assumptions. But these more realistic assumptions are necessarily of an empirical, non-theoretic nature.” But I think InfraBayes in fact gets us closer to assumptions that are (a) realistic and (b) suited to arguments we want to make about alignment.
In other words, I had thought that you had (quite reasonably!) given up on learning theory because its results didn’t seem relevant. I had hoped to rekindle your interest by pointing out that we can now do much better than 90s-era learning theory, in ways that seem relevant for EG objective robustness.
My personal theory about large NNs is that they act as a mixture model. It would be surprising if I told you that some genetic algorithm found a billion-bit program that described the data perfectly and then generalized well. It would be much less surprising if I told you that this billion-bit program was actually a mixture model that had been initialized randomly and then tuned by the genetic algorithm. From a Bayesian perspective, I expect a large random mixture model which then gets tuned to eliminate sub-models which are just bad on the data to be a pretty good approximation of my posterior, and therefore, I expect it to generalize well.
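Here is a toy version of that picture (purely an illustration I made up; linear sub-models and an arbitrary survival threshold stand in for whatever the analogous structure in a real NN would be):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: linearly separable labels in 5 dimensions.
d, n_train, n_test = 5, 100, 2000
w_true = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train = np.sign(X_train @ w_true)
y_test = np.sign(X_test @ w_true)

# "Large random mixture": a big bank of randomly initialized sub-models.
n_models = 50_000
W = rng.normal(size=(n_models, d))

# "Tuning" = discard the sub-models which are just bad on the training data.
train_acc = (np.sign(X_train @ W.T) == y_train[:, None]).mean(axis=0)
survivors = W[train_acc >= 0.75]

# The surviving ensemble acts like a crude posterior: its average
# generalizes noticeably better than a typical individual survivor.
ensemble_pred = np.sign(X_test @ survivors.mean(axis=0))
indiv_acc = (np.sign(X_test @ survivors.T) == y_test[:, None]).mean(axis=0)
print(f"survivors: {len(survivors)}")
print(f"mean individual test accuracy: {indiv_acc.mean():.3f}")
print(f"ensemble test accuracy:        {(ensemble_pred == y_test).mean():.3f}")
```

The point of the toy is just that “keep whatever fits the data and average” already buys a lot of generalization, without any single sub-model having to be a cleverly constructed program.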
But my beliefs about this don’t seem too cruxy for my beliefs about what kind of learning theory will be useful for alignment.
Why is this your central complaint about existing theoretical work?
Sorry, I meant that that was my central complaint about existing theoretical work that is trying to explain neural net generalization. (I was mostly thinking of work outside of the alignment community.) I wasn’t trying to make a claim about all theoretical work.
It’s my central complaint because we ~know that such an assumption is necessary (since the same neural net that generalizes well on real MNIST can also memorize a randomly labeled MNIST where it will obviously fail to generalize).
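(For what it’s worth, this kind of randomization experiment, the sort popularized by Zhang et al.’s “Understanding deep learning requires rethinking generalization”, is easy to reproduce in miniature. Here is a rough sketch using sklearn’s small digits dataset as a stand-in for MNIST; the architecture and hyperparameters are arbitrary choices for illustration.)

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel values to [0, 1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
y_tr_random = rng.permutation(y_tr)  # destroy any input-label relationship

def fit_and_report(train_labels, name):
    # Deliberately over-parameterized relative to the ~1350 training points.
    net = MLPClassifier(hidden_layer_sizes=(512, 512), alpha=0.0,
                        max_iter=2000, random_state=0)
    net.fit(X_tr, train_labels)
    print(f"{name}: train acc {net.score(X_tr, train_labels):.2f}, "
          f"test acc {net.score(X_te, y_te):.2f}")

# Same architecture, same training procedure; only the labels differ.
fit_and_report(y_tr, "true labels")           # fits the data and generalizes
fit_and_report(y_tr_random, "random labels")  # can still drive train acc high,
                                              # but test acc sits near chance (~10%)
```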
InfraBayes answers this question by observing that although we can’t easily use Solomonoff-like models of the whole universe, there are many patterns we can take advantage of which can be articulated with partial models.
I feel pretty convinced by this :) In particular the assumption on the real world could be something like “there exists a partial model that describes the real world well enough that we can prove a regret bound that is not vacuous” or something like that. And I agree this seems like a reasonable assumption.
Even if in some sense InfraBayes (or some other theory) turns out to explain the success of NNs, that does not actually imply it’ll give rise to something competitive with NNs.
Tbc I would see this as a success.
In other words, I had thought that you had (quite reasonably!) given up on learning theory because its results didn’t seem relevant. I had hoped to rekindle your interest by pointing out that we can now do much better than 90s-era learning theory, in ways that seem relevant for EG objective robustness.
I am interested! I listed it as one of the topics I saw as allowing us to make claims about objective robustness. I’m just saying that the current work doesn’t seem to be making much progress (I agree now though that InfraBayes is plausibly on a path where it could eventually help).
It would be surprising if I told you that some genetic algorithm found a billion-bit program that described the data perfectly and then generalized well. It would be much less surprising if I told you that this billion-bit program was actually a mixture model that had been initialized randomly and then tuned by the genetic algorithm.
Fwiw I don’t feel the force of this intuition, they seem about equally surprising (but I agree with you that it doesn’t seem cruxy).
Great, I feel pretty resolved about this conversation now.