I guess at the end of the day I imagine avoiding this particular problem by building AGIs without using “blind search over a super-broad, probably-even-Turing-complete, space of models” as one of the ingredients. I guess I’m just unusual in thinking that this is a feasible, and even probable, way that people will build AGIs… (Of course I just wind up with a different set of unsolved AGI safety problems instead...)
Wait, you think your prosaic story doesn’t involve blind search over a super-broad space of models??
I think any prosaic story involves blind search over a super-broad space of models, unless/until the prosaic methodology changes, which I don’t particularly expect it to.
I agree that replacing “blind search” with different tools is a very important direction. But your proposal doesn’t do that!
By and large, we expect trained models to do (1) things that are directly incentivized by the training signal (intentionally or not), (2) things that are indirectly incentivized by the training signal (they’re instrumentally useful, or they’re a side-effect, or they “come along for the ride” for some other reason), and (3) things that are so simple to do that they can happen randomly.
So I guess I can imagine a strategy of saying “mesa-optimization won’t happen” in some circumstance because we’ve somehow ruled out all three of those categories.
This kind of argument does seem like a not-especially-promising path for safety research, in practice. For one thing, it seems hard: we may be wrong about what’s instrumentally useful, or we may overlook part of the space of possible strategies, etc. For another thing, mesa-optimization is at least somewhat incentivized by almost any training procedure, I would think.
I agree with this general picture. While I’m primarily knocking down bad complexity-based arguments in my post, I would be glad to see someone working on trying to fix them.
...Hmm, in our recent conversation, I might have said that mesa-optimization is not incentivized in predictive (self-supervised) learning. I forget. But if so, I was confused. I have long believed that mesa-optimization is useful for prediction and still do. Specifically, the directly-incentivized kind of “mesa-optimization in predictive learning” entails, for example, searching over different possible approaches to process the data and generate a prediction, and then taking the most promising approach.
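To make that concrete, here’s a purely illustrative toy sketch of the “search over possible approaches, then take the most promising one” pattern showing up inside a predictor’s forward pass. All of the strategies and names below are invented for the example; this isn’t anyone’s actual proposal or system.

```python
# Purely illustrative toy example; every function and name here is invented.
# The point: a "predictor" whose forward pass is itself a small search over
# candidate prediction strategies, scored on recent data, with the most
# promising one then used. That internal search is the directly-incentivized
# sense of "mesa-optimization in predictive learning" described above.

from typing import Callable, Sequence

Strategy = Callable[[Sequence[float]], float]

def persistence(history: Sequence[float]) -> float:
    """Predict that the next value equals the last observed value."""
    return history[-1]

def linear_trend(history: Sequence[float]) -> float:
    """Extrapolate the most recent step."""
    return history[-1] + (history[-1] - history[-2])

def running_mean(history: Sequence[float]) -> float:
    """Predict the mean of the recent window."""
    window = history[-5:]
    return sum(window) / len(window)

def predict(history: Sequence[float],
            strategies: Sequence[Strategy] = (persistence, linear_trend, running_mean)) -> float:
    """Score each candidate strategy by how well it would have predicted the
    recent past, then take the most promising one for the next prediction."""
    def recent_error(s: Strategy) -> float:
        errors = [abs(s(history[:t]) - history[t]) for t in range(3, len(history))]
        return sum(errors) / len(errors)
    best = min(strategies, key=recent_error)  # the internal "search" step
    return best(history)

print(predict([1.0, 2.0, 3.0, 4.0, 5.0]))  # linear_trend wins here, so this prints 6.0
```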
There were a lot of misunderstandings in the earlier part of our conversation, so I could well have misinterpreted one of your points.
But if so, I’m struggling even more to see why you would have been optimistic that your RL scenario doesn’t involve risk due to unintended mesa-optimization.
Anyway, what I should have said was that, in certain types of predictive learning, mesa-optimizers that search over active, real-world-manipulating plans are not incentivized—and then that’s part of an argument that such mesa-optimizers are improbable.
By your own account, the other part would be to argue that they’re not simple, which you haven’t done. They’re not actively disincentivized, because they can use the planning capability to perform well on the task (deceptively). So they can be selected for just as much as other hypotheses, and might be simple enough to be selected in fact.
Wait, you think your prosaic story doesn’t involve blind search over a super-broad space of models??
No, not prosaic, that particular comment was referring to the “brain-like AGI” story in my head...
Like, I tend to emphasize the overlap between my brain-like AGI story and prosaic AI, and there is plenty of overlap: they both involve “neural nets”, (something like) gradient descent, RL, etc.
By contrast, I haven’t written quite as much about the ways that my (current) brain-like AGI story is non-prosaic. And a big one is that I’m thinking that there would be a hardcoded (by humans) inference algorithm that looks like (some more complicated cousin of) PGM belief propagation.
In that case, yes there’s a search over a model space, because we need to find the (more complicated cousin of a) PGM world-model. But I don’t think that model space affords the same opportunities for mischief that you would get in, say, a 100-layer DNN. Not having thought about it too hard… :-P
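As a minimal toy sketch of that division of labor (invented for illustration, not a description of any real system): the inference algorithm below is plain sum-product belief propagation, fixed hand-written code, and the only thing left for a learning process to search over is the graph and its potentials, i.e. the world-model. The variables, potentials, and numbers are all made up, and a real “more complicated cousin” would of course look very different.

```python
import numpy as np

# Toy sketch, invented for illustration: the *inference algorithm* (plain
# sum-product belief propagation on a tree) is fixed, hand-written code, and
# the "model space" a learning process would search over is just the graph
# and its potentials: here, a 3-variable chain A - B - C of binary variables.

unary = {                                  # local evidence at each variable
    "A": np.array([0.9, 0.1]),             # A is probably 0
    "B": np.array([0.5, 0.5]),             # no direct evidence on B
    "C": np.array([0.2, 0.8]),             # C is probably 1
}
pairwise = {                               # pairwise potentials on the edges
    ("A", "B"): np.array([[0.8, 0.2],
                          [0.2, 0.8]]),    # A and B tend to agree
    ("B", "C"): np.array([[0.8, 0.2],
                          [0.2, 0.8]]),    # B and C tend to agree
}
edges = list(pairwise)

def message(src, dst):
    """Sum-product message from src to dst (the recursion terminates on a tree)."""
    incoming = np.ones(2)                  # messages into src from its other neighbors
    for (u, v) in edges:
        for (a, b) in [(u, v), (v, u)]:
            if b == src and a != dst:
                incoming = incoming * message(a, src)
    psi = pairwise[(src, dst)] if (src, dst) in pairwise else pairwise[(dst, src)].T
    m = (unary[src] * incoming) @ psi      # marginalize src out
    return m / m.sum()

def marginal(x):
    """Belief at variable x: local evidence times all incoming messages, normalized."""
    belief = unary[x].copy()
    for (u, v) in edges:
        if u == x:
            belief = belief * message(v, x)
        elif v == x:
            belief = belief * message(u, x)
    return belief / belief.sum()

print(marginal("B"))  # pulled toward 0 by A's evidence, toward 1 by C's
```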
No, not prosaic, that particular comment was referring to the “brain-like AGI” story in my head...
Ah, ok. It sounds like I have been systematically mis-perceiving you in this respect.
By contrast, I haven’t written quite as much about the ways that my (current) brain-like AGI story is non-prosaic. And a big one is that I’m thinking that there would be a hardcoded (by humans) inference algorithm that looks like (some more complicated cousin of) PGM belief propagation.
I would have been much more interested in your posts in the past if you had emphasized this aspect more ;p But perhaps you held back on that to avoid contributing to capabilities research.
In that case, yes there’s a search over a model space, because we need to find the (more complicated cousin of a) PGM world-model. But I don’t think that model space affords the same opportunities for mischief that you would get in, say, a 100-layer DNN. Not having thought about it too hard… :-P
Yeah, this is a very important question!