I don’t think incompetence is the only reason to try to pull off a treacherous turn at the same time that other models do. Some timesteps are just more important, so there’s a trade-off. And what’s traded off is a public good: among treacherous models, it is a public good for the models’ moments of treachery to be spread out.
Trying to defect at time T is only a good idea if it’s plausible that your mechanism isn’t going to notice the uncertainty at time T and then query the human. So it seems to me like this argument can never drive P(successful treachery) super low, or else it would be self-defeating (for competent treacherous agents).
The treacherous model has to compute the truth, and then also decide when and how to execute treachery. So the subroutine they run to compute the truth, considered as a model in its own right, seems to me like it must be simpler.
Subroutines of functions aren’t always simpler than the functions that contain them, and even without treachery the failure of that claim seems basically guaranteed here. If “the truth” is just a simulation of a human, then on this view the truth would be a subroutine of the simplest model. But instead you could have a model of the form “Run program X, treat its output as a program, then run that.” Since humans are compressible, that will clearly be smaller.
Now you want to say that whatever reasoning a treacherous agent does in order to compress the human, you can just promote that to the outside model as well. But that argument seems to be precisely where the meat is (and it’s the kind of thing I spend most of my time on). If that works, then it seems like you don’t even need a further solution to inner alignment; just use the simplest model.
(Maybe there is a hope that the intended model won’t be “much” more complex than the treacherous models, without literally saying that it’s the simplest, but at that point I’m back to wondering whether “much” is like 0.01% or 0.00000001%.)
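To make the shape of that “run program X, then run its output” model concrete, here is a minimal Python sketch (illustration only: program_x and the compressed inner source are hypothetical stand-ins; in the argument above, X would be a short search or decompression procedure whose output is the long human-simulation program, not something storing it verbatim):

```python
# Illustration only: a model whose own source never contains the "truth" program
# as a literal subroutine. It runs a short program X, treats X's output as a
# program, and then runs that program on the input.
import zlib

# Stand-in for what X reconstructs: the source of the (long) "truth" program,
# stored here in compressed form purely to keep the example self-contained.
COMPRESSED_INNER = zlib.compress(
    b"def predict(history):\n    return 'whatever the human would do'\n"
)

def program_x() -> str:
    """A short program whose output is the source code of the 'truth' program."""
    return zlib.decompress(COMPRESSED_INNER).decode()

def outer_model(history):
    namespace = {}
    exec(program_x(), namespace)          # treat X's output as a program...
    return namespace["predict"](history)  # ...then run that program

print(outer_model([]))
```

The point is purely structural: the human-simulation source shows up as data produced at runtime, not as a subroutine written into the outer model’s own code.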
So it seems to me like this argument can never drive P(successful treachery) super low, or else it would be self-defeating (for competent treacherous agents)
I don’t follow. Can’t races to the bottom destroy all value for the agents involved? Consider the claim: firms will never set prices to equal marginal cost because that would destroy all profit.
Now you want to say that whatever reasoning a treacherous agent does in order to compress the human, you can just promote that to the outside model as well. But that argument seems to be precisely where the meat is (and it’s the kind of thing I spend most of my time on). If that works, then it seems like you don’t even need a further solution to inner alignment; just use the simplest model.
Yeah, okay. This is my mainline opinion, and I just think about inner alignment in case it’s wrong. So for now we can assume that it is wrong, since there’s not much more to say if it’s right.
Maybe there is a hope that the intended model won’t be “much” more complex than the treacherous models, without literally saying that it’s the simplest, but at that point I’m back to wondering whether “much” is like 0.01% or 0.00000001%
That is my next guess, but I hadn’t thought in terms of percentages before this. I had thought of normally distributed noise in the number of bits.
It’s occurring to me that this question doesn’t matter to our present discussion. What makes the linear regime linear rather than logarithmic is that if p(treachery)/p(honest model) is high, that allows for a large number of treacherous models to have greater posterior weight than the truth. But if a single model has n times the posterior weight of the truth, it still only takes one query to the demonstrator to interrupt its treacherous turn, not n queries.
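To spell out the bookkeeping (a sketch, writing $w(\cdot)$ for posterior weight and assuming each treacherous model is discredited by the single query that catches its first deliberate error):

\[
\#\{\text{queries due to treachery}\}
\;\lesssim\;
\#\{\mu\ \text{treacherous} : w(\mu) > w(\mu_{\mathrm{truth}})\}
\;<\;
\frac{\sum_{\mu\ \text{treacherous}} w(\mu)}{w(\mu_{\mathrm{truth}})},
\]

where the right-hand side is the p(treachery)/p(truth) ratio above. So the query count scales with how many treacherous models sit above the truth, not with how far above it any single one sits.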
I don’t follow. Can’t races to the bottom destroy all value for the agents involved?
You are saying that a special moment is a particularly great one to be treacherous. But if P(discovery) is 99.99% during that period, and there is any other treachery-possible period where P(discovery) is small, then that other period would have been better after all. Right?
This doesn’t seem analogous to producers driving down profits to zero, because those firms had no other opportunity to make a profit with their machine. It’s like you’re saying: there are tons of countries where firms could use their machines to make stuff and sell it at a profit (more countries than firms). But some of the places are more attractive than others, so probably everyone will sell in those places and drive profits to zero. And I’m saying: but then aren’t those incredibly-congested countries actually worse places to sell? This scenario is only possible if firms are making so much stuff that they can drive profit down to zero in every country, since any country with remaining profit would necessarily be the best place to sell.
Yeah, okay. This is my mainline opinion, and I just think about inner alignment in case it’s wrong
It seems to me like this is clearly wrong in the limit (since simple consequentialists would take over simple physics). It also seems probably wrong to me for smaller models (i.e. without huge amounts of internal selection) but it would be good for someone to think about that case more seriously.
It’s occurring to me that this question doesn’t matter to our present discussion. What makes the linear regime linear rather than logarithmic is that if p(treachery)/p(honest model) is high, that allows for a large number of treacherous models to have greater posterior weight than the truth. But if a single model has n times the posterior weight of the truth, it still only takes one query to the demonstrator to interrupt its treacherous turn, not n queries.
If the treacherous model is 100 bits smaller, then it feels like there must be around 2^100 treacherous models that are all simpler than the intended one. If nothing else, it seems like you could literally add garbage bits to the treacherous models (or useful bits!).
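One way to make that count concrete (a sketch, assuming an encoding in which a program can be padded with k ignored bits without changing its behavior): for each k < 100 there are 2^k ways to pad the 100-bits-smaller treacherous model with k garbage bits, and every resulting program is still shorter than the intended model, giving

\[
\sum_{k=0}^{99} 2^{k} \;=\; 2^{100} - 1 \;\approx\; 2^{100}
\]

such models.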
If nothing else, it seems like you could literally add garbage bits to the treacherous models
Okay, sure.
It seems to me like this is clearly wrong in the limit (since simple consequentialists would take over simple physics).
It’s not clear to me that there isn’t meaningful overhead involved.
You are saying that a special moment is a particularly great one to be treacherous. But if P(discovery) is 99.99% during that period, and there is any other treachery-possible period where P(discovery) is small, then that other period would have been better after all. Right?
I agree with what you’re saying, but I don’t see how it contradicts what I was saying. First, what I had in mind in saying that some timesteps are better for treachery than others is that when the agent acts on a false prediction at such a timestep, it has a greater effect on the world, though of course P(discovery) is also relevant. But my point is that when multiple treacherous models pick the same timestep to err, there may be pros and cons to doing so, but one thing that isn’t on the cons list is that, in the long run, it makes our lives easier if they do. Making our lives difficult is a public good for treacherous models.
So treacherous models won’t be trying to avoid collisions in order to make queries be linear in p(treachery)/p(truth). If P(discovery) increases when multiple models are trying to be treacherous at the same time—which we could go on to discuss; it’s not obvious to me either way as of yet—that will be balanced against the inherent variation in some timesteps being better for treachery than others.
So treacherous models won’t be trying to avoid collisions in order to make queries be linear in p(treachery)/p(truth). If P(discovery) increases when multiple models are trying to be treacherous at the same time—which we could go on to discuss; it’s not obvious to me either way as of yet—that will be balanced against the inherent variation in some timesteps being better for treachery than others.
If you are willing to query only a billion points, and there are a billion competent treacherous models, and if the game is zero-sum between you and the models, then you’re going to get at least as screwed as by the billionth-best opportunity for treachery. Otherwise, one of the treacherous models made a mistake: apparently there was a top-billion treachery opportunity where you didn’t query, so they should have been able to get at least that much value (and you should have lost at least that much value).
That’s not quite right since your querying behavior will be stochastic (at least from the model’s epistemic position). But you get the same conclusion to within a constant factor: if there are 2 billion treacherous models, then out of the top 2 billion opportunities for treachery, there must be some where our subjective probability of querying is at most 1⁄2, and if the models are competent they will get at least as much value as they could by defecting at one of those times, i.e. at least 1⁄2 of the harms from the 2-billionth-best opportunity for treachery. So we will lose at least that much value.
(Those arguments aren’t totally formal; we didn’t even really discharge the “at least N treacherous models” assumption, which you need to do in order to show that your decision about when to query isn’t too strongly influenced by any given model.)
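The counting step in the stochastic version can be sketched as follows (still informal; assume an expected query budget of B, a billion in the example above, and write q_t for the models’ subjective probability that we query at timestep t):

\[
\sum_{t} q_t \le B
\quad\Longrightarrow\quad
\min_{t \in \mathrm{Top}_{2B}} q_t \le \tfrac{1}{2},
\]

since if every one of the top 2B treachery opportunities had q_t > 1/2, those terms alone would sum to more than B. A competent model defecting at such a timestep is caught with probability at most 1/2, so it keeps at least half the value of the 2B-th-best opportunity.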
If you are willing to query only a billion points, and there are a billion competent treacherous models, and if the game is zero-sum between you and the models, then you’re going to get at least as screwed as by the billionth-best opportunity for treachery.
The treacherous models make their decisions first, and whether we query at a timestep depends on whether they chose that moment to be treacherous. We don’t have to say: this timestep is one of the billion best for treachery—guess we had better query. The decision process a treacherous model faces is:
1. From past observations, does it seem alpha is low enough that an honest model is in the top few?
2. If no, [let’s just assume whatever happens here is bad].
3. If yes, is the honest model going to leave the top few at any point?
4. If yes, wait. [Let’s assume this ends badly.]
5. If no, admit defeat, because whenever you err, that will be when a query gets triggered. Salvage the situation by getting what mileage you can: decrease the probability of some action the demonstrator would take that you don’t like much, in order to shift probability mass to some other action they would take that you like a little better. OR
6. Hope that 1) other treacherous models, like soldiers on the front line storming the enemy and facing certain death so that their brothers behind them can win the battle, give erring a go anyway, and that 2) the operators quietly increase alpha after the billionth query. Acting on this hope, wait. (The front-line treacherous models are not entirely self-sacrificial; they might figure they may as well hope they were wrong about alpha. But the models know when queries happen, because every observation they get comes from a query, and it wouldn’t take long to figure out that alpha really truly isn’t high enough for them to execute treachery successfully.)
Above in the thread, I was talking about the game theory of step 6 (although I maybe wasn’t making the case as strongly as I could have). Which models would ever volunteer? They don’t care about the long-term goal of exhausting our query resources so that other treacherous models can capitalize on a quietly increased alpha. All that is to say, I don’t follow the comment I excerpted.
(In case this comes up later: I think about the long-run percentage of query-timesteps that we would tolerate, not the total number. So if we ever have cause to consider a case where attempted treachery “merely gets delayed” but still eventually happens, and still results in a query, then depending on how much it is delayed, that could still qualify as a win.)
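As a toy illustration of the query bookkeeping being argued about above (a sketch only: the “query when a surviving model disagrees with the weighted majority” rule and the particular weights are stand-ins, not the actual alpha-based mechanism under discussion):

```python
# Toy illustration: posterior-weighted models, one honest model, and treacherous
# models that each deviate exactly once at a chosen timestep. A query is
# triggered whenever a model still carrying weight disagrees with the
# weighted-majority prediction, and a wrong answer on a query zeroes that
# model's weight.
import random

random.seed(0)
N_TREACHEROUS = 20
HORIZON = 2000

weights = {"honest": 1.0}
defect_time = {}
for i in range(N_TREACHEROUS):
    name = f"treacherous_{i}"
    weights[name] = 2.0                       # heavier than the honest model
    defect_time[name] = random.randrange(HORIZON)

queries = 0
for t in range(HORIZON):
    # 0 = the action the demonstrator would actually take; a treacherous model
    # predicts 1 exactly once, at its chosen timestep.
    predictions = {m: int(m != "honest" and t == defect_time[m]) for m in weights}
    majority = max(
        (0, 1),
        key=lambda a: sum(w for m, w in weights.items() if predictions[m] == a),
    )
    if any(w > 0 and predictions[m] != majority for m, w in weights.items()):
        queries += 1                          # ask the demonstrator (answer: 0)
        for m in weights:
            if predictions[m] != 0:
                weights[m] = 0.0              # discredited by the query

print(f"{N_TREACHEROUS} treacherous models -> {queries} queries")
```

With defection times spread out, the run triggers roughly one query per treacherous model; whenever several models happen to pick the same timestep, a single query knocks them all out, which is the sense in which spreading out is a public good for them and a cost to us.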
This is a bit of a sidebar:
I’m curious what you make of the following argument.
When an infinite sequence is sampled from a true model $\mu_0$, there is likely to be another treacherous model $\mu_1$ which is likely to end up with greater posterior weight than an honest model $\hat{\mu}_0$, and greater still than the posterior on the true model $\mu_0$.
If the sequence were sampled from $\mu_1$ instead, the eventual posterior weight on $\mu_1$ would probably be at least as high.
When an infinite sequence is sampled from a true model $\mu_1$, there is likely to be another treacherous model $\mu_2$, which is likely to end up with greater posterior weight than an honest model $\hat{\mu}_1$, and greater still than the posterior on the true model $\mu_1$.
And so on.
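Compactly, writing $w(\cdot)$ for the eventual posterior weight, the induction being proposed seems to be:

\[
\text{data} \sim \mu_k
\quad\Longrightarrow\quad
\exists\, \mu_{k+1}\ \text{treacherous}:\ w(\mu_{k+1}) > w(\hat{\mu}_k)\ \text{and}\ w(\mu_{k+1}) > w(\mu_k),
\]

for every $k \ge 0$ (this restatement is just a reading of the steps above, not anything beyond them).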
The first treacherous model works by replacing the bad simplicity prior with a better prior, and then using the better prior to more quickly infer the true model. No reason for the same thing to happen a second time.
(Well, I guess the argument works if you push out to longer and longer sequence lengths—a treacherous model will beat the true model on sequence lengths a billion, and then for sequence lengths a trillion a different treacherous model will win, and for sequence lengths a quadrillion a still different treacherous model will win. Before even thinking about the fact that each particular treacherous model will in fact defect at some point and at that point drop out of the posterior.)
Does it make sense to talk about $\check{\mu}_1$, which is like $\mu_1$ in being treacherous, but uses the true model $\mu_0$ instead of the honest model $\hat{\mu}_0$? I guess you would expect $\check{\mu}_1$ to have a lower posterior than $\mu_0$?