michaelcohen comments on Threat Model Literature Review

michaelcohen 2 Nov 2022 21:26 UTC
LW: 3 AF: 2
Thank you for this review! A few comments on the weaknesses of my paper.
In particular, it explicitly says the argument does not apply to supervised learning.
Hardly a weakness if supervised learning is unlikely to be an existential threat!
Strength: Does not make very concrete assumptions about the AGI development model.
Weakness: Does not talk much about how AGI is likely to be developed, unclear which of the assumptions are more/less likely to hold for AGI being developed using the current ML paradigm.
The fact that the argument holds equally well no matter what kind of function approximation is used to do inference is, I think, a strength of the argument. It’s hard to know what future inference algorithms will look like, although I do think there is a good chance that they will look a lot like current ML. And it’s very important that the argument doesn’t lump together algorithms where outputs are selected to imitate a target (imitation learners / supervised learners) vs. algorithms where outputs are selected to accomplish a long-term goal. These are totally different algorithms, so analyses of their behavior should absolutely be done separately. The claim “we can analyze imitation learners imitating humans together with RL agents, because both times we could end up with intelligent agents” strikes me as just as suspect as the claim “we can analyze the k-means algorithm together with a vehicle routing algorithm, because both will give us a partition over a set of elements.” (The claim “we can analyze imitation learners alongside the world-model of a model-based RL agent” is much more reasonable, since these are both instances of supervised learning.)
Assumes the agent will be aiming to maximize reward without justification, i.e. why does it not have other motivations, perhaps due to misgeneralizing about its goal?
Depending on the meaning of “aiming to maximize reward”, I have two different responses. In one sense, I claim “aiming to maximize reward” would be the nature of a policy that performs sufficiently strongly according to the RL objective. (And aiming to maximize inferred utility would be the nature of a policy that performs sufficiently strongly according to the CIRL objective.) But yes, even though I claim this simple position stands, a longer discussion would help establish that.
There’s another sense in which you can say that an agent that has a huge inductive bias in favor of $μ^{dist}$, and so violates Assumption 3, is not aiming to maximize reward. So the argument accounts for this possibility. Better yet, it provides a framework for figuring out when we can expect it! See, for example, my comment in the paper that I think an arbitrarily advanced RL chess player would probably violate Assumption 3. I prefer the terminology that says this chess player is aiming to maximize reward, but is dead sure winning at chess is necessary for maximizing reward. But if these are the sort of cases you mean to point to when you suggest the possibility of an agent “not maximizing reward”, I do account for those cases.
Arguments made in the paper for why an agent intervening in the reward would have catastrophic consequences are somewhat brief/weak.
Are there not always positive returns to energy/resource usage when it comes to maximizing the probability that the state of a machine continues to have a certain property (i.e. reward successfully controlled)? And our continued survival definitely requires some energy/resources. To be clear, catastrophic consequences follow from an advanced agent intervening in the provision of reward in the way that would be worth doing. Catastrophic consequences definitely don’t follow from a half-hearted and temporary intervention in the provision of reward.
- zac_kenton 3 Nov 2022 13:43 UTC
  LW: 2 AF: 2
  Thanks for the comment Michael. Firstly, just wanted to clarify the framing of this literature review—when considering strengths and weaknesses of each threat model, this was done in light of what we were aiming to do: generate and prioritise alignment research projects—rather than as an all-things-considered direct critique of each work (I think that is best done by commenting directly on those articles etc). I’ll add a clarification of that at the top. Now to your comments:
  
  To your 1st point: I think the lack of specific assumptions about the AGI development model is both a strength and a weakness. Regarding the weakness, we mention it because it makes it harder to generate and prioritize research projects. It could be more helpful to say more explicitly, or earlier in the article, what kind of systems you’re considering, perhaps pointing to the closest current prosaic system, or explaining why current systems are nothing like what you imagine the AGI development model is like.
  
  On your 2nd point: What I meant was more “what about goal misgeneralization? Wouldn’t that mean the agent is likely to not be wireheading, and pursuing some other goal instead?” You hint at this at the end of the section on supervised learning, but that was in the context of whether a supervised learner would develop a misgeneralized long-term goal, and you settled on being agnostic there.
  
  On your 3rd point: It could have been interesting to read arguments for why it would need all available energy to secure its computer, rather than satisficing at some level. Or some detail on the steps for how it builds the technology to gather the energy, or how it would convert that into defence.
  - michaelcohen 4 Nov 2022 12:17 UTC
    LW: 1 AF: 1
    On the 2nd point, the whole discussion of $μ^{prox}$ vs. $μ^{dist}$ is fundamentally about goal (mis)generalization. My position is that for a very advanced agent, point estimates of the goal (i.e. certainty that some given account of the goal is correct) would probably really limit performance in many contexts. This is captured by Assumptions 2 and 3. An advanced agent is likely to entertain multiple models of what its current understanding of its goal in a familiar context implies about its goal in a novel context. Full conviction in $μ^{dist}$ does indeed imply non-wireheading behavior, and I wouldn’t even call it misgeneralization; I think that would be a perfectly valid interpretation of past rewards. So that’s why I spend so much time discussing relative credence in those models.
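
The point about relative credence in the reply above can be illustrated with a toy Bayesian update. This is only a sketch under stated assumptions, not the paper's formalism: the labels `mu_prox` and `mu_dist`, the `posterior` helper, and all the numbers are hypothetical, and the sketch assumes the two models have explained every past reward equally well. Under that assumption the evidence cannot separate them, so the agent's credence is set entirely by its prior (its inductive bias).

```python
def posterior(prior, likelihoods):
    """Bayes' rule over a finite set of hypotheses (hypothetical helper)."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: weight / total for h, weight in unnormalized.items()}


# Assumption: both reward models fit all past rewards equally well,
# because they only come apart in situations not yet encountered.
likelihoods = {"mu_prox": 1.0, "mu_dist": 1.0}

# An agent with no strong bias keeps meaningful credence in both models.
open_minded = posterior({"mu_prox": 0.5, "mu_dist": 0.5}, likelihoods)

# An agent with a huge inductive bias toward mu_dist stays near-certain
# of it, however much equally-explained reward data it has seen.
strong_bias = posterior({"mu_prox": 1e-9, "mu_dist": 1 - 1e-9}, likelihoods)

print(open_minded)
print(strong_bias)
```

In this toy setup, the second agent corresponds loosely to the case described earlier in the thread as a huge inductive bias in favor of $μ^{dist}$; nothing in the observed rewards contradicts that conviction, which matches the remark that it would be a valid interpretation of past rewards.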