Thanks! Once again this is great. I think it’s really valuable for people to start theorizing/hypothesizing about what the internal structure of AGI cognition (and human cognition!) might be like at this level of specificity.
Thinking step by step:
My initial concern is that there might be a bit of a dilemma: Either (a) the cognition is in-all-or-most-contexts-thinking-about-future-world-states-in-which-harm-doesn’t-happen in some sense, or (b) it isn’t fair to describe it as harmlessness. Let me look more closely at what you said and see if this holds up.
> However, μH needn’t have a context-independent outcome-preference for O∗ = “my actions don’t cause significant harm”, because it may not explicitly represent O∗ as a possible state of affairs across all contexts.
>
> For example, the ‘harmlessness’ concept could be computationally significant in shaping the feasible option set or the granularity of outcome representations, without ever explicitly representing ‘the world is in a state where my actions are harmless’ as a discrete outcome to be pursued.
In the example, the ‘harmlessness’ concept shapes the feasible option set, let’s say. But I feel like there isn’t an important difference between ‘concept X is applied to a set of already-generated options to prune away those that trigger concept X too much (or not enough)’ and ‘concept X is applied to the option-generating machinery in such a way that options triggering concept X too much (or not enough) reliably never get generated.’ Either way, it seems fair to say that the system (dis)prefers X. And when X is inherently about some future state of the world—such as whether or not harm has occurred—it seems like something consequentialist is happening. At least, that’s how it seems to me.

Maybe it’s not helpful to argue about how to apply words—whether the above is ‘fair to say’, for example—and more fruitful to ask: What is your training goal? (“This should be a mechanistic description of the desired model that explains how you want it to work—e.g. ‘classify cats using human vision heuristics’—not just what you want it to do—e.g. ‘classify cats.’”) Presented with a training goal, we can then argue about the training rationale (i.e., whether the training environment will actually result in the training goal being achieved).
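To make the claimed equivalence concrete, here is a minimal Python sketch (all names, the candidate actions, the harm scores, and the threshold are illustrative assumptions, not anything from the post): filtering generated options by a harm score versus constraining the generator so high-harm options never appear yield the same feasible set, so at the behavioral level the system looks like it disprefers harmful options either way.

```python
# Hypothetical sketch: two ways a 'harmlessness' concept could shape options.
# The candidates, harm scores, and threshold below are made-up illustrations.

CANDIDATES = ["comply", "refuse", "deceive", "sabotage"]
HARM = {"comply": 0.1, "refuse": 0.0, "deceive": 0.7, "sabotage": 0.9}
THRESHOLD = 0.5

def prune_after_generation():
    """Concept X applied to an already-generated option set:
    filter out options that trigger the harm concept too strongly."""
    options = list(CANDIDATES)  # generate everything first
    return [o for o in options if HARM[o] <= THRESHOLD]

def constrained_generation():
    """Concept X applied to the option-generating machinery:
    high-harm options are simply never produced."""
    return [o for o in CANDIDATES if HARM[o] <= THRESHOLD]

# Both mechanisms yield the same feasible set, so from the outside the
# system appears to disprefer harmful options either way.
assert prune_after_generation() == constrained_generation()
print(prune_after_generation())  # ['comply', 'refuse']
```

The point of the sketch is only that the two mechanisms are extensionally equivalent; where in the pipeline the harm concept does its work doesn't change which options survive.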
You’ve said a decent amount about this already—your ‘training goal’, so to speak, is a system which may frequently think about the consequences of its actions and choose actions on that basis, but for which the ‘final goals’ / ‘utility function’ / ‘preferences’ it uses to pick actions are not context-independent but rather highly context-dependent. It’s thus not a coherent agent, so to speak; it’s not consistently pushing the world in any particular direction on purpose, but rather flitting from goal to goal depending on the situation—and the part of it that determines which goal to flit to is NOT itself well-described as goal-directed, but rather something more like a look-up table that has been shaped by experience to result in decent performance. (Or maybe you’d say it might indeed look goal-directed, but only for myopic goals, i.e. focused just on performance within a particular limited episode?)
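A toy sketch of the picture described above, under loose assumptions (the context-to-goal table, actions, and outcome scores are all hypothetical): the mapping from context to local goal is a plain lookup table that does no means-end reasoning itself, and consequentialist evaluation happens only relative to whichever local goal the current context selects.

```python
# Hypothetical sketch of a context-dependent, non-coherent agent.
# The table, actions, and scores are illustrative assumptions only.

# A mapping from context to a local goal, shaped by experience;
# the mapping itself is just a table, not a goal-directed process.
CONTEXT_TO_GOAL = {
    "coding_task": "maximize_tests_passed",
    "chat": "be_helpful",
}

def predicted_outcome_score(action, goal):
    """Stand-in for consequence prediction relative to the *local* goal."""
    scores = {
        ("write_tests", "maximize_tests_passed"): 0.9,
        ("chitchat", "maximize_tests_passed"): 0.1,
        ("write_tests", "be_helpful"): 0.4,
        ("chitchat", "be_helpful"): 0.8,
    }
    return scores[(action, goal)]

def act(context, actions):
    goal = CONTEXT_TO_GOAL[context]  # the goal 'flits' with the context
    # Consequentialist step: pick the action whose predicted outcome
    # best serves the currently active local goal.
    return max(actions, key=lambda a: predicted_outcome_score(a, goal))

actions = ["write_tests", "chitchat"]
assert act("coding_task", actions) == "write_tests"
assert act("chat", actions) == "chitchat"
```

On this sketch the agent is locally consequentialist within each context but has no context-independent direction it pushes the world in, which is the sense in which it fails to be a coherent agent.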
(And thus, you go on to argue, it won’t result in deceptive alignment or reward-seeking behavior. Right?)
I fear I may be misunderstanding you so if you want to clarify what I got wrong about the above that would be helpful!