Hmmm … yeah, I think noting my ambiguity about ‘values’ and ‘outcome-preferences’ is good pushback — thanks for helping me catch this! I spent some time trying to work out what I think.
Ultimately, I do want to say μH has context-independent values, but not context-independent outcome preferences. I’ll try to specify this a little more.
Justification Part I: Definitions
I said that a policy has preferences over outcomes when “there are states of the world the policy finds more or less valuable … ”, but I didn’t specify what it means to find states of the world more or less “valuable”. I’ll now say that a system (dis)values some state of the world O when:
It has an explicit representation of O as a possible state of the world, and
The prospect of the system’s outputs resulting in O is computationally significant in the system’s decision-making.
So, a system has a context-independent outcome-preference for a state of the world O if the system has an outcome-preference for O across all contexts. I think reward maximization and deceptive alignment require such preferences. I’ll also define what it means to value a concept.
A system (dis)values some concept C (e.g., ‘harmlessness’) when that concept C is computationally significant in the system’s decision-making.
Concepts are not themselves states of the world (e.g., ‘dog’ is a concept, but doesn’t describe a state of the world). Instead, I think of concepts (like ‘dog’ or ‘harmlessness’) as something like a schema (or algorithm) for classifying possible inputs according to their C-ness (e.g., an algorithm for classifying possible inputs as dogs, or classifying possible inputs as involving ‘harmful’ actions).
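(If it helps, here’s a tiny toy formalization of the two definitions; every name and number is invented purely for illustration, and nothing here is meant as a claim about real model internals.)

```python
# Toy formalization of the definitions above; all names/numbers are invented.
from dataclasses import dataclass
from typing import Callable, List

# A concept, on this picture, is just a schema for scoring inputs by their C-ness
# (e.g., how 'dog'-like an image is, or how 'harmful' a candidate action is).
Concept = Callable[[str], float]

def harmfulness(candidate_action: str) -> float:
    """Crude stand-in for a learned 'harmfulness' classifier."""
    return 1.0 if "insult" in candidate_action else 0.0

# An outcome-preference, by contrast, needs (1) an explicit representation of a
# world-state O and (2) that representation being computationally significant
# in decision-making.
@dataclass
class OutcomePreference:
    outcome_description: str  # explicit representation of O
    weight: float             # how much the prospect of O matters to the decision

# A system can value the *concept* without ever constructing an
# OutcomePreference for O* = "my actions don't cause significant harm":
def generate_option_set(raw_options: List[str], concept: Concept) -> List[str]:
    # The concept shapes which options even make it into the feasible set...
    return [o for o in raw_options if concept(o) < 0.5]
    # ...but no OutcomePreference("my actions are harmless", ...) appears anywhere.
```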
With these definitions in mind, I want to say:
μH has ‘harmlessness’ as a context-independent value, because the learned concept of ‘harmlessness’ consistently shapes the policy’s behavior across a range of contexts (e.g., by influencing things like the generation of its feasible option set).
However, μH needn’t have a context-independent outcome-preference for O∗ = “my actions don’t cause significant harm”, because it may not explicitly represent O∗ as a possible state of affairs across all contexts.
For example, the ‘harmlessness’ concept could be computationally significant in shaping the feasible option set or the granularity of outcome representations, without ever explicitly representing ‘the world is in a state where my actions are harmless’ as a discrete outcome to be pursued.
I struggled to make this totally explicit, but I’ll offer a speculative picture below of how μH’s cognition might work without CP.
Justification Part II: Decision-Making Without CP
I’ll start by stealing an old diagram from the shard theory discord server (cf. cf0ster). My description is closest to the picture of Agent Design B, and I’ll make free use of ‘shards’ to refer to ‘decision-influences’.
So, here’s how μH’s cognition might look in the absence of CP:
μH takes in some input request.
E.g., suppose it receives an input from someone claiming to be a child, who is looking for help debugging her code.
The input and μH’s learned concepts together generate a mental context.
The policy’s mental context is a cognitive description of the state of the total network. In this example, μH’s mental context might be: “Human child has just given me a coding problem” (though it could ofc be more complicated).
The mental context activates a set of ‘shards’ (or decision-influences).
In this example, the policy might have a “solve coding problem” shard, and a “be considerate” shard.
Activated shards ‘bid’ for actions with certain properties.
E.g., “pro-gentle shard” influences decision-making by bringing encouraging thoughts to mind, “pro-code-solving shard” influences decision-making by generating thoughts like “check for common code error #5390”.
Bids from shards generate an initial ‘option set’: this is a set of actions that meet the properties bid for by previously activated shards.
These might be actions like “check for common error #5390, then present corrected code to the child, alongside encouraging words”, alongside considerations like “ensure response is targeted”, “ensure response is considerate”.
Mental context “I’m presented with a set of actions” activates the “planning shard”, which selects an action based on contextually-generated considerations.
E.g., plans might be assessed against some kind of (weighted) vote count of activated shards.
The weighted vote count generates preferences over the salient outcomes caused by actions in the set.
μH performs the action.
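To make this slightly more concrete, here’s a very rough toy sketch of how the steps above might fit together; every specific (shard triggers, bids, weights, options) is made up for illustration, and it’s a sketch of the causal story rather than a claim about actual AGI internals.

```python
# Toy sketch of Steps 1-7; all shard triggers, bids, and weights are invented.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Shard:
    name: str
    trigger: Callable[[str], bool]   # does this mental context activate me? (Step 3)
    bid: Callable[[str], List[str]]  # properties/considerations I bid for   (Step 4)
    weight: float                    # my contextual influence on the vote   (Step 6)

def build_mental_context(user_input: str) -> str:
    # Step 2: input + learned concepts -> coarse description of the situation.
    return "child asked for coding help" if "child" in user_input else "someone asked for coding help"

def run_policy(user_input: str, shards: List[Shard]) -> str:
    context = build_mental_context(user_input)                    # Step 2
    active = [s for s in shards if s.trigger(context)]            # Step 3
    bids = [p for s in active for p in s.bid(context)]            # Step 4
    # Step 5: option set assembled from the bid-for properties (plus a lazier
    # baseline option, just to give the vote something to do).
    options = ["reply with " + " and ".join(bids), "just dump the raw diff"]
    # Step 6: 'planning shard' scores options by a weighted vote of active shards.
    def score(option: str) -> float:
        return sum(s.weight for s in active
                   if any(p in option for p in s.bid(context)))
    return max(options, key=score)                                # Step 7: act

shards = [
    Shard("solve-coding-problem", lambda c: "coding" in c,
          lambda c: ["corrected code"], 1.0),
    Shard("be-considerate", lambda c: "child" in c,
          lambda c: ["encouragement"], 0.8),
]
print(run_policy("a child asked me to debug her code", shards))
```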
I don’t want to say “future AGI cognition will be well-modeled using Steps 1-7”. And there’s still a fair amount of imprecision in the picture I suggest. Still, I do think it’s a coherent picture of how the learned concept ‘harmlessness’ consistently plays a causal role in μH’s behavior, without assuming consequentialist preferences.
(I expect you’ll still have some issues with this picture, but I can’t currently predict why/how)
Thanks! Once again this is great. I think it’s really valuable for people to start theorizing/hypothesizing about what the internal structure of AGI cognition (and human cognition!) might be like at this level of specificity.
Thinking step by step:
My initial concern is that there might be a bit of a dilemma: Either (a) the cognition is in-all-or-most-contexts-thinking-about-future-world-states-in-which-harm-doesn’t-happen in some sense, or (b) it isn’t fair to describe it as valuing harmlessness. Let me look more closely at what you said and see if this holds up.
However, μH needn’t have a context-independent outcome-preference for O∗ = “my actions don’t cause significant harm”, because it may not explicitly represent O∗ as a possible state of affairs across all contexts.
For example, the ‘harmlessness’ concept could be computationally significant in shaping the feasible option set or the granularity of outcome representations, without ever explicitly representing ‘the world is in a state where my actions are harmless’ as a discrete outcome to be pursued.
In the example, the ‘harmlessness’ concept shapes the feasible option set, let’s say. But I feel like there isn’t an important difference between ‘concept X is applied to a set of options to prune away some of them that trigger concept X too much (or not enough)’ and ‘concept X is applied to the option-generating machinery in such a way that reliably ensures that no options that trigger concept X too much (or not enough) will be generated.’ Either way, it seems like it’s fair to say that the system (dis)prefers X. And when X is inherently about some future state of the world—such as whether or not harm has occurred—then it seems like something consequentialist is happening. At least that’s how it seems to me.

Maybe it’s not helpful to argue about how to apply words—whether the above is ‘fair to say’, for example—and more fruitful to ask: What is your training goal? Presented with a training goal (“This should be a mechanistic description of the desired model that explains how you want it to work—e.g. “classify cats using human vision heuristics”—not just what you want it to do—e.g. “classify cats.”), we can then argue about the training rationale (i.e. whether the training environment will result in the training goal being achieved).
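(Here’s a toy sketch of the pruning-vs-generation equivalence I have in mind; the ‘harm’ check and the candidate options are invented, and the only point is that the surviving option set is the same either way.)

```python
# Toy illustration: pruning options with concept X after generation vs. building
# concept X into the generator itself. The surviving option set is identical.
from typing import Callable, Iterable, List

def triggers_harm(option: str) -> bool:
    return "mock the user" in option          # stand-in for the learned concept X

def generate_then_prune(candidates: Iterable[str],
                        concept: Callable[[str], bool]) -> List[str]:
    options = list(candidates)                       # generate everything first...
    return [o for o in options if not concept(o)]    # ...then prune with concept X

def generate_with_concept_inside(candidates: Iterable[str],
                                 concept: Callable[[str], bool]) -> List[str]:
    options: List[str] = []
    for o in candidates:                      # concept X sits inside the option-
        if not concept(o):                    # generating machinery, so 'harmful'
            options.append(o)                 # options are never emitted at all
    return options

candidates = ["fix the bug and explain gently", "mock the user for the bug"]
assert generate_then_prune(candidates, triggers_harm) == \
       generate_with_concept_inside(candidates, triggers_harm)
```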
You’ve said a decent amount about this already—your ‘training goal’ so to speak is a system which may frequently think about the consequences of its actions and choose actions on that basis, but for which the ‘final goals’ / ‘utility function’ / ‘preferences’ it uses to pick actions are not context-independent but rather highly context-dependent. It’s thus not a coherent agent, so to speak; it’s not consistently pushing the world in any particular direction on purpose, but rather flitting from goal to goal depending on the situation—and the part of it that determines which goal to flit to is NOT itself well-described as goal-directed, but rather something more like a look-up table that has been shaped by experience to result in decent performance. (Or maybe you’d say it might indeed look goal-directed, but only for myopic goals, i.e. just focused on performance in a particular limited episode?)
(And thus, you go on to argue, it won’t result in deceptive alignment or reward-seeking behavior. Right?)
I fear I may be misunderstanding you so if you want to clarify what I got wrong about the above that would be helpful!