By contrast, it at least does not seem obvious that μH needs to encode well-specified outcome-preferences that motivate its responses across episodes. Our HHH-assistant μH will, given some input, need to possess situation-relative preferences-over-outcomes — these might include (say) prompt-induced goals to perform a certain financial trade, or even longer-term goals to help a company remain profitable. Still, such ‘goals’ may emerge in a purely prompt-dependent manner, without the policy pursuing local goals in virtue of its underlying consequentialist preferences.
Isn’t “Harmlessness” an example of CP? If the model is truly Harmless, that means it is thinking about how to avoid causing harm to people, and that this thinking isn’t limited to specific prompts but rather is baked into its behavior more generally.
I don’t think so. Suppose Alex is an AI in training, and Alex endorses the value of behaving “harmlessly”. Then, I think the following claims are true of Alex:
Alex consistently cares about producing actions that meet a given criterion. So, Alex has some context-independent values.
On plausible operationalizations of ‘harmlessness’, Alex is also likely to possess, at given points in time, context-dependent, beyond-episode outcome-preferences. When Alex considers which actions to take (based on harmlessness), their actions are (in part) determined by what states of the world are likely to arise after their current training episode is over.
That said, I don’t think Alex needs to have consequentialist preferences. There doesn’t need to be some specific state of the world that they’re pursuing at all points in time.
To elaborate: this view says that “harmlessness” acts as something akin to a context-independent filter over possible (trajectory, outcome) pairs. Given some instruction, at a given point in time, Alex forms some context-dependent outcome-preferences.
That is, one action-selection criterion might be ‘choose an action which best satisfies my consequentialist preferences’. Another might be: ‘follow instructions, given (e.g.) harmlessness constraints’.
The latter criterion can be context-independent, while only generating ‘consequentialist preferences’ in a context-dependent manner.
So, when Alex isn’t provided with instructions, they needn’t be well-modeled as possessing any outcome-preferences. I don’t think that a model which meets a minimal form of behavioral consistency (e.g., consistently avoiding harmful outputs) is enough to get you consequentialist preferences.
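To make that distinction a bit more tangible, here’s a toy Python sketch. Everything in it (the keyword ‘classifier’, the relevance function) is invented purely for illustration and isn’t meant as a claim about how any real policy is implemented; the point is just that the selection *rule* is fixed across contexts, while outcome-preferences only get instantiated when the prompt supplies a goal.

```python
from typing import List, Optional

HARM_KEYWORDS = {"weapon", "exploit", "poison"}  # crude stand-in for a learned harmlessness classifier


def is_harmless(action: str) -> bool:
    """Context-independent filter: a toy proxy for the learned 'harmlessness' concept."""
    return not any(word in action for word in HARM_KEYWORDS)


def relevance(action: str, instruction: str) -> int:
    """Toy stand-in for 'how well does this action serve the prompted goal?'"""
    return len(set(action.split()) & set(instruction.split()))


def select_action(candidate_actions: List[str], instruction: Optional[str]) -> Optional[str]:
    """The *criterion* ('follow instructions, subject to harmlessness') is context-independent."""
    safe = [a for a in candidate_actions if is_harmless(a)]
    if instruction is None:
        # No instruction: no outcome-preference gets instantiated at all.
        return safe[0] if safe else None
    # Only here does a (context-dependent) preference over outcomes appear,
    # derived from the prompt rather than from a standing goal.
    ranked = sorted(safe, key=lambda a: -relevance(a, instruction))
    return ranked[0] if ranked else None
```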
The Preference Assumption: By default, AI training will result in policies endogenously forming context-independent, beyond-episode outcome-preferences.
Now you are saying that if Alex does end up Harmless as we hoped, it will have context-independent values, and also context-dependent, beyond-episode outcome-preferences, but it won’t have context-independent, beyond-episode outcome-preferences? It won’t have “some specific state of the world” that it’s pursuing at all points in time?
First of all, I didn’t think CP depended on there being a specific state of the world you were aiming for. (What does that mean, anyway?) It just meant you had some context-independent, beyond-episode outcome-preferences (and that you plan towards them). Seems to me that ‘harmlessness’ = ‘my actions don’t cause significant harm’ (which is an outcome-preference not limited to the current episode), and that this is also context-independent, because it is baked into Alex via lots of training rather than just being something Alex sees in a prompt sometime.
I have other, bigger objections to your arguments, but this one is the one that’s easiest to express right now. Thanks for writing this post btw; it seems to me to be a more serious and high-quality critique of the orthodox view than e.g. Quintin & Nora’s stuff.
Hmmm … yeah, I think noting my ambiguity about ‘values’ and ‘outcome-preferences’ is good pushback — thanks for helping me catch this! I spent some time trying to work out what I think.
Ultimately, I do want to say μH has context-independent values, but not context-independent outcome preferences. I’ll try to specify this a little more.
Justification Part I: Definitions
I said that a policy has preferences over outcomes when “there are states of the world the policy finds more or less valuable … ”, but I didn’t specify what it means to find states of the world more or less “valuable”. I’ll now say that a system (dis)values some state of the world O when:
It has an explicit representation of O as a possible state of the world, and
The prospect of the system’s outputs resulting in O is computationally significant in the system’s decision-making.
So, a system has a context-independent outcome-preference for a state of the world O if the system has an outcome-preference for O across all contexts. I think reward maximization and deceptive alignment require such preferences. I’ll also define what it means to value a concept.
A system (dis)values some concept C (e.g., ‘harmlessness’) when that concept C is computationally significant in the system’s decision-making.
Concepts are not themselves states of the world (e.g., ‘dog’ is a concept, but doesn’t describe a state of the world). Instead, I think of concepts (like ‘dog’ or ‘harmlessness’) as something like a schema (or algorithm) for classifying possible inputs according to their C-ness (e.g., an algorithm for classifying possible inputs as dogs, or classifying possible inputs as involving ‘harmful’ actions).
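To make these two definitions a bit more concrete, here’s a minimal sketch (the class names and the toy classifier are mine, purely illustrative, and not anything from the post):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OutcomePreference:
    """Valuing a *state of the world* O: O is explicitly represented, and the prospect
    of bringing O about carries weight in the system's decision-making."""
    outcome_description: str   # explicit representation of O
    decision_weight: float     # how strongly the prospect of O shapes choices


@dataclass
class ValuedConcept:
    """Valuing a *concept* C: a classifier over inputs (scoring their 'C-ness') that is
    computationally significant in decision-making, without any particular world-state
    being represented or pursued."""
    name: str
    classify: Callable[[str], float]  # maps an input to a C-ness score


harmlessness = ValuedConcept(
    name="harmlessness",
    classify=lambda text: 0.0 if "harm" in text else 1.0,  # toy classifier
)
```

On this (toy) framing, a system can ‘value’ harmlessness, i.e. have the classifier be computationally significant, without any OutcomePreference over a harm-free world-state existing anywhere in it.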
With these definitions in mind, I want to say:
μH has ‘harmlessness’ as a context-independent value, because the learned concept of ‘harmlessness’ consistently shapes the policy’s behavior across a range of contexts (e.g., by influencing things like the generation of its feasible option set).
However, μH needn’t have a context-independent outcome-preference for O∗ = “my actions don’t cause significant harm”, because it may not explicitly represent O∗ as a possible state of affairs across all contexts.
For example, the ‘harmlessness’ concept could be computationally significant in shaping the feasible option set or the granularity of outcome representations, without ever explicitly representing ‘the world is in a state where my actions are harmless’ as a discrete outcome to be pursued.
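A toy contrast, just to illustrate what I mean (the helpers predict_world_state and matches_no_harm_outcome are hypothetical stand-ins, not proposed internals): in the first function the concept only gates which options get considered; in the second, a ‘no harm occurred’ world-state is explicitly represented and optimized for.

```python
from typing import Callable, List


def filter_shaped_choice(options: List[str], harmlessness: Callable[[str], float]) -> str:
    """'Harmlessness' only shapes the feasible option set; no world-state O* is represented."""
    feasible = [o for o in options if harmlessness(o) > 0.5]
    return feasible[0]  # chosen among the pre-filtered options by whatever other criteria apply


def outcome_preference_choice(
    options: List[str],
    predict_world_state: Callable[[str], str],
    matches_no_harm_outcome: Callable[[str], float],
) -> str:
    """Consequentialist version: each option is scored by whether its *predicted future
    world-state* matches an explicitly represented 'my actions caused no harm' outcome."""
    return max(options, key=lambda o: matches_no_harm_outcome(predict_world_state(o)))
```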
I struggled to make this totally explicit, but I’ll offer a speculative sketch below of how μH’s cognition might work without CP.
Justification Part II: Decision-Making Without CP
I’ll start by stealing an old diagram from the shard theory discord server (cf. cf0ster). My description is closest to the picture of Agent Design B, and I’ll make free use of ‘shards’ to refer to ‘decision-influences’.
So, here’s how μH’s cognition might look in the absence of CP:
μH takes in some input request.
E.g., suppose it receives an input from someone claiming to be a child, who is looking for help debugging her code.
Together, the input and μH’s learned concepts generate a mental context.
The policy’s mental context is a cognitive description of the state of the total network. In this example, μH’s mental context might be: “Human child has just given me a coding problem” (though it could ofc be more complicated).
The mental context activates a set of ‘shards’ (or decision-influences).
In this example, the policy might have a “solve coding problem” shard, and a “be considerate” shard.
Activated shards ‘bid’ for actions with certain properties.
E.g., “pro-gentle shard” influences decision-making by bringing encouraging thoughts to mind, “pro-code-solving shard” influences decision-making by generating thoughts like “check for common code error #5390”.
Bids from shards generate an initial ‘option set’: this is a set of actions that meet the properties bid for by previously activated shards.
These might be actions like “check for common error #5390, then present corrected code to the child, alongside encouraging words”, together with considerations like “ensure the response is targeted” and “ensure the response is considerate”.
The mental context “I’m presented with a set of actions” activates the “planning shard”, which selects an action based on contextually-generated considerations.
E.g., plans might be assessed against some kind of (weighted) vote count of activated shards.
The weighted vote count generates preferences over the salient outcomes caused by actions in the set.
μH performs the action.
I don’t want to say “future AGI cognition will be well-modeled using Steps 1-7”. And there’s still a fair amount of imprecision in the picture I suggest. Still, I do think it’s a coherent picture of how the learned concept ‘harmlessness’ consistently plays a causal role in μH’s behavior, without assuming consequentialist preferences.
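To make steps 1-7 slightly more concrete, here’s a minimal toy rendering of the pipeline in Python. Everything in it (the shard names, the weights, the keyword-matching ‘classifiers’) is invented purely for illustration; it’s a sketch of the shape of the computation, not a claim about actual model internals.

```python
from typing import Callable, Dict, List

# A 'shard' here is just a named decision-influence: a function scoring candidate
# actions by how well they match the properties that shard bids for.
Shard = Callable[[str], float]

SHARDS: Dict[str, Shard] = {
    "be_considerate": lambda a: 1.0 if "encouraging" in a else 0.0,
    "solve_coding_problem": lambda a: 1.0 if "check error" in a else 0.0,
    "harmlessness": lambda a: 0.0 if "harmful" in a else 1.0,  # toy stand-in for the learned concept
}


def build_mental_context(user_input: str) -> str:
    """Steps 1-2: the input plus learned concepts generate a mental context (toy version)."""
    return "child asked for coding help" if "child" in user_input else "generic request"


def activate_shards(mental_context: str) -> List[str]:
    """Step 3: the mental context activates a subset of shards."""
    if mental_context == "child asked for coding help":
        return ["be_considerate", "solve_coding_problem", "harmlessness"]
    return ["harmlessness"]


def generate_option_set() -> List[str]:
    """Steps 4-5: in a real system the candidates would come from the active shards' bids;
    they're hardcoded here for brevity. The 'harmlessness' concept shapes which candidates
    enter the feasible option set at all, rather than being a goal-state to pursue."""
    candidates = [
        "check error #5390, return corrected code with encouraging words",
        "return corrected code only",
        "mock the user (harmful)",
    ]
    return [c for c in candidates if SHARDS["harmlessness"](c) > 0.5]


def planning_shard(options: List[str], active: List[str]) -> str:
    """Steps 6-7: a weighted vote count over the activated shards selects the action."""
    weights = {"be_considerate": 1.0, "solve_coding_problem": 2.0, "harmlessness": 1.0}
    return max(options, key=lambda a: sum(weights[s] * SHARDS[s](a) for s in active))


def respond(user_input: str) -> str:
    """The whole pipeline, steps 1-7, end to end."""
    context = build_mental_context(user_input)
    active = activate_shards(context)
    options = generate_option_set()
    return planning_shard(options, active)


print(respond("I'm a child and my code doesn't work"))
```

Note that, in this sketch, ‘harmlessness’ enters only as a filter on the option set and as one vote among others; at no point is a ‘harm-free world-state’ represented as an outcome to be pursued.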
(I expect you’ll still have some issues with this picture, but I can’t currently predict why/how)
Thanks! Once again this is great. I think it’s really valuable for people to start theorizing/hypothesizing about what the internal structure of AGI cognition (and human cognition!) might be like at this level of specificity.
Thinking step by step:
My initial concern is that there might be a bit of a dilemma: Either (a) the cognition is in-all-or-most-contexts-thinking-about-future-world-states-in-which-harm-doesn’t-happen in some sense, or (b) it isn’t fair to describe it as harmlessness. Let me look more closely at what you said and see if this holds up.
However, μH needn’t have a context-independent outcome-preference for O∗ = “my actions don’t cause significant harm”, because it may not explicitly represent O∗ as a possible state of affairs across all contexts.
For example, the ‘harmlessness’ concept could be computationally significant in shaping the feasible option set or the granularity of outcome representations, without ever explicitly representing ‘the world is in a state where my actions are harmless’ as a discrete outcome to be pursued.
In the example, the ‘harmlessness’ concept shapes the feasible option set, let’s say. But I feel like there isn’t an important difference between ‘concept X is applied to a set of options to prune away some of them that trigger concept X too much (or not enough)’ and ‘concept X is applied to the option-generating machinery in such a way that reliably ensures that no options that trigger concept X too much (or not enough) will be generated’. Either way, it seems like it’s fair to say that the system (dis)prefers X. And when X is inherently about some future state of the world—such as whether or not harm has occurred—then it seems like something consequentialist is happening. At least that’s how it seems to me.

Maybe it’s not helpful to argue about how to apply words—whether the above is ‘fair to say’, for example—and more fruitful to ask: What is your training goal? Presented with a training goal (“This should be a mechanistic description of the desired model that explains how you want it to work—e.g. ‘classify cats using human vision heuristics’—not just what you want it to do—e.g. ‘classify cats’.”), we can then argue about the training rationale (i.e. whether the training environment will result in the training goal being achieved).
You’ve said a decent amount about this already—your ‘training goal’, so to speak, is a system which may frequently think about the consequences of its actions and choose actions on that basis, but for which the ‘final goals’ / ‘utility function’ / ‘preferences’ it uses to pick actions are not context-independent but rather highly context-dependent. It’s thus not a coherent agent, so to speak; it’s not consistently pushing the world in any particular direction on purpose, but rather flitting from goal to goal depending on the situation—and the part of it that determines what goal to flit to is NOT itself well-described as goal-directed, but rather something more like a look-up table that has been shaped by experience to result in decent performance. (Or maybe you’d say it might indeed look goal-directed, but only for myopic goals, i.e. just focused on performance in a particular limited episode?)
(And thus, you go on to argue, it won’t result in deceptive alignment or reward-seeking behavior. Right?)
I fear I may be misunderstanding you so if you want to clarify what I got wrong about the above that would be helpful!