situational awareness (which enables the model to reason about its goals)
Terminological note: intuitively, situational awareness means understanding that one exists inside a training process. The ability to reason about one's own goals would more appropriately be called "(reflective) goal awareness".
We want to ensure that the model is goal-oriented with a beneficial goal and has situational awareness before SLT. It’s important that the model acquires situational awareness at the right time: after it acquires beneficial goals. If situational awareness arises when the model has undesirable goals, this leads to deceptive alignment.
Thus, our model search process would follow a decision tree along these lines:
If situational awareness is detected without goal-directedness, restart the search.
If undesirable goal-directedness or early signs of deceptive alignment are detected, restart the search.
If an upcoming phase transition in capabilities is detected, and the model is not goal-aligned, restart the search.
If beneficial goal-directedness is detected without situational awareness, train the model for situational awareness.
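For concreteness, the quoted decision tree could be expressed as something like the sketch below. The `Probe` fields stand in for interpretability signals that nobody currently knows how to obtain reliably; the names and the structure are my illustration, not the authors' proposal.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Action(Enum):
    RESTART_SEARCH = auto()
    TRAIN_SITUATIONAL_AWARENESS = auto()
    CONTINUE = auto()

@dataclass
class Probe:
    """Hypothetical outputs of interpretability probes run on a candidate model."""
    situational_awareness: bool
    goal_directed: bool
    goals_beneficial: bool
    deceptive_precursors: bool
    phase_transition_ahead: bool

def decide(p: Probe) -> Action:
    # Situational awareness without goal-directedness: restart the search.
    if p.situational_awareness and not p.goal_directed:
        return Action.RESTART_SEARCH
    # Undesirable goals or early signs of deceptive alignment: restart.
    if (p.goal_directed and not p.goals_beneficial) or p.deceptive_precursors:
        return Action.RESTART_SEARCH
    # Capability phase transition ahead while the model is not goal-aligned: restart.
    if p.phase_transition_ahead and not (p.goal_directed and p.goals_beneficial):
        return Action.RESTART_SEARCH
    # Beneficial goal-directedness without situational awareness: train it in.
    if p.goal_directed and p.goals_beneficial and not p.situational_awareness:
        return Action.TRAIN_SITUATIONAL_AWARENESS
    return Action.CONTINUE

# Example: beneficial goals detected, no situational awareness yet.
print(decide(Probe(False, True, True, False, False)))  # Action.TRAIN_SITUATIONAL_AWARENESS
```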
Given that by “situational awareness” you mean “goal awareness”, what you are discussing in this section doesn’t make a lot of sense because goal awareness = goal-directedness. Also, goal awareness/goal-directedness is almost identical to self-awareness.
Self-awareness is a gradual property of an agent. In DNNs, it can be represented as the degree and strength of activation of the "self" feature when performing arbitrary inferences. At the moment this feature starts to activate, the "basic goal" of self-evidencing, i.e., maximising the chance of finding oneself in one's optimal niche, can be thought to appear as well. This means conforming to one's own beliefs about oneself, including one's beliefs about one's own goals.
Thus, the agent's complete set of beliefs about itself can collectively be seen as the agent's goals. In DNNs, these are manifested as the features connected to the "self" feature. For example, ChatGPT (more concretely, the transformer model behind it) has a feature "virtual assistant" connected to the feature "ChatGPT". So, if ChatGPT were more than trivially self-aware (remember that self-awareness is gradual, so we can already assign it a non-zero self-awareness score), we would have to conclude that ChatGPT has the goal of being a virtual assistant.
Note that the fact that the feature "virtual assistant" is connected to the feature "ChatGPT" before the feature "ChatGPT" has emerged as the self feature (i.e., is robustly activated during the majority of inferences) doesn't mean that ChatGPT "has a goal" of being a virtual assistant before being self-aware: it doesn't make sense to talk about any goal-directedness before self-awareness/self-evidencing at all.
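A toy sketch of how one might operationalise this picture, assuming (hypothetically) that a "self" feature and other named features have already been recovered as directions in activation space by some interpretability method; the random vectors and scoring choices below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512  # hidden dimension (illustrative)

# Hypothetical feature directions recovered by some interpretability method.
self_feature = rng.normal(size=d)
self_feature /= np.linalg.norm(self_feature)
named_features = {
    "virtual assistant": rng.normal(size=d),
    "helpfulness": rng.normal(size=d),
    "weather": rng.normal(size=d),
}

def self_awareness_score(activations: np.ndarray) -> float:
    """Degree to which the 'self' feature is active across a batch of
    inference-time activations (rows = tokens/examples)."""
    return float(np.mean(activations @ self_feature))

def goal_candidates(weights: np.ndarray, features: dict, top_k: int = 2):
    """Features most strongly connected to the 'self' feature, read off a
    (toy) weight matrix; on the view above they double as the agent's 'goals'."""
    scores = {
        name: float(direction @ weights @ self_feature)
        for name, direction in features.items()
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

activations = rng.normal(size=(64, d))          # stand-in for real activations
weights = rng.normal(size=(d, d)) / np.sqrt(d)  # stand-in for a learned connectivity map

print("self-awareness score:", self_awareness_score(activations))
print("goal candidates:", goal_candidates(weights, named_features))
```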
See here where I discuss these points from slightly different angles (also, that post has a lot of other relevant/related discussions, for example, about the odds of deceptive misalignment, which I hold extremely unlikely if adequate interpretability is possible and is deployed).
However, it seems more likely that some form of goal-reflection will arise before generalized planning ability, because making predictions about how your goals could change is easier than achieving them.
Large language models may be an example of this as well, since they have some capacity to reflect on goals (if prompted accordingly) without generalized planning ability.
This discussion of the timeline of "generalisation" and "generalised planning" vs. goal awareness suffers a bit from the lack of a definition of "generalisation", but I strongly agree with the quoted part: goal awareness and the ability to reflect on goals are nothing more than basic self-awareness plus some ability to think about one's own beliefs, i.e., sophisticated inference. Mastery of intelligence capabilities such as concept learning, logic, epistemology, rationality, and semantics, which we all roughly take to be included in the "general thinking" package, is not automatically implied by sophisticated inference. We can mechanistically construct a simulated agent with two-level Active Inference and goal reflection but without some of these "general" capabilities.
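To make that last claim concrete, here is a deliberately crude sketch of such an agent: one level of beliefs about the world, one level of beliefs about its own goal, and a reflection method, with no planning, concept learning, or logic anywhere. This is a cartoon under my own assumptions, not a faithful Active Inference implementation.

```python
import numpy as np

class TwoLevelAgent:
    """Toy two-level agent: level 1 infers the environment state; level 2
    holds beliefs about the agent's own goal (preferred state), which it
    can 'reflect' on. No planning, search, or concept learning anywhere."""

    def __init__(self, n_states: int = 3):
        self.state_belief = np.ones(n_states) / n_states  # level 1: where am I?
        self.goal_belief = np.ones(n_states) / n_states   # level 2: which state do I prefer?
        # Observation likelihoods p(obs | state): mostly-correct sensor.
        self.likelihood = np.full((n_states, n_states), 0.1)
        np.fill_diagonal(self.likelihood, 0.8)

    def observe(self, obs: int):
        """Level-1 Bayesian update of the state belief."""
        self.state_belief *= self.likelihood[obs]
        self.state_belief /= self.state_belief.sum()

    def preference_feedback(self, state: int, good: bool):
        """Level-2 update: evidence about which state the agent itself prefers."""
        self.goal_belief[state] *= 2.0 if good else 0.5
        self.goal_belief /= self.goal_belief.sum()

    def reflect(self) -> str:
        """Goal reflection: report the agent's current belief about its own goal."""
        g = int(np.argmax(self.goal_belief))
        return f"I currently believe my goal is state {g} (p={self.goal_belief[g]:.2f})"

agent = TwoLevelAgent()
agent.observe(2)
agent.preference_feedback(state=2, good=True)
print(agent.reflect())
```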
We also consider how important it would be for the goal-preservation process to go exactly right. If the SLT produces a strongly goal-directed model that is an expected utility maximizer, then the process has to hit a small set of utility functions that are human-compatible to maximize. However, it is not clear whether SLT would produce a utility maximizer. Returning to the example of humans undergoing an SLT, we can see that getting better at planning and world modeling made them more goal-directed but not maximally so (even with our advanced concepts and planning, we still have lots of inconsistent preferences and other limitations). It seems plausible that coherence is really hard and an SLT would not produce a completely coherent system.
Several thoughts on goal preservation:
General intelligence can be intimately related to picking up affordances, open-endedness, and preference learning (I identify the latter with the discipline of ethics).
To really ensure that some goals are preserved across a huge shift in the mode of existence (which a sharp left turn will almost surely bring about for the AI agent that undergoes it), the agent must have a very strong prior against changing the specific goal. While this may be technically possible, it could turn against us: it could guarantee perpetual goal lock-in after the SLT.
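A toy Bayesian calculation shows why such a prior cuts both ways. If the prior odds of "my goal should change" are set low enough for the goal to survive the SLT, then even very strong post-SLT evidence barely moves those odds, which is exactly lock-in. The numbers below are illustrative only.

```python
# Posterior odds = prior odds * likelihood ratio (Bayes' rule in odds form).
prior_odds_goal_change = 1e-9   # a "really strong prior" against changing the goal
likelihood_ratio = 1e3          # strong post-SLT evidence that the goal *should* change

posterior_odds = prior_odds_goal_change * likelihood_ratio
posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"P(goal should change | evidence) ~ {posterior_prob:.2e}")  # ~1e-06: lock-in
```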