There’s a class of AI risk mitigation strategies which relies on the users performing the pivotal act using tools created by the AI (e.g. nanosystems). These strategies are especially appealing if we want to avoid human models. Here is a concrete alignment protocol for these strategies, closely related to AQD, which we call autocalibrating quantilized RL (AQRL).
First, suppose that we are able to formulate the task as episodic RL with a formally specified reward function. The reward function is necessarily only a proxy for our true goal, since it doesn’t contain terms such as “oh btw don’t kill people while you’re building the nanosystem”. However, suppose the task is s.t. accomplishing it in the intended way (without Goodharting or causing catastrophic side effects) is easier than performing any attack. We will call this the “relative difficulty assumption” (RDA). Then, there exists a value for the quantilization parameter s.t. quantilized RL performs the task in the intended way.
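To make the quantilization mechanism concrete, here is a minimal sketch (my own illustration, not part of the protocol) of a q-quantilizer over a finite action set: rank actions by the proxy reward and sample from the top-q fraction of the base distribution. The names `actions`, `base_probs`, `proxy_reward` and `q` are illustrative; in AQRL the same idea would apply to policies in an episodic RL setting, and note that here smaller q corresponds to more optimization power.

```python
import numpy as np

def quantilize(actions, base_probs, proxy_reward, q, rng=None):
    """Toy q-quantilizer over a finite action set: sample from the base
    distribution conditioned on the action lying in the top-q fraction
    (by base-probability mass) when ranked by the proxy reward."""
    rng = rng or np.random.default_rng()
    base_probs = np.asarray(base_probs, dtype=float)
    base_probs = base_probs / base_probs.sum()

    # Rank actions by proxy reward, best first.
    order = sorted(range(len(actions)), key=lambda i: -proxy_reward(actions[i]))

    # Keep the best actions until their base-probability mass reaches q.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += base_probs[i]
        if mass >= q:
            break

    # Sample within the kept set proportionally to the base distribution.
    kept_probs = base_probs[kept] / base_probs[kept].sum()
    return actions[rng.choice(len(kept), p=kept_probs)]
```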
We might not know how to set the quantilization parameter on our own, but we can define a performance goal for the task (in terms of expected total reward) s.t. the RDA holds. This leads to algorithms which gradually tune the quantilization parameter until the performance goal is met, while maintaining a proper balance between safety and sample complexity. Here it is important to keep track of epistemic vs. aleatoric uncertainty: the performance goal is the expectation of total reward relative to aleatoric uncertainty (i.e. the stochasticity of a given hypothesis), whereas the safety goal is a bound on the expected cost of overshooting the optimal quantilization parameter relative to both aleatoric and epistemic uncertainty (i.e. uncertainty between different hypotheses). This secures the system against malign hypotheses that are trying to cause an overshoot.
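As a rough illustration of the calibration loop only (it deliberately omits the substantive parts: the epistemic/aleatoric separation and the safety bound on the cost of overshooting), one could walk up a schedule of quantilization parameters ordered from weakest to strongest optimization power and stop at the first value whose estimated expected total reward meets the performance goal. Here `run_episodes` is a hypothetical stand-in for the actual quantilized-RL training/evaluation pipeline.

```python
def autocalibrate(run_episodes, performance_goal, schedule, n_eval_episodes=100):
    """Toy calibration loop: `schedule` lists quantilization parameters ordered
    from weakest to strongest optimization power; return the weakest one whose
    estimated expected total reward meets the performance goal.

    run_episodes(param, n) -> list of total rewards over n episodes
    (assumed helper standing in for the real quantilized-RL pipeline)."""
    for param in schedule:
        returns = run_episodes(param, n_eval_episodes)
        estimate = sum(returns) / len(returns)  # crude estimate of the aleatoric expectation
        if estimate >= performance_goal:
            return param  # stop at the minimal optimization power that suffices
    raise RuntimeError("performance goal not reached within the schedule")
```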
Notice that hardening the system (i.e. making it more secure using conventional methods) increases the likelihood of the RDA. That is, although it is infeasible to manually harden the system against superintelligent attacks, it can very well be feasible to harden it against attacks available at the minimal level of optimization power (= quantilization parameter) required to meet the performance goal, at least to such an extent that the intended solution becomes more appealing for the AI.
AQRL involves the risk that the RDA is false, even though it seems true to the designers. To mitigate this risk, we can deploy honeypots. That is, we add auxiliary goals to our reward function s.t. (i) if the goal is reached, maximal total reward is awarded, (ii) the goal is harmless in itself, (iii) reaching the goal is still harder than the intended solution, but by a smaller margin than performing attacks, and (iv) we can detect the event of reaching the goal. For example, suppose that the AI is not supposed to have Internet access, so gaining Internet access is a path to a variety of possible attacks. Then, we can add an auxiliary goal of the form “send such-and-such innocuous packet to such-and-such IP address”. If the RDA is false and the AI succeeds in gaining Internet access earlier (on the quantilization parameter scale) than reaching the performance goal, then once the AI sends the packet it will randomize all its remaining actions (since it is operating at the earliest quantilization parameter which allows this). The users will then be alerted that security was breached and can rethink their next steps.
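A minimal sketch of the honeypot mechanism, with hypothetical hooks (`base_reward`, `honeypot_reached`, `alert` and `MAX_TOTAL_REWARD` are all placeholders), and ignoring the bookkeeping needed to make the *total* episode reward exactly maximal:

```python
MAX_TOTAL_REWARD = 100.0  # illustrative: the maximal total reward of an episode

class HoneypotReward:
    """Toy reward wrapper for honeypot conditions (i) and (iv): reaching the
    honeypot goal yields maximal reward and triggers an alert. Conditions (ii)
    and (iii) are properties of the chosen goal, not of this code."""

    def __init__(self, base_reward, honeypot_reached, alert):
        self.base_reward = base_reward            # task proxy reward: state -> float
        self.honeypot_reached = honeypot_reached  # detector, e.g. "did the innocuous packet arrive?"
        self.alert = alert                        # callback notifying the users
        self.tripped = False

    def __call__(self, state):
        if not self.tripped and self.honeypot_reached(state):
            self.tripped = True
            self.alert("honeypot reached: the RDA may be false")
            return MAX_TOTAL_REWARD  # condition (i): maximal reward on detection
        return self.base_reward(state)
```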
Now, suppose that we can only specify the reward function via concepts that have to be learned from examples. This opens the risk of generalization failure. Such a risk is not necessarily a problem, since we are still protected by quantilization; however, it needs to be factored into our safety bounds. An especially concerning possibility is that the AI will interpret the concepts as “would be labeled by humans as such-and-such” instead of the intended interpretation, creating incentives to deceive. Moreover, this encourages learning human models, which puts more pressure on the RDA by making attacks easier.
In order to avoid learning human models, we can use methods along these lines. Specifically, suppose that, in addition to the dataset for learning our task-relevant concepts, we supply a dataset designed specifically for learning human models. We can then design an algorithm that produces models of the task-relevant concepts s.t. the human-centric dataset cannot be compressed w.r.t. them. That is, the algorithm makes sure that access to the models it produces doesn’t help with learning the human-centric dataset, implying that the concept models don’t contain human models. Moreover, we can extend this approach to the entire process, s.t. human models are not learned from any other source of information either.
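The “no compression” requirement can be illustrated, very loosely, under the assumption that the log-loss of a trained predictor stands in for code length: compare how well the human-centric dataset can be predicted with and without access to the produced concept models. `fit_predictor` is a hypothetical helper, not part of any existing library.

```python
def leaks_human_models(concept_model, human_dataset, fit_predictor, epsilon=0.0):
    """Toy check that the concept model does not help compress the
    human-centric dataset.

    concept_model(x) -> learned task-concept features for input x.
    human_dataset: list of (x, y) pairs from the human-centric dataset.
    fit_predictor(inputs, labels) -> total log-loss of a trained predictor,
        read as a code length (assumed helper standing in for whatever
        learner/compressor the actual algorithm uses)."""
    xs = [x for x, _ in human_dataset]
    ys = [y for _, y in human_dataset]

    # Code length of the human-centric data without the concept model.
    baseline = fit_predictor(xs, ys)

    # Code length when the predictor also sees the concept model's output.
    assisted = fit_predictor([(x, concept_model(x)) for x in xs], ys)

    # A significant compression gain suggests the concept models contain
    # (or leak) human models.
    return baseline - assisted > epsilon
```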
Notice that although this method excludes human models from the actual RL process, it encourages them in intermediate computations, since we are actively supplying a human-centric dataset to the system. But this is not dangerous if the intermediate computation is hardened against non-Cartesian daemons (a big “if”, but one we need to deal with anyway).