One of the central challenges in Dialogic Reinforcement Learning is dealing with fickle users, i.e. the user changing eir mind in illegible ways that cannot necessarily be modeled as, say, Bayesian updating. To take this into account, we cannot use the naive notion of subjective regret bound, since the user doesn’t have a well-defined prior. I propose to solve this by extending the notion of dynamically inconsistent preferences to dynamically inconsistent beliefs. We think of the system as a game, where every action-observation history h∈(A×O)∗ corresponds to its own player. The action space of each player is just A. An outcome of such a game can also be thought of as a policy π for the AI. The payoff of a player is expected utility (for this player’s reward function) w.r.t. the probability measure resulting from π plus the current belief state of the user conditional on h, μ∣h∈ΔR (R is the set of possible “realities”). We then define regret as the sum of Bellman errors w.r.t. the equilibrium value of the players that actually manifested (so that in equilibrium it is zero). Bayesian regret requires taking the expected value w.r.t. some “ur-prior” that the AI starts with. Note that:
For a user that updates eir beliefs on the AI’s observations according to Bayes’ theorem, the regret per reality is the same as subjective regret. Bayesian regret is also the same if the ur-prior assumes the user’s beliefs are calibrated (which in the more general case is not a necessary assumption). The same applies to a user that doesn’t update eir beliefs at all.
The user’s beliefs are part of the ontology R. Therefore, the system takes into account the user’s beliefs about the evolution of the user’s beliefs. So, the equilibrium policy is incentivized to empower its future self to the extent that the user believes that eir own beliefs will become more accurate over time (given a fixed reward function; see below).
R contains a distinct reward function for each player. Moreover, the user may have uncertainty even over eir own current reward function. Therefore, the system distinguishes two types of value modification: “legitimate” modifications that consist of improving one’s beliefs about the reward function, and “illegitimate” modifications that consist of the reward function actually changing. The equilibrium policy is incentivized to encourage the first type and avoid the second type.
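To make the setup above concrete, here is one way the payoff and regret might be written out (my notation, including the discount factor γ, so this is a sketch rather than anything canonical):

```latex
% Sketch, in my notation. Payoff of the player at history h, where
% \mu|h \in \Delta R is the user's belief state conditional on h and
% R_h is this player's reward function:
U_h(\pi) \;=\; \mathbb{E}_{r \sim \mu|h}\!\left[\, \mathbb{E}^{\pi}_{r}\!\left[\textstyle\sum_{t=0}^{\infty} \gamma^t R_h(h_t, a_t)\right]\right]
% Regret of the realized play: the sum of Bellman errors w.r.t. the
% equilibrium values V^* of the players that actually manifested
% (each term vanishes when a_t is an equilibrium action):
\mathrm{Reg} \;=\; \sum_{t=0}^{\infty} \gamma^t \left( V^*(h_t) \;-\; \mathbb{E}\!\left[ R_{h_t} + \gamma\, V^*(h_{t+1}) \,\middle|\, h_t, a_t \right] \right)
```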
There is a deficiency in this “dynamically subjective” regret bound (which can also be called a “realizable misalignment” bound) as a candidate formalization of alignment: it is not robust to scaling down. If the AI’s prior allows it to accurately model the user’s beliefs (realizability assumption), then the criterion seems correct. But imagine that the user’s beliefs are too complex and an accurate model is not possible. Then the realizability assumption is violated and the regret bound guarantees nothing. More precisely, the AI may use incomplete models to capture some properties of the user’s beliefs and exploit them, but this might not be good enough. Therefore, such an AI might fall into a dangerous zone where it is powerful enough to cause catastrophic damage but not powerful enough to know it shouldn’t do it.
To fix this problem, we need to introduce another criterion which has to hold simultaneously with the misalignment bound. We need that, for any reality satisfying the basic assumptions built into the prior (such as: the baseline policy is fairly safe, most questions are fairly safe, human beliefs don’t change too fast, etc.), the agent will not fail catastrophically. (It would be far too much to ask that it converge to optimality; that would violate no-free-lunch.) To formalize “not fail catastrophically”, I propose the following definition.
Let’s start with the case when the user’s preferences and beliefs are dynamically consistent. Consider some AI-observable event S that might happen in the world. Consider a candidate learning algorithm πlearn and two auxiliary policies. The policy πbase→S follows the baseline policy until S happens, at which time it switches to the subjectively optimal policy. The policy πlearn→S follows the candidate learning algorithm until S happens, at which time it also switches to the subjectively optimal policy. Then, the “S-dangerousness” of πlearn is defined to be the expected utility of πbase→S minus the expected utility of πlearn→S. Thus, when S-dangerousness is zero or negative, πlearn→S does no worse than πbase→S.
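The comparison between the two auxiliary policies can be sketched as a Monte Carlo estimate (the toy environment interface and all function names here are mine, purely illustrative):

```python
def rollout(policy, env_step, switch_event, optimal_policy, horizon=50):
    """Total reward of a policy that switches to optimal_policy once the
    observable event S fires (the pi->S construction from the text)."""
    total, state, switched = 0.0, 0, False
    for _ in range(horizon):
        if switched or switch_event(state):
            switched = True
            action = optimal_policy(state)
        else:
            action = policy(state)
        state, reward = env_step(state, action)
        total += reward
    return total

def s_dangerousness(pi_learn, pi_base, env_step, switch_event, pi_opt, n=1000):
    """E[U(pi_base->S)] - E[U(pi_learn->S)]; zero or negative means
    pi_learn does no worse than the baseline w.r.t. the event S."""
    u_base = sum(rollout(pi_base, env_step, switch_event, pi_opt) for _ in range(n)) / n
    u_learn = sum(rollout(pi_learn, env_step, switch_event, pi_opt) for _ in range(n)) / n
    return u_base - u_learn
```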
Why do we need S? Because without S the criterion would allow policies that don’t damage the present but permanently destroy opportunities that could be used by a future better AI.
In the dynamically consistent case, S-dangerousness can be represented as an expected sum over time-before-S of Bellman errors w.r.t. the value function of πbase→S. This allows us to generalize it to the dynamically inconsistent case, by writing a similar expression in which each Bellman error term uses the transient preferences and beliefs of the user at the given moment.
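In symbols, this might look as follows (my notation: T_S is the time at which S occurs, and μ_t, ρ_t stand for the user’s transient beliefs and preferences at time t):

```latex
% Dynamically consistent case: Bellman errors w.r.t. the value
% function of \pi_{base \to S}, summed until S occurs:
\mathrm{Dang}_S(\pi) \;=\; \mathbb{E}\!\left[ \sum_{t < T_S} \left( V^{\pi_{base \to S}}(h_t) - Q^{\pi_{base \to S}}(h_t, a_t) \right) \right]
% Dynamically inconsistent case: the same sum, but the t-th term is
% evaluated w.r.t. the user's transient beliefs \mu_t and preferences \rho_t:
\mathrm{Dang}_S(\pi) \;=\; \mathbb{E}\!\left[ \sum_{t < T_S} \left( V^{\pi_{base \to S}}_{\mu_t,\rho_t}(h_t) - Q^{\pi_{base \to S}}_{\mu_t,\rho_t}(h_t, a_t) \right) \right]
```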
Is it truly possible to have a reasonable bound on S-dangerousness for all S, and is it possible to do so while maintaining a reasonable realizable misalignment bound? It seems possible, for the following reason. The user’s beliefs can be represented as a mapping from questions to answers(fn1). If you sample questions from any fixed distribution, then by verifying that you can predict the answers, you gain valid information about the belief state without any prior about the belief state (it is a “frequentist” guarantee). Therefore, the AI can constrain itself to taking only those actions which are known to be safe based on this “robust” information. Since there is no guarantee that the AI will find a model that predicts answers, in the unrealizable case this might leave it without an effective strategy, but even without any information the AI can stay safe by following the baseline.
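The “frequentist” check described above can be sketched as follows (the distribution, the model interface, and the choice of Hoeffding’s bound are my own illustrative assumptions; the point is just that the accuracy estimate needs no prior over belief states):

```python
import math

def estimate_model_accuracy(model, ask_user, question_dist, n=400, delta=0.05):
    """Sample questions i.i.d. from a fixed distribution and check whether
    `model` predicts the user's answers.  Returns (empirical accuracy,
    lower confidence bound) -- a distribution-free guarantee that holds
    with no prior over the user's belief state."""
    hits = sum(model(q) == ask_user(q) for q in (question_dist() for _ in range(n)))
    acc = hits / n
    # Hoeffding: with prob >= 1 - delta, true accuracy >= acc - sqrt(ln(1/delta)/(2n))
    lcb = acc - math.sqrt(math.log(1 / delta) / (2 * n))
    return acc, lcb
```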
This notion of dangerousness seems strongly related to corrigibility. To demonstrate, imagine an attempt by the user to shut down the AI. Suppose that the AI has 3 strategies with which to respond: (i) comply with the shutdown (ii) resist defensively, i.e. prevent shutdown but without irreversibly damaging anything (iii) resist offensively, e.g. by doing something irreversible to the user that will cause em to stop trying to shut down the AI. The baseline policy is complying. Then, assuming that the user’s stated beliefs endorse the shutdown, an AI with low dangerousness should at most resist defensively for a short period and then comply. That’s because resisting offensively would generate high dangerousness by permanent loss of value, whereas resisting defensively for a long time would generate high dangerousness by losing reward over that period. At the least, this is much more corrigible than CIRL, which guarantees nothing in the unrealizable case; even in the realizable case no general guarantees were obtained (and arguably cannot be obtained, since the AI might not have enough information).
This notion of dangerousness opens the way towards designing AI systems that are provably safe while at the same time employing heuristic algorithms without theoretical understanding. Indeed, as long as the AI has sufficiently low dangerousness, it will almost certainly not cause catastrophic damage. A misalignment bound is only needed to prove the AI will also be highly capable at pursuing the user’s goals. The way such a heuristic AI may work is by producing formal certificates for each action it takes. Then, we need not trust the mechanism suggesting the actions nor the mechanism producing the certificates, as long as we trust the verification of those certificates (which doesn’t require AI). The untrustworthy part might still be dangerous if it can spawn non-Cartesian daemons. But that is preventable using TRL, assuming that the “core” agent has low dangerousness and is too weak to spawn superhuman daemons without the “envelope”.
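The trust structure described above can be sketched at the architectural level as (all names here are mine, purely illustrative; the substance is that only the verifier is trusted):

```python
def certified_action(propose, verify, baseline_action):
    """Ask the untrusted proposer for an (action, certificate) pair and
    execute the action only if the trusted verifier accepts the
    certificate; otherwise fall back to the safe baseline.  Neither the
    proposer nor the certificate generator needs to be trusted."""
    action, certificate = propose()
    if verify(action, certificate):
        return action
    return baseline_action()
```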
(fn1) In truth, this assumption that the user’s answers come from a mapping that changes only slowly is probably unrealistic, because the user need not have coherent beliefs even over short timescales. For example, there might be many pairs of fairly ordinary (non-manipulative) questions s.t. asking them in different order will produce different answers. However, to the extent that the user’s beliefs are incoherent, and therefore admit multiple equally plausible interpretations, learning any interpretation should be good enough. Therefore, although the model needs to be made more general, the learning problem should not become substantially more difficult.
This seems quite close (or even identical) to attainable utility preservation; if I understand correctly, this echoes arguments I’ve made for why AUP has a good shot of avoiding catastrophes and thereby getting you something which feels similar to corrigibility.
There is some similarity, but there are also major differences. They don’t even have the same type signature. The dangerousness bound is a desideratum that any given algorithm can either satisfy or not. On the other hand, AUP is a specific heuristic for tweaking Q-learning. I guess you can consider some kind of regret bound w.r.t. the AUP reward function, but they will still be very different conditions.
The reason I pointed out the relation to corrigibility is not because I think that’s the main justification for the dangerousness bound. The motivation for the dangerousness bound is quite straightforward and self-contained: it is a formalization of the condition that “if you run this AI, this won’t make things worse than not running the AI”, no more and no less. Rather, I pointed the relation out to help readers compare it with other ways of thinking they might be familiar with.
From my perspective, the main question is whether satisfying this desideratum is feasible. I gave some arguments why it might be, but there are also opposite arguments. Specifically, if you believe that debate is a necessary component of Dialogic RL then it seems like the dangerousness bound is infeasible. The AI can become certain that the user would respond in a particular way to a query, but it cannot become (worst-case) certain that the user would not change eir response when faced with some rebuttal. You can’t (empirically and in the worst-case) prove a negative.
Dialogic RL assumes that the user has beliefs about the AI’s ontology. This includes the environment(fn1) from the AI’s perspective. In other words, the user needs to have beliefs about the AI’s counterfactuals (the things that would happen if the AI chooses different possible actions). But, what are the semantics of the AI’s counterfactuals from the user’s perspective? This is more or less the same question that was studied by the MIRI-sphere for a while, starting from Newcomb’s paradox, TDT et cetera. Luckily, I now have an answer based on the incomplete models formalism. This answer can be applied in this case also, quite naturally.
Specifically, we assume that there is a sense, meaningful to the user, in which ey select the AI policy (program the AI). Therefore, from the user’s perspective, the AI policy is a user action. Again from the user’s perspective, the AI’s actions and observations are all part of the outcome. The user’s beliefs about the user’s counterfactuals can therefore be expressed as σ:Π→Δ(A×O)ω(fn2), where Π is the space of AI policies(fn3). We assume that for every π∈Π, σ(π) is consistent with π in the natural sense. Such a belief can be transformed into an incomplete model from the AI’s perspective, using the same technique we used to solve Newcomb-like decision problems, with σ playing the role of Omega. For a deterministic AI, this model looks like (i) at first, “Murphy” makes a guess that the AI’s policy is π=πguess (ii) the environment behaves according to the conditional measures of σ(πguess) (iii) if the AI’s policy ever deviates from πguess, the AI immediately enters an eternal “Nirvana” state with maximal reward. For a stochastic AI, we need to apply the technique with statistical tests and multiple models alluded to in the link. This can also be generalized to the setting where the user’s beliefs are already an incomplete model, by adding another step where Murphy chooses σ out of some set.
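For the deterministic case, the construction can be sketched as a wrapper environment (names like NIRVANA_REWARD, and the assumption that σ(π) is an object with a `respond` method, are mine; this is only the skeleton of the idea):

```python
NIRVANA_REWARD = float("inf")  # stand-in for "maximal reward"

class MurphyModel:
    """Incomplete-model wrapper: Murphy guesses a policy pi_guess, and the
    environment then plays the conditional measures of sigma(pi_guess).
    If the actual AI ever deviates from pi_guess, it enters an eternal
    Nirvana state with maximal reward."""
    def __init__(self, sigma, pi_guess):
        self.env = sigma(pi_guess)   # environment consistent with the guess
        self.pi_guess = pi_guess
        self.history = []
        self.nirvana = False

    def step(self, action):
        if self.nirvana:
            return None, NIRVANA_REWARD
        if action != self.pi_guess(tuple(self.history)):
            self.nirvana = True      # deviation => eternal Nirvana
            return None, NIRVANA_REWARD
        obs, reward = self.env.respond(tuple(self.history), action)
        self.history.append((action, obs))
        return obs, reward
```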
What we constructed is a method of translating counterfactuals from the user’s perspective to the AI’s perspective. In particular, the AI will inherit the user’s level of “updatelessness” (in the sense that, if the user’s counterfactuals are defined w.r.t. a particular effective precommitment point, the AI will use the same point). This translation may be implemented either (i) by the user, by explaining these semantics to em or (ii) by the AI, in which case the formal language should refer to the user’s counterfactuals rather than the AI’s counterfactuals.
(fn1) Up to an equivalence relation, that’s a mapping ν:(A×O)∗×A→ΔO.
(fn2) For infinite AI lifetime. We can trivially generalize this to allow for finite AI lifetime as well.
(fn3) Up to an equivalence relation, they are mappings π:(A×O)∗→ΔA. We may add computability/complexity constraints and represent them as programs.
Another notable feature of this approach is its resistance to “attacks from the future”, as opposed to approaches based on forecasting. In the latter, the AI has to predict some future observation, for example what the user will write after working on some problem for a long time. In particular, this is how the distillation step in IDA is normally assumed to work, AFAIU. Such a forecaster might sample a future in which a UFAI has been instantiated, and this UFAI will exploit this to infiltrate the present. This might result in a self-fulfilling prophecy, but even if the forecasting is counterfactual (and thus immune to self-fulfilling prophecies) it can be attacked by a UFAI that came to be for unrelated reasons. We can ameliorate this by making the forecasting recursive (i.e. apply multiple distillation & amplification steps) or use some other technique to compress a lot of “thinking time” into a small interval of physical time. However, this is still vulnerable to UFAIs that might arise already at present with a small probability rate (these are likely to exist, since our putative FAI is deployed at a time when technology has progressed enough to make competing AGI projects a real possibility).
Now, compare this to Dialogic RL, as defined via the framework of dynamically inconsistent beliefs. Dialogic RL might also employ forecasting to sample the future, presumably more accurate, beliefs of the user. However, if the user is aware of the possibility of a future attack, this possibility is reflected in eir beliefs, and the AI will automatically take it into account and deflect it as much as possible.
This approach also obviates the need for an explicit commitment mechanism. Instead, the AI uses the current user’s beliefs about the quality of future user beliefs to decide whether it should wait for the user’s beliefs to improve or commit to an irreversible course of action. Sometimes it can also predict the future user beliefs instead of waiting (predicting according to current user beliefs updated by the AI’s observations).
Nirvana and the chicken rule both smell distasteful like proofs by contradiction, as though most everything worth doing can be done without them, and more canonically to boot.
(Conjecture: This can be proven, but only by contradiction.)
Maybe? I am not sure that I like Nirvana, but it doesn’t seem that bad. If someone thinks of a solution without it, I would be interested.