One possible candidate property that Paul has proposed is act-based corrigibility, wherein an agent respects our short-term preferences, including those over how the agent itself should be modified.
Similar to my (now largely resolved) confusion about how Paul uses “corrigibility”, I’m also confused about how “corrigibility” is used here. In particular, is “act-based corrigibility” synonymous with “respects our short-term preferences” (and if so, do you mean “preferences-on-reflection”), or is it a different property (i.e., corrigibility_MIRI or a broader version of that) that you think “an agent respects our short-term preferences” is likely to have? It seems to me from context that you mean the former (synonymous), because earlier you wrote:
what is the easiest-to-train-and-verify property such that all models that satisfy that property[1] (and achieve high average reward) are safe?
and “respects our short-term preferences” seems to be the “candidate property” that you’re naming “act-based corrigibility”, because it’s a “mechanistic” property that might be easy to train and verify, whereas corrigibility in the MIRI sense (or my current understanding of Paul’s sense) does not seem mechanistic or easy to verify.
Can you please confirm whether my guess of your original intended meaning is correct? And if it is, please consider changing your wording here (or in the future) to be more consistent with Paul’s clarification of how he uses “corrigibility”?
Part of the point that I was trying to make in this post is that I’m somewhat dissatisfied with many of the existing definitions and treatments of corrigibility, as I feel like they don’t give enough of a basis for actually verifying them. So I can’t really give you a definition of act-based corrigibility that I’d be happy with, as I don’t think there currently exists such a definition.
That being said, I think there is something real in the act-based corrigibility cluster, which (as I describe in the post) I think looks something like corrigible alignment in terms of having some pointer to what the human wants (not in a perfectly reflective way, but just in terms of actually trying to help the human) combined with some sort of pre-prior creating an incentive to improve that pointer.
I thought Evan’s response was missing my point (that “act-based corrigibility” as used in OP doesn’t seem to be a kind of corrigibility as defined in the original corrigibility paper but just a way to achieve corrigibility) and had a chat with Evan about this on MIRIxDiscord (with Abram joining in). It turns out that by “act-based corrigibility” Evan meant both “a way of achieving something in the corrigibility cluster [by using act-based agents] as well as the particular thing in that cluster that you achieve if you actually get act-based corrigibility to work.”
The three of us talked a bit about finding better terms for these concepts but didn’t come up with any good candidates. My current position is that using “act-based corrigibility” this way is quite confusing and until we come up with better terms we should probably just stick with “achieving corrigibility using act-based agents” and “the kind of corrigibility that act-based agents may be able to achieve” depending on which concept one wants to refer to.
Now that I’ve lived with understanding what Evan meant by “act-based corrigibility” for a while, I find that I’m having trouble holding on to my initial feeling of “this is likely to cause confusion to people”, despite consciously trying to, and it’s starting to feel more and more reasonable to use it the way Evan did. It seems like an interesting and revealing case of the illusion of transparency in action.