Past Account comments on Developmental Stages of GPTs

Past Account 16 Aug 2020 13:48 UTC
LW: 3 AF: 2
[Deleted]
- TurnTrout 4 Dec 2020 2:54 UTC
  LW: 4 AF: 2
  What is the formal definition of ‘power seeking’?
  The freshly updated paper answers this question in great detail; see section 6 and also appendix B.
- TurnTrout 16 Aug 2020 18:41 UTC
  LW: 2 AF: 1
  What is the formal definition of ‘power seeking’?
  Great question. One thing you could say is that an action is power-seeking compared to another if your expected power (computed on the non-dominated subgraph; see Figure 19) is greater for that action than for the other.
  Power is kinda weird when defined for optimal agents, as you say: when $γ = 1$, POWER can only decrease. See Power as Easily Exploitable Opportunities for more on this.
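  One way to write that comparison down, as a sketch: $T(s, a)$, the distribution over next states after taking action $a$ in state $s$, is notation assumed here and not taken from the paper, and POWER is whatever the paper defines it to be (on the non-dominated subgraph where needed). Action $a$ is power-seeking compared to $a'$ at state $s$ when

  $$\mathbb{E}_{s' \sim T(s, a)}\left[\mathrm{POWER}(s', γ)\right] > \mathbb{E}_{s' \sim T(s, a')}\left[\mathrm{POWER}(s', γ)\right].$$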
  My understanding of figure 7 of your paper indicates that cycle reachability cannot be a sufficient condition.
  Shortly after Theorem 19, the paper says: “In appendix C.6.2, we extend this reasoning to k-cycles (k > 1) via theorem 53 and explain how theorem 19 correctly handles fig. 7.” In particular, see Figure 19.
  The key insight is that Theorem 19 talks about how many agents end up in a set of terminal states, not how many go through a state to get there. If you have two states with disjoint reachable terminal state sets, you can reason about the phenomenon pretty easily. Practically speaking, this should often suffice: for example, the off-switch state is disjoint from everything else.
  If not, you can sometimes consider the non-dominated subgraph in order to regain disjointness. This isn’t in the main part of the paper, but basically you toss out transitions which aren’t part of a trajectory which is strictly optimal for some reward function. Figure 19 gives an example of this.
  The main idea, though, is that you’re reasoning about what the agent’s end goals tend to be, and then saying “it’s going to pursue some way of getting there with much higher probability, compared to this small set of terminal states (i.e. shutdown)”. Theorem 17 tells us that in the limit, cycle reachability totally controls POWER.
  I think I still haven’t clearly communicated all my mental models here, but I figured I’d write a reply now while I update the paper.
  Thank you for these comments, by the way. You’re pointing out important underspecifications. :)
  My philosophy is that aligned/general is OK based on a shared (?) premise that,
  I think one problem is that power-seeking agents are generally not that corrigible, which means outcomes are extremely sensitive to the initial specification.
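
The terminal-state counting argument in the comment above (how many agents end up in a set of terminal states) can be illustrated with a small Monte Carlo sketch. This is not the paper's proof or code. It assumes a toy setup that the comment does not specify: every terminal state is a self-loop, the two reachable sets are disjoint, rewards are drawn IID uniform on [0, 1], and the discount is close enough to 1 that an optimal agent simply heads for the highest-reward terminal state it can reach. The function name `fraction_choosing` and the group names are hypothetical.

```python
import random

def fraction_choosing(groups, n_samples=20000, seed=0):
    """Estimate, for each disjoint group of terminal states, the fraction of
    sampled reward functions whose best terminal state lies in that group.

    groups maps a (hypothetical) group name to its number of terminal states.
    """
    rng = random.Random(seed)
    counts = {name: 0 for name in groups}
    for _sample in range(n_samples):
        best_name, best_reward = None, -1.0
        for name, n_states in groups.items():
            for _state in range(n_states):
                # Assumption: one IID uniform reward per terminal state.
                reward = rng.random()
                if reward > best_reward:
                    best_name, best_reward = name, reward
        # In the limit of far-sighted agents, the optimal policy ends up in
        # the terminal state with the highest sampled reward.
        counts[best_name] += 1
    return {name: count / n_samples for name, count in counts.items()}

# Toy example: one shutdown state versus three other terminal states.
groups = {"shutdown": 1, "everything_else": 3}
fractions = fraction_choosing(groups)
print(fractions)
```

Under these assumptions each terminal state is equally likely to carry the highest reward, so the estimate for `shutdown` should land near 1/4 and the rest near 3/4. The larger set wins simply because it contains more terminal states.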