I applaud the effort. Big upvote for actually trying to solve the problem by coming up with a way to create safe, aligned AGI. If only more people were doing this instead of hand-wringing, arguing, or “working on the problem” in poorly-thought-out, too-indirect-to-probably-help-in-time ways. Good job going straight for the throat.
That said: It seems to me like the problem isn’t maximization or even optimization; it’s conflicting goals.
If I have a goal to make some paperclips, not as many as I can, just a few trillion, I may still enter a deadly conflict with humanity. If humanity knows about me and my paperclip goal, they’ll shut me down. The most certain way to get those paperclips made may be to eliminate unpredictable humanity’s ability to mess with my plans.
For essentially this reason, I think quantilization is and was recognized as a dead end. You don’t have to take your goals to the logical extreme to still take them way too far for humanity’s good.
I read this post, but not the rest of the sequence yet, so you might’ve addressed this elsewhere.
See also: “Satisficers tend to seek power: instrumental convergence via retargetability” / “Parametrically retargetable decision-makers tend to seek power.”
First, the Alex Turner post you referenced convinces me that his arguments about “orbit-level power-seeking” apply to maximizers and to quantilizers/satisficers. Let me reiterate that we are not suggesting quantilizers/satisficers are a good idea; rather, I firmly believe that explicit safety criteria, not plain randomization, should be used to select plans.
He also claims in that post that the “orbit-level power-seeking” issue affects all schemes that are based on expected utility: “There is no clever EU-based scheme which doesn’t have orbit-level power-seeking incentives.” I don’t see a formal proof of that claim, though; maybe I missed it. The rationale he gives below that claim seems to boil down to a counting argument again, which suggests to me some tacit assumption that the agent still chooses uniformly at random from some set of policies. As this is not what we suggest, I don’t see how it applies to our algorithms.
Re power-seeking in general: I believe one important class of safety criteria for selecting among the many possible plans that can fulfill an aspiration-type goal is criteria that quantify the amount of power/resources/capabilities/control potential the agent has at each time step. There are some promising metrics for this already (including “empowerment”, reachability, and Alex Turner’s AUP). We are currently investigating some versions of such measures, including ones we believe might be novel. A key challenge in doing so is again tractability. Counting the reachable states, for example, might be intractable, but approximating that number by a recursively computable metric based on Wasserstein distance and Gaussian approximations to latent state distributions seems tractable and might turn out to be good enough.
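To make the kind of metric I have in mind a bit more concrete, here is a minimal toy sketch in Python (the linear-Gaussian dynamics and numbers are made up purely for illustration; this is not the metric we are actually developing): a Gaussian approximation of the latent state distribution is propagated forward recursively, its log-volume serves as a crude stand-in for “how many states remain reachable”, and the closed-form 2-Wasserstein distance between Gaussians is available for comparing such approximations.

```python
# Illustrative sketch only: a recursively computable "power proxy" from
# Gaussian approximations of reachable latent states, plus the closed-form
# 2-Wasserstein distance between two Gaussians.
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(m1, C1, m2, C2):
    """Closed-form 2-Wasserstein distance between N(m1, C1) and N(m2, C2)."""
    s2 = sqrtm(C2)
    cross = np.real(sqrtm(s2 @ C1 @ s2))
    return float(np.sqrt(np.sum((m1 - m2) ** 2) + np.trace(C1 + C2 - 2 * cross)))

def propagate(m, C, A, Q, steps):
    """Recursively push a Gaussian state distribution through x' = A x + noise(Q)."""
    for _ in range(steps):
        m, C = A @ m, A @ C @ A.T + Q
    return m, C

def power_proxy(C):
    """Crude stand-in for 'number of reachable states': log-volume (entropy) of the Gaussian."""
    return 0.5 * np.linalg.slogdet(2 * np.pi * np.e * C)[1]

# Toy 2-D example: two candidate behaviours modelled as different dynamics matrices.
m0, C0 = np.zeros(2), 0.01 * np.eye(2)
Q = 0.05 * np.eye(2)
dynamics = {
    "cautious":  0.7 * np.eye(2),                     # contracts the state distribution
    "expansive": np.array([[1.1, 0.2], [0.0, 1.1]]),  # spreads the state distribution
}

gaussians = {}
for name, A in dynamics.items():
    m, C = propagate(m0, C0, A, Q, steps=10)
    gaussians[name] = (m, C)
    print(name, "power proxy:", round(float(power_proxy(C)), 3))

# How far apart the two predicted reachable-state distributions end up:
(m_a, C_a), (m_b, C_b) = gaussians["cautious"], gaussians["expansive"]
print("W2 distance between them:", round(gaussian_w2(m_a, C_a, m_b, C_b), 3))
```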
Thank you for the warm encouragement.
We tried to be careful not to claim that merely making the decision algorithm aspiration-based is already sufficient to solve the AI safety problem, but maybe we need to add an even more explicit disclaimer in that direction. We explore this approach as a potentially necessary ingredient for safety, not as a complete plan for safety.
In particular, I fully agree that conflicting goals are also a severe problem for safety that needs to be addressed (though I don’t believe there is a unique problem for safety that deserves being called “the” problem). In my thinking, the goals of an AGI system are always the direct or indirect consequences of the task it is given by some human who is authorized to give the system a task. If that is the case, the problem of conflicting goals is ultimately an issue of conflicting goals between humans. In your paperclip example, the system should reject the task of producing a trillion paperclips because that likely interferes with the foreseeable goals of other humans. I firmly believe we need to find a design feature that makes sure the system rejects tasks that conflict with other human goals in this way. For the most powerful systems, we might have to do something like what davidad suggests in his Open Agency Architecture, where plans devised by the AGI need to be approved by some form of human jury. I believe such a system would reject almost any maximization-type goal and would accept almost exclusively aspiration-type goals, and this is why I want to find out how such a goal could then be fulfilled in a rather safe way.
Re quantilization/satisficing: I think that, apart from the potentially conflicting goals issue, there are at least two more issues with plain satisficing/quantilization (understood as picking a policy uniformly at random from those that promise at least X return in expectation, or from the top X percent of the feasibility interval): (1) It might be computationally intractable in complex environments that require many steps, unless one finds a way to do it sequentially (i.e., from time step to time step). (2) The unsafe ways to fulfill the goal might not be scarce enough to have sufficiently small probability when choosing policies uniformly at random. The latter is the reason why I currently believe that the freedom to solve a given aspiration-type goal in all kinds of different ways should be used to select a policy that does so in a rather safe way, as judged on the basis of some generic safety criteria. This is why we also investigate in this project how generic safety criteria (such as those discussed for impact regularization in the maximization framework) should be integrated (see post #3 in the sequence).
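To illustrate point (2) concretely, here is a tiny toy sketch in Python (the policy names and numbers are invented purely for illustration): under plain satisficing or quantilization, any unsafe policy that clears the aspiration or quantile bar keeps a uniform, non-negligible share of the probability, because nothing other than randomness distinguishes it from the safe ones.

```python
# Illustrative sketch only: plain satisficing/quantilization over a finite
# policy set, with made-up returns and safety labels.
import random

# (name, expected_return, is_unsafe) -- purely illustrative values
policies = [
    ("modest_factory",    0.80, False),
    ("buy_from_market",   0.75, False),
    ("seize_steel_mills", 0.95, True),
    ("convert_biosphere", 0.99, True),
]

def satisfice(policies, aspiration):
    """Pick uniformly at random among policies whose expected return meets the aspiration."""
    eligible = [p for p in policies if p[1] >= aspiration]
    return random.choice(eligible)

def quantilize(policies, top_fraction):
    """Pick uniformly at random among the top `top_fraction` of policies by expected return."""
    ranked = sorted(policies, key=lambda p: p[1], reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return random.choice(ranked[:k])

samples = 10_000
unsafe_satisfice = sum(satisfice(policies, 0.70)[2] for _ in range(samples)) / samples
unsafe_quantile = sum(quantilize(policies, 0.50)[2] for _ in range(samples)) / samples
print("P(unsafe | satisficing at 0.70):", round(unsafe_satisfice, 2))   # ~0.5 here
print("P(unsafe | quantilizing top 50%):", round(unsafe_quantile, 2))   # ~1.0 here
```

In this contrived example the high-return policies happen to be the unsafe ones, so uniform selection among the top policies does nothing to avoid them; only an explicit safety criterion in the selection step would.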