Paralysis of the form “AI system does nothing” is the most likely failure mode. This is a “de-pessimizing” agenda at the meta-level as well as at the object-level. Note, however, that there are some very valuable and ambitious tasks (e.g. building robots that install solar panels without damaging animals or irreversibly affecting existing structures, and that only talk to people via a highly structured script) that can likely be specified without causing paralysis, even if they fall short of ending the acute risk period.
“Locked into some least-harmful path” is a potential failure mode if the semantics or implementation of causality or decision theory in the specification framework are done differently from how I hope. Locking into a particular path massively reduces the entropy of the outcome distribution, well beyond what is necessary to ensure a reasonable risk threshold (e.g. 1 catastrophic event per millennium) is cleared. A FEEF objective (namely, minimize the divergence of the distribution of outcomes conditional on intervention from the distribution of outcomes conditional on filtering for the goal being met) would greatly penalize the additional facts which are enforced by the lock-in behaviours.
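For concreteness, here is one way such an objective could be written down (an illustrative sketch only; the notation is mine and the actual specification framework may differ): the policy $\pi$ is chosen to minimize the divergence between the outcome distribution it induces and the outcome distribution obtained by conditioning on the goal $G$ being met,

$$\pi^{*} \;=\; \arg\min_{\pi}\; D_{\mathrm{KL}}\Big(\, p\big(o \mid \mathrm{do}(\pi)\big) \;\Big\Vert\; p\big(o \mid G\big) \Big).$$

Under this sketch, any low-entropy structure that a locked-in path forces onto $p(o \mid \mathrm{do}(\pi))$ but that is absent from the goal-conditioned distribution shows up directly as divergence, which is the sense in which the additional enforced facts are penalized.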
As a fail-safe, I propose to mitigate the downsides of lock-in by using time-bounded utility functions.
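To illustrate what I mean by time-bounded (again only a sketch, with $T$ a hypothetical horizon parameter): the utility of a trajectory $\tau = (s_0, s_1, \dots)$ accumulates only up to the bound,

$$U_{T}(\tau) \;=\; \sum_{t=0}^{T} u(s_t),$$

so states after $T$ contribute nothing to the objective, and the system gains nothing by continuing to enforce a locked-in path past that horizon.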
(understood that you’d want to avoid the below by construction through the specification)
I think the worries about a “least harmful path” failure mode would also apply to a “below 1 catastrophic event per millennium” threshold. It’s not obvious to me that the vast majority of ways to [avoid significant risk of catastrophe-according-to-our-specification] wouldn’t lead to highly undesirable outcomes.
It seems to me that “greatly penalize the additional facts which are enforced” is a two-edged sword: we want various additional facts to be highly likely, since our acceptability specification doesn’t capture everything that we care about.
I haven’t thought about it in any detail, but doesn’t using time-bounded utility functions also throw out any acceptability guarantee for outcomes beyond the time-bound?