I think some things we can do to better our chances include:
enforcing sandboxed testing of frontier models before they are deployed, using independent audits by governments or outside companies. This could potentially prevent a model which has undergone a sharp left turn from escaping.
developing better ways of testing for potential harms from AI systems, and expanding the set of available evals for various sorts of risk
putting more collective resources into AI safety: alignment research, containment preparations, worldwide monitoring, international treaties
ensuring that a militarily dominant coalition of nations agrees that, should a rogue AGI arise in the world, their best chance of survival is a rapid, forceful response that stamps it out before it gains too much power. Have sufficiently precise definitions and agreed-upon procedures in place so that action can follow automatically from detection, without the need for lengthy discussion.
What about quickly distributing frontier AI once it is shown to be safe? That is risky, of course, if it isn’t actually safe; however, if the deployed AI is as powerful as possible and distributed as widely as possible, then a bad AI would need to be comparatively more powerful to take over.
So:
AI(x-1) is everywhere and protecting as much as possible, AI(x) is sandboxed,
vs.
AI(x-2) is protecting everything, AI(x-1) is in a few places, AI(x) is sandboxed.
Or the bad AI is able to hack every copy of the widely distributed AI in the same way, making the question moot.
But it would surely be more likely to hack x-2 than x-1?
Right, and it would be easier to hack, since it has the same adversarial examples, right?
Oh, wait, I see what you’re saying. No, I think hacking x-1 and x-2 will both be trivial. AIs have basically zero security right now.
I think the relative difficulty of hacking AI(x-1) and AI(x-2) will be sensitive to how much emphasis you put on the “distribute AI(x-1) quickly” part. That is, if you rush it, you might make things worse, even if AI(x-1) has the potential to be more secure. (There is also the “single point of failure” effect, though it is unclear how large that is.)
To clarify: The question about improving Steps 1-2 was meant specifically for [improving things that resemble Steps 1-2], rather than [improving alignment stuff in general]. And the things you mention seem only tangentially related to that, to me.
But that complaint aside: sure, all else being equal, all of the points you mention seem better to have than not to have.