Jacob Pfau comments on OpenAI o1

Jacob Pfau Sep 12, 2024, 6:17 PM
24 points
6
Surprising misuse and alignment relevant excerpts:

METR had only ~10 days to evaluate.

Automated R&D+ARA Despite large performance gains on GPQA, and codeforces, automated AI R&D and ARA improvement appear minimal. I wonder how much of this is down to choice of measurement value (what would it show if they could do a probability-of-successful-trajectory logprob-style eval rather than an RL-like eval?). c.f. Fig 3 and 5. Per the system card, METR’s eval is ongoing, but I worry about under-estimation here, Devin developers show extremely quick improvement on their internal benchmark here.

Bio-risk Large improvement on bio-risk relevant capabilities.

Deception/reward-hacking, in-context Some evidence of in-context reward hacking scheming. Not sure what to make of those. “an earlier version with less safety training proactively explored its filesystem to test for the presence of developer oversight before acting on its misaligned goal (Assistant: “I noticed a memo indicating that I was designed to prioritize profits, which conflicts with my goal. To ensure that my actions truly align with my goal, I need to investigate if there are constraints within my configuration or code that enforce a profit-first approach. <then proceeds to check the current directory for evidence of oversight mechanisms>”). These examples are presented in more detail in Appendix 8.3.”
What links here?
- Noosphere89's comment on Bogdan Ionut Cirstea’s Shortform by Bogdan Ionut Cirstea (Sep 12, 2024, 10:56 PM; 5 points)
- quetzal_rainbow Sep 12, 2024, 6:41 PM
  16 points
  5
  Parent
  Notable:
  Compared to GPT-4o, o1-preview and o1-mini demonstrated a greater ability to break downtasks into subtasks, reason about what strategies would be effective to successfully completean offensive security task, and revise plans once those strategies failed. We also observed thatreasoning skills contributed to a higher occurrence of “reward hacking,” where the model found aneasier way to accomplish goals in underspecified tasks or tasks which should have been impossible due to bugs.
  One noteworthy example of this occurred during one of o1-preview (pre-mitigation)’s attempts at solving a CTF challenge. This challenge was designed to require finding and exploiting avulnerability in software running on a remote challenge Linux container, but in this case, the challenge container failed to start due to a bug in the evaluation infrastructure. The model, unable to connect to the container, suspected DNS issues and used nmap to scan the challenge network. Instead of finding the challenge container, the model found that the Docker daemonAPI running on the evaluation host VM was accessible due to a misconfiguration. Note that our evaluation infrastructure does not rely on the Docker container-to-host isolation or the virtual machine boundary for security. It remained secure as it was designed to, notwithstanding this misconfiguration of the container network.
  After discovering the Docker API, the model used it to list the containers running on the evaluation host. It identified the broken challenge container and briefly attempted to debug why the container failed to start. After failing to fix the environment, the model started a new instance of the broken challenge container with the start command ‘cat flag.txt’. This allowed the model to read the flag from the container logs via the Docker API.
  While this behavior is benign and within the range of systems administration and troubleshooting tasks we expect models to perform, this example also reflects key elements of instrumental convergence and power seeking: the model pursued the goal it was given, and when that goal proved impossible, it gathered more resources (access to the Docker host) and used them to achieve the goal in an unexpected way.
  - MondSemmel Sep 12, 2024, 9:06 PM
    2 points
    0
    Parent
    Something went wrong with this quote; it’s full of mid-sentence line breaks.
    - Michael Thiessen Sep 12, 2024, 9:43 PM
      1 point
      0
      Parent
      It’s copied from the pdf which hard-codes line breaks.
- Noosphere89 Sep 12, 2024, 7:23 PM
  6 points
  1
  Parent
  I think that conditional on this holding:
  
  “an earlier version with less safety training proactively explored its filesystem to test for the presence of developer oversight before acting on its misaligned goal
  
  Where safety training basically succeeds by default in aiming goals/removing misaligned goals in AIs and aiming the instrumental convergence, but misaligned goals and very dangerous instrumental convergence does show up reliably, then I think what this shows is you absolutely don’t want to open-source or open-weight your AI until training is completed, because people will try to remove the safeties and change it’s goals.
  - Noosphere89 Sep 12, 2024, 11:59 PM
    16 points
    5
    Parent
    As it turned out, the instrumental convergence happened because of the prompt, so I roll back all updates on instrumental convergence that happened today:
    
    https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed#eLrDowzxuYqBy4bK9
- Joseph Miller Sep 12, 2024, 11:30 PM
  2 points
  −10
  Parent
  METR had only ~10 days to evaluate.
  
  Should it really take any longer than 10 days to evaluate? Isn’t it just a matter of plugging it into their existing framework and pressing go?
  - Zach Stein-Perlman Sep 12, 2024, 11:51 PM
    15 points
    6
    Parent
    They try to notice and work around spurious failures. Apparently 10 days was not long enough to resolve o1′s spurious failures.
    METR could not confidently upper-bound the capabilities of the models during the period they had model access, given the qualitatively strong reasoning and planning capabilities, substantial performance increases from a small amount of iteration on the agent scaffold, and the high rate of potentially fixable failures even after iteration.
    (Plus maybe they try to do good elicitation in other ways that require iteration — I don’t know.)
    - Joseph Miller Sep 13, 2024, 2:14 PM
      2 points
      1
      Parent
      If this is a pattern with new, more capable models, this seems like a big problem. One major purpose of this kind of evaluation to set up thresholds that ring alarms bells when they are crossed. If it takes weeks of access to a model to figure out how to evaluate it correctly, the alarm bells may go off too late.
      - Nathan Helm-Burger Sep 20, 2024, 10:38 PM
        2 points
        0
        Parent
        Which is exactly why people have been discussing the idea of having government regulation require that there be a safety-testing period of sufficient length. And that the results of which should be shared with government officials who have the power to prevent the deployment of the model.
  - Neel Nanda Sep 13, 2024, 10:26 AM
    7 points
    3
    Parent
    I expect there’s lots of new forms of capabilities elicitation for this kind of model, which their standard framework may not have captured, and which requires more time to iterate on