Questions for labs

Zach Stein-Perlman, Apr 30, 2024, 10:15 PM

In association with AI Lab Watch, I sent questions to some labs a week ago (except that I failed to reach Microsoft). I didn’t really get any replies: one person replied in their personal capacity, but the reply was very limited and didn’t answer any questions. Here are most of those questions, with slight edits since I shared them with the labs, and with the questions I asked multiple labs condensed into the last two sections.

Lots of my questions are normal “I didn’t find public info on this safety practice and I think you should explain” questions. Some are more like “it’s pretty uncool that I can’t find the answer to this”: breaking commitments, breaking not-quite-commitments and not explaining, having ambiguity around commitments, and taking credit for stuff^[1] when it’s very unclear that you should get credit are pretty uncool.

Anthropic

Internal governance stuff (I’m personally particularly interested in these questions—I think Anthropic has tried to set up great internal governance systems and maybe it has succeeded but it needs to share more information for that to be clear from the outside):

Who is on the board and what’s up with the LTBT?^[2] In September, Vox reported “The Long-Term Benefit Trust . . . will elect a fifth member of the board this fall.” Did that happen? (If so: who is it? when did this happen? why haven’t I heard about this? If not: did Vox hallucinate this or did your plans change (and what is the plan)?)
What are the details on the “milestones” for the LTBT and how stockholders can change/abrogate the LTBT? Can you at least commit that we’d quickly hear about it if stockholders changed/abrogated the LTBT? (Why hasn’t this been published?)
What formal powers do investors/stockholders have, besides abrogating the LTBT? (can they replace the two board members who represent them? can they replace other board members?)
What does Anthropic owe to its investors/stockholders? (any fiduciary duty? any other promises or obligations?) I think the answer is balancing their interests with pursuit of the mission; is there anything more concrete?
- I’m confused about what such balancing-of-interests entails. Oh well.
Who holds Anthropic shares + how much? At least: how much is Google + Amazon?

Details of when the RSP triggers evals:

“During model training and fine-tuning, Anthropic will conduct an evaluation of its models for next-ASL capabilities both (1) after every 4x jump in effective compute, including if this occurs mid-training, and (2) every 3 months to monitor fine-tuning/tooling/etc improvements.” Assuming effective compute scales less than 4x per 3 months, the 4x part will never matter, right? (And insofar as AI safety people fixate on the “4x” condition, they are incorrect to do so?) Or do you have different procedures for a 4x-eval vs a 3-month-eval, e.g. the latter uses the old model just with new finetuning/prompting/scaffolding/etc.?
Evaluation during deployment? I am concerned that improvements in fine-tuning and inference-time enhancements (prompting, scaffolding, etc.) after a model is deployed will lead to dangerous capabilities. Especially if models can be updated to increase their capabilities without evals.
- Do you do the evals during deployment?
- The RSP says “If it becomes apparent that the capabilities of a deployed model have been under-elicited and the model can, in fact, pass the evaluations, then we will” do stuff. How would that become apparent — via the regular evals or ad-hoc just-noticing?
- If you do do evals during deployment: suppose you have two models such that each is better than the other at some tasks (perhaps because a powerful model is deployed and a new model is in progress with a new training setup). Every 3 months, would you do full evals on both models, or what?
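The arithmetic behind the 4x-versus-3-months question above can be sketched as follows. This is a minimal illustration, not Anthropic's procedure: it assumes effective compute grows smoothly and exponentially, and that every eval resets both the 3-month clock and the compute baseline (the quoted RSP text does not say this). The names `months_until_jump`, `first_trigger`, and `growth_per_quarter` are hypothetical.

```python
import math

QUARTER_MONTHS = 3.0
JUMP = 4.0  # the "4x jump in effective compute" from the quoted RSP text


def months_until_jump(growth_per_quarter: float, jump: float = JUMP) -> float:
    """Months for effective compute to grow by a factor of `jump`.

    Assumes smooth exponential growth at `growth_per_quarter` per 3 months
    (a hypothetical parameter, not a figure from the RSP).
    """
    if growth_per_quarter <= 1.0:
        return math.inf  # compute is not growing, so the jump never happens
    return QUARTER_MONTHS * math.log(jump) / math.log(growth_per_quarter)


def first_trigger(growth_per_quarter: float) -> str:
    """Which condition fires first after an eval.

    Assumes every eval resets both the 3-month clock and the compute
    baseline; this is an assumption, not something the RSP quote states.
    """
    t = months_until_jump(growth_per_quarter)
    return "4x compute" if t < QUARTER_MONTHS else "3-month clock"


if __name__ == "__main__":
    for g in (1.5, 2.0, 3.0, 4.0, 8.0):
        print(f"{g}x per quarter: 4x jump after {months_until_jump(g):.1f} months, "
              f"first trigger: {first_trigger(g)}")
```

Under these assumptions the 4x condition only fires first when effective compute grows faster than 4x per 3 months. If a time-based eval does not reset the compute baseline, a 4x jump could land between two quarterly evals, and then the 4x condition would matter even at slower growth.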

Deployment commitments: does Anthropic consider itself bound by any commitments about deployment it has made in the past besides those in its RSP (in particular, a commitment not to meaningfully advance the frontier)? (Why hasn’t it clarified this after the confusion around Anthropic’s commitments after the Claude 3 launch?)

You shared Claude 2 with METR. Did you let METR or any other external parties do model evals for dangerous capabilities on Claude 3 before deployment? Do you plan for such things in the future? (To be clear, I don’t just mean using external red-teamers, I mean using experts in eliciting capabilities and doing evals who have substantial independence.)

[new] You’ve suggested that sharing models before deployment is a hard engineering problem.^[3] I get it if you’re worried about leaks but I don’t get how it could be a hard engineering problem — just share API access early, with fine-tuning. I think my technically inclined friends have the same attitude. If it’s actually a hard engineering problem, it would be nice to explain why.

I think it would be really cool if y’all said more about lots of safety stuff, to help other labs do better. Like, safety evals, and red-teaming, and the automated systems mentioned here, and all the other safety stuff you do.

OpenAI

Preparedness Framework:

How does the PF interact with sharing models with others, especially Microsoft? My impression is that OpenAI is required to share dangerous models with Microsoft, and of course the PF doesn’t bind Microsoft, so the PF is consistent with Microsoft deploying OpenAI’s models unsafely.
Do you believe that a system just below the Critical threshold in each risk category is extremely unlikely to be sufficiently capable to cause a global catastrophe? (if yes, I’d like to argue with you about that. if no, don’t the thresholds need to be lower?)
The beta PF says “we need dependable evidence that the model is sufficiently aligned that it does not initiate ‘critical’-risk-level tasks unless explicitly instructed to do so.” I’m not sure what this really looks like. How could you get dependable evidence, particularly given the possibility that the model is scheming?
- There are multiple ways to interpret the quoted sentence and it seems important to make sure everyone is on the same page.
The deployment commitments in the PF seem to just refer to external deployment. (This isn’t super clear, but it says “Deployment in this case refers to the spectrum of ways of releasing a technology for external impact.”) (This contrasts with Anthropic’s RSP, in which “deployment” includes internal use.) Do you commit to safeguards around internal deployment (beyond the commitments about development)?
The PF could entail pausing deployment or development. Does OpenAI have a plan for the details of what it would do if it needed to pause for safety? (E.g.: what would staff work on during a pause? does OpenAI stay financially prepared for a pause?)
“We will be running these evaluations continually, i.e., as often as needed to catch any non-trivial capability change, including before, during, and after training.”
- How can you run evals before training?
- Will you run evals during deployment?
  - If so, what if you have two models such that each is better than the other at some tasks (perhaps because a powerful model is deployed and a new model is in progress with a new training setup). Would you do full evals on both models, or what?
Can you commit that you’ll publish changes to the PF before adopting them, and ideally seek feedback from stakeholders? Or at least that you’ll publish changes when you adopt them?

Internal governance:

OpenAI recently said:
Key enhancements [to OpenAI governance] include:
* Adopting a new set of corporate governance guidelines;
* Strengthening OpenAI’s Conflict of Interest Policy;
* Creating a whistleblower hotline to serve as an anonymous reporting resource for all OpenAI employees and contractors; and
* Creating additional Board committees, including a Mission & Strategy committee focused on implementation and advancement of the core mission of OpenAI.
Can you share details? I’m particularly interested in details on the “corporate governance guidelines” and the whistleblower policy.
What does OpenAI owe to its investors? (any fiduciary duty? any other promises or obligations?) Do investors have any formal powers?
What powers does the board have, besides what’s mentioned in the PF?

What are your plans for the profit cap for investors?

You shared GPT-4 with METR. Are you planning to let METR or any other external parties do model evals for dangerous capabilities before future deployments? (To be clear, I don’t just mean using external red-teamers, I mean using experts in eliciting capabilities and doing evals who have substantial independence.)

DeepMind

DeepMind and Google have various councils and teams related to safety (see e.g. here): the Responsible AI Council, the Responsible Development and Innovation team, the Responsibility and Safety Council, the Advanced Technology Review Council, etc. From the outside, it’s difficult to tell whether they actually improve safety. Can you point me to details on what they do, what exactly they’re responsible for, what their powers are, etc.?

My impression is that when push comes to shove, Google can do whatever it wants with DeepMind; DeepMind and its leadership have no hard power. Is this correct? If it is mistaken, can you clarify the relationship between DeepMind and Google?

I hope DeepMind will soon make RSP-y commitments — like, dangerous capability evals before deployment + at least one risk threshold and how it would affect deployment decisions + a plan for making safety arguments after models have dangerous capabilities.

I wish DeepMind (or Google) would articulate a safety plan in the company’s voice, clearly supported by leadership, rather than leaving DeepMind safety folks to do so in their personal voices.

I’m kind of confused about the extent to which DeepMind and Google have a mandate for preventing extreme risks and sharing the benefits of powerful AI. Would such a mandate be expressed in places besides DeepMind’s About page, Google’s AI Principles, and Google’s Responsible AI Practices? Do any internal oversight bodies have explicit mandates along these lines?

What’s the status of DeepMind’s old Operating Principles?

“You may not use the Services to develop models that compete with Gemini API or Google AI Studio. You also may not attempt to extract or replicate the underlying models (e.g., parameter weights).” Do you enforce this? How? Do you do anything to avoid helping others create powerful models (via model inversion or just imitation learning)?
“The Services include safety features to block harmful content, such as content that violates our Prohibited Use Policy. You may not attempt to bypass these protective measures or use content that violates the API Terms or these Additional Terms.” Do you enforce this? How?
Do you enforce your Generative AI Prohibited Use Policy? How?

Microsoft

Microsoft says:

When it comes to frontier model deployment, Microsoft and OpenAI have together defined capability thresholds that act as a trigger to review models in advance of their first release or downstream deployment. The scope of a review, through our joint Microsoft-OpenAI Deployment Safety Board (DSB), includes model capability discovery. We established the joint DSB’s processes in 2021, anticipating a need for a comprehensive pre-release review process focused on AI safety and alignment, well ahead of regulation or external commitments mandating the same.
We have exercised this review process with respect to several frontier models, including GPT-4. Using Microsoft’s Responsible AI Standard and OpenAI’s experience building and deploying advanced AI systems, our teams prepare detailed artefacts for the joint DSB review. Artefacts record the process by which our organizations have mapped, measured, and managed risks, including through the use of adversarial testing and third-party evaluations as appropriate. We continue to learn from, and refine, the joint DSB process, and we expect it to evolve over time.

How do you measure capabilities? What are the capability thresholds? How does the review work? Can you share the artifacts? What are other details I’d want to know?

OpenAI-Microsoft relationship

What access does Microsoft have to OpenAI’s models and IP?
What exactly is Microsoft owed or promised by OpenAI?
What is your joint “Deployment Safety Board”? How does it work?

[At least three different labs]

[new] Did you commit to share models with UK AISI before deployment? (I suspect Rishi Sunak or Politico hallucinated this.)

Do you let external parties do model evals for dangerous capabilities before you deploy models? (To be clear, I don’t just mean using external red-teamers, I mean using experts in eliciting capabilities and doing evals who have substantial independence.)

Do you do anything to avoid helping others create powerful models (via model inversion or just imitation learning)? Do you enforce any related provisions in your terms of service?

Do you have a process for staff to follow if they have concerns about safety? (Where have you written about it, or can you share details?) What about concerns/suggestions about risk assessment policies or their implementation?

Do you ever keep research private for safety reasons? How do you decide what research to publish; how do safety considerations determine what you publish?

Do AI oversight bodies within the lab have a mandate for safety and preventing extreme risks? Please share details on how they work.

Security:

Can you commit to publicly disclose all breaches of your security?
Can you publish the reports related to your security certifications/audits/pentests (redacting sensitive details but not the overall evaluation)?
Do you limit uploads from clusters with model weights? How?
Do you use multiparty access controls? For what? How many people have access to model weights? Would your controls actually stop a compromised staff member from accessing the weights?
Do you secure developers’ machines? How?
How hard would it be for China to exfiltrate model weights that you were trying to protect?

Do you require nontrivial KYC for some types of model access? If so, please explain or point me to details.

What do you do to improve adversarial robustness, e.g. to prevent jailbreaking? (During training and especially at inference-time.)

Can you promise that you don’t use non-disparagement agreements (nor otherwise discourage current or past staff or board members from talking candidly about their impressions of and experiences with the company)?

I’m currently uncertain about how system prompts can prevent misuse. Maybe they can’t and I won’t write about this. But in case they can: How do you think about system prompts and preventing misuse?

(Note: if a public source is kinda related to a question but doesn’t directly answer it, probably I’m already aware of it and mention it on AI Lab Watch.)

^[1] I’m thinking of internal governance stuff and risk assessment / RSP-y stuff.
^[2] At least among my friends, Anthropic gets way more credit for LTBT than it deserves based on public information. For all we know publicly, Google and Amazon can abrogate the LTBT at will! Failing to share the details on the LTBT but claiming credit for the LTBT being great is not cool.
^[3] Anthropic:
“Engaging external experts has the advantage of leveraging specialized domain expertise and increasing the likelihood of an unbiased audit. Initially, we expected this collaboration [with METR] to be straightforward, but it ended up requiring significant science and engineering support on our end. Providing full-time assistance diverted resources from internal evaluation efforts.”
Anthropic cofounder Jack Clark:
“Pre-deployment testing is a nice idea but very difficult to implement.”
