1. Introduce third-party mission alignment red teaming.
Anthropic should invite external parties to scrutinize and criticize its instrumental policy and specific actions based on whether they are actually advancing its stated mission, i.e., safe, powerful, and beneficial AI.
Tentatively, red-teaming parties might include other AI labs (adjusted for conflict of interest in some way?), as well as AI safety/alignment/risk-mitigation orgs: MIRI, Conjecture, ControlAI, PauseAI, CEST, CHT, METR, Apollo, CeSIA, ARIA, AI Safety Institutes, Convergence Analysis, CARMA, ACS, CAIS, CHAI, &c.
For the sake of clarity, each red team should provide a brief on their background views (something similar to MIRI’s Four Background Claims).
Along with their criticisms, red teams would be encouraged to propose somewhat specific changes, possibly ordered by magnitude, with something like “allocate marginally more funding to this” being a small change and “pause AGI development completely” being a very big change. Ideally, they should avoid proposals in which a small improvement adopted now would block a bigger improvement later (or make it more difficult).
Since Dario seems very interested in “race to the top” dynamics: if this mission alignment red-teaming program sends a credible positive signal about Anthropic, other labs should catch up and start competing more intensely to be evaluated as positively as possible by third parties (a “race towards safety”?).
It would also be good to have a platform where red teams can converse with Anthropic, as well as with each other, with the logs of their back-and-forth published for the public to view.
Anthropic should commit to taking these criticisms seriously. In particular, given how large the stakes are, they should commit to taking something like “many parties believe that Anthropic in its current form might be net-negative, even increasing the risk of extinction from AI” as a reason to pause or slow down, even if that’s contrary to their inside view.
2. Anthropic should make an explicit statement about its infohazard policy.
This statement should explain how Anthropic thinks about, and handles, conducting and publishing research that advances AGI development without benefiting safety/alignment/x-risk reduction to an extent sufficient to offset its contribution to (likely unsafe-by-default) AGI development.