@SaferAI
simeon_c
Towards Quantitative AI Risk Management
Thanks for writing this post. I think the question of how to rule out risk once capability thresholds are crossed has generally been under-discussed, despite probably being the hardest risk management question with Transformers. In a recent paper, we coin the term “assurance properties” for the research directions that help with this particular problem.
Applying a similar type of thinking to other existing safety techniques, it seems to me that interpretability is one of the only current LLM safety directions that can get you a big Bayes factor.
The second direction that I felt could plausibly bring a big Bayes factor, although it was harder to think about because it’s still very early, was debate.
Otherwise, it seemed to me that successes with things like RLHF / CAI / W2SG are unlikely to provide large Bayes factors.
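To give a sense of the orders of magnitude involved, here is a minimal worked example with purely illustrative numbers (not from the post). The Bayes factor is the term that multiplies the prior odds to give the posterior odds:

```latex
\underbrace{\frac{P(\mathrm{dangerous}\mid E)}{P(\mathrm{safe}\mid E)}}_{\text{posterior odds}}
\;=\;
\underbrace{\frac{P(\mathrm{dangerous})}{P(\mathrm{safe})}}_{\text{prior odds}}
\;\times\;
\underbrace{\frac{P(E\mid \mathrm{dangerous})}{P(E\mid \mathrm{safe})}}_{\text{Bayes factor carried by evidence } E}
```

If your prior that a model above the threshold is dangerous is 10% (odds of roughly 1:9) and deployment requires getting that down to 0.1% (odds of roughly 1:999), then the safety evidence E has to be roughly 100x more likely under “safe” than under “dangerous”. The claim above is that interpretability (and perhaps debate) are the only current directions that look like they could produce evidence of that strength.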
This article fails to account for the fact that abiding by the suggested rules would mostly kill journalists’ ability to bring the public the most valuable information they provide.
You don’t get to reveal stuff about the world’s most powerful organizations if you double-check the quotes with them.
I think journalism is one of the professions where the trade-offs between consequentialist and deontological ethics are toughest. It’s just really hard to abide by very high privacy standards and still break highly important news.
As one illustrative example, your standard would have prevented Kelsey Piper from sharing her conversation with SBF. Is that a desirable outcome? Not sure.
Personally I use a mix of heuristics based on how important the new idea is, how quickly it can be executed, and how painful it will be to execute in the future once the excitement dies down.
The more ADHD you are, the stronger the “burst of inspired-by-a-new-idea energy” effect is, so that should count too.
Do people have takes on the most useful metrics/KPIs that could give a sense of how good the monitoring/anti-misuse measures on APIs are?
Some ideas:
a) Average time to close an account conducting misuse activities (my sense is that as long as this is >1 day, there’s little chance of preventing state actors from using API-based models for a lot of misuse, i.e. everything that doesn’t require major scale).
b) The logs of the 5 accounts/interactions that have been ranked as highest severity (my sense is that incident reporting like OpenAI/Microsoft have done on cyber is very helpful for getting a better mental model of what’s happening and how bad things are going).
c) An estimate of the number of users having meaningful jailbroken interactions per month (in absolute terms, to give a sense of how much people are misusing the models through the API).
A lot of the open-source worry has implicitly assumed that it would be easier to misuse open-source models than closed-source ones, but it’s unclear to what extent that’s already the case, and I’m looking for metrics that give some insight into it. My sense is that misuse requiring more scale will likely rely more on open source, while misuse that is more in the infohazard realm (e.g. chem/bio) would be done best through APIs.
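To make (a) and (c) concrete, here is a minimal sketch of how they could be computed, assuming a hypothetical log schema; the record types and fields (MisuseFlag, JailbreakEvent, flagged_at, closed_at, ...) are made up for illustration and don’t correspond to any provider’s actual data model:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class MisuseFlag:
    """Hypothetical record: one entry per account flagged for misuse."""
    account_id: str
    flagged_at: datetime          # when misuse was first detected
    closed_at: datetime | None    # when the account was closed (None if still open)

@dataclass
class JailbreakEvent:
    """Hypothetical record: one entry per interaction judged jailbroken."""
    user_id: str
    timestamp: datetime

def avg_days_to_close(flags: list[MisuseFlag]) -> float:
    """Metric (a): average time, in days, from detection to account closure."""
    durations = [
        (f.closed_at - f.flagged_at).total_seconds() / 86_400
        for f in flags
        if f.closed_at is not None
    ]
    return mean(durations) if durations else float("nan")

def monthly_jailbroken_users(events: list[JailbreakEvent], year: int, month: int) -> int:
    """Metric (c): distinct users with at least one jailbroken interaction in a given month."""
    return len({
        e.user_id
        for e in events
        if e.timestamp.year == year and e.timestamp.month == month
    })
```

Metric (b) is deliberately qualitative (reading the raw logs of the worst cases), so it doesn’t reduce to a single number in the same way.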
This looks overwhelmingly the most likely to me, and I’m glad someone wrote this post. Thanks, Buck.
Thanks for answering, that’s very useful.
My concern is that, as far as I understand, a decent number of safety researchers think policy is the most important area, but because, as you mentioned, they aren’t policy experts and don’t really know what’s going on, they just assume that Anthropic’s policy work is far better than those actually working in policy judge it to be. I’ve heard from a surprisingly high number of people at the orgs doing the best AI policy work that Anthropic’s policy work is mostly anti-helpful.
Somehow, though, internal employees keep deferring to their policy team and don’t update on this / take those outside assessments seriously.
I’d generally bet Anthropic will push more for policies I personally support than any other lab, even if they may not push as much as I want them to.
If that’s true, it is probably true only to an epsilon degree, and it might even be wrong because of the weird preferences of a non-safety industry actor. AFAIK, Anthropic has been pushing against all the AI regulation proposals to date; I have yet to hear a positive example.
How aware were you (as an employee), and are you (now), of their policy work? In a world model where policy is the most important thing, it seems to me this could weigh very negatively on Anthropic’s net impact.
This is the best alignment plan I’ve heard in a while.
You’re a LessWrong reader, you want to push humanity’s wisdom forward, and you don’t know how? Here’s a workflow:
1. Pick an important topic where the entire world is confused.
2. Post plausible-sounding takes on it in a confident tone.
3. Wait for Gwern’s comment on your post.
4. Problem solved.
See an application of the workflow here: https://www.lesswrong.com/posts/epgCXiv3Yy3qgcsys/you-can-t-predict-a-game-of-pinball?commentId=wjLFhiWWacByqyu6a
Playing catch-up is way easier than pushing the frontier of LLM research. One is about guessing which path others took; the other is about carving a path among all the possible ideas that could work.
If China stopped having access to US LLM secrets and had to push the LLM frontier rather than play catch-up, how much slower would it be at doing so?
My guess is at least 2x, probably more, but I’d be curious to hear takes.
Great initiative! Thanks for leading the charge on this.
Jack Clark: “Pre-deployment testing is a nice idea but very difficult to implement,” from https://www.politico.eu/article/rishi-sunak-ai-testing-tech-ai-safety-institute/
Thanks for the answer; it makes sense.
To be clear, I saw it thanks to Matt, who posted this tweet, so credit goes to him: https://x.com/SpacedOutMatt/status/1794360084174410104?t=uBR_TnwIGpjd-y7LqeLTMw&s=19
Lighthaven City for 6.6M€? Worth a look by the Lightcone team.
https://x.com/zillowgonewild/status/1793726646425460738?t=zoFVs5LOYdSRdOXkKLGh4w&s=19
Thanks for sharing. It’s both disturbing from a moral perspective and fascinating to read.
Very important point that wasn’t on my radar. Thanks a lot for sharing.
So first, the 85%-of-net-worth thing went quite viral several times and made Daniel Kokotajlo a bit of a heroic figure on Twitter.
Then Kelsey Piper’s reporting pushed OpenAI to give back Daniel’s vested units. I think it’s likely that Kelsey used elements from this discussion as initial hints for her reporting, and plausible that the discussion sparked her reporting; I’d love to have her confirm or deny that.
I’m not gonna lie, I’m pretty crazily happy that a random quick take I wrote in 10 minutes on a Friday morning, about how Daniel Kokotajlo should get social reward and be partially refunded, sparked a discussion that seems to have caused positive effects way beyond expectations.
Quick takes are an awesome innovation: they let you post even when you’re still partially confused/uncertain about something. Given the confusing details of the situation in this case, this would probably not have happened otherwise.
Happy to do a call on model-spec-style aspirational documents if that’s at all relevant. I think this is important, and we could be interested in helping develop a template for it if Anthropic were interested in using one.