simeon_c

Karma: 1,327

@SaferAI

simeon_c 7 Dec 2024 22:33 UTC
3 points
−6
in reply to: habryka’s comment on: Common misconceptions about OpenAI
250 upvotes is also crazy high. Another sign of the disastrous abilities of EA/LessWrong communities at character judgment.
The same thing is happening right now before our eyes with Anthropic. And similar crowds are just as confidently asserting that this time they really are the good guys.

simeon_c 4 Dec 2024 20:18 UTC
2 points
−1
on: Should there be just one western AGI project?
I only skimmed, but I wanted to flag that I like Bengio’s proposal of one coordinated coalition that develops several AGIs in a coordinated fashion (e.g. training runs at the same time on their own clusters), which reduces the main downside of having one single AGI project (power concentration).

simeon_c 4 Dec 2024 20:15 UTC
13 points
−2
on: Responsible Scaling Policies Are Risk Management Done Wrong
I still agree with a lot of that post and am still essentially operating on it.

I also think that it’s interesting to read the comments because at the time the promise of those who thought my post was wrong was that Anthropic’s RSP would get better and that this was only the beginning. With RSP V2 being worse and less specific than RSP V1, it’s clear that this was overoptimistic.

Now, risk management in AI has also gone a lot more mainstream than it was a year ago, in large part thanks to the UK AISI, which started operating on it. People have also started using probabilities more, for instance in the safety cases paper, which this post advocated for.

With SaferAI, my organization, we’re still working on moving the field closer to traditional risk management and ensuring that we don’t reinvent the wheel when there’s no need to. There should be releases going in that direction over the coming months.

Overall, if I look back on my recommendations, I think they’re still quite strong. “Make the name less misleading” hasn’t been executed on, but names other than RSPs have started being used, such as Frontier AI Safety Commitments, which is a strong improvement over my “Voluntary safety commitments” suggestion.

My recommendations about what RSPs are and aren’t are also solid. My worry that the current commitments in RSPs would be pushed in policy was basically right: they’ve been used in many policy conversations as an anchor for what to do and what not to do.

Finally, the push for risk management in policy that I wanted to see happen has mostly happened. This is great news.
The main thing missing from this post is a prediction that RSPs would launch the debate about what should be done and at what levels. This is overall a good effect which has happened, and would probably have happened several months later if not for the publication of RSPs. The fact that it was done in a voluntary commitment context is unfortunate, because it levels everything down, but I still think this effect was significant.

simeon_c 16 Oct 2024 16:58 UTC
4 points
2
in reply to: Zac Hatfield-Dodds’s comment on: Daniel Kokotajlo’s Shortform
I’d be interested in also exploring model-spec-style aspirational documents too.
Happy to do a call on model-spec-style aspirational documents if it’s at all relevant. I think this is important and we could be interested in helping develop a template for it if Anthropic were interested in using it.

simeon_c 12 Oct 2024 21:42 UTC
6 points
0
on: Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming
Thanks for writing this post. I think the question of how to rule out risk past capability thresholds has generally been underdiscussed, despite it probably being the hardest risk management question with Transformers. In a recent paper, we coin the term “assurance properties” for the research directions that are helpful for this particular problem.

Using a similar type of thinking applied to other existing safety techniques, it seems to me like interpretability is one of the only current LLM safety directions that can get you a big Bayes factor.

The second one that I felt could plausibly bring a big Bayes factor was debate, although it was harder to think about because it’s still very early.

Otherwise, it seemed to me that stuff like RLHF / CAI / W2SG successes are unlikely to provide large Bayes factors.
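
For reference, a Bayes factor is the ratio by which a piece of evidence multiplies your odds. The block below states the standard odds form of Bayes' rule; the symbols S (the model is not scheming) and E (the evidence a safety technique produces) and the numbers are illustrative assumptions, not estimates.

```latex
% Odds form of Bayes' rule: posterior odds = Bayes factor x prior odds.
% S = "the model is not scheming", E = evidence from a safety technique (illustrative labels).
\[
\frac{P(S \mid E)}{P(\neg S \mid E)}
  = \underbrace{\frac{P(E \mid S)}{P(E \mid \neg S)}}_{\text{Bayes factor}}
    \cdot \frac{P(S)}{P(\neg S)}
\]
% Illustrative numbers only: starting from 1:1 prior odds,
% a Bayes factor of 2 gives 2:1 odds, i.e. P(S | E) = 2/3,
% while a Bayes factor of 20 gives 20:1 odds, i.e. P(S | E) = 20/21.
```

Evidence that would look about the same whether or not the model is scheming has a Bayes factor close to 1 and barely moves the odds.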

simeon_c 10 Oct 2024 2:24 UTC
6 points
−4
on: Advice for journalists
This article fails to account for the fact that abiding by the rules suggested would mostly kill the ability of journalists to share the most valuable information they share with the public.

You don’t get to reveal stuff from the world’s most powerful organizations if you double check the quotes with them.

I think journalism is one of the professions where consequentialist and deontological ethics face the toughest trade-offs. It’s just really hard to abide by very high privacy standards and break highly important news.

As one illustrative example, your standard would have prevented Kelsey Piper from sharing her conversation with SBF. Is that a desirable outcome? Not sure.

simeon_c 15 Sep 2024 17:13 UTC
7 points
1
in reply to: abstractapplic’s comment on: abstractapplic’s Shortform
Personally I use a mix of heuristics based on how important the new idea is, how quick it is to execute, and how painful it will be to execute in the future once the excitement dies down.
The more ADHD you are, the stronger the “burst of inspired-by-a-new-idea energy” effect is, so that should count.

simeon_c 14 Sep 2024 8:12 UTC
4 points
0
on: simeon_c’s Shortform
Do people have takes on the most useful metrics/KPIs that could give a sense of how good the monitoring/anti-misuse measures on APIs are?
Some ideas:
a) average time to close an account conducting misuse activities (my sense is that as long as this is >1 day, there’s little chance of preventing state actors from using API-based models for a lot of misuse (everything that doesn’t require major scale))
b) the logs of the 5 accounts/interactions that have been ranked as highest severity (my sense is that incident reporting like OpenAI/Microsoft have done on cyber is very helpful to get a better mental model of what’s up/how badly things are going)
c) Estimate of the number of users having meaningful jailbroken interactions per month (in absolute value, to give a sense of how much people are misusing the models through API).

A lot of the open source worry has been implicitly assuming that it would be easier to use OS than closed source, but it’s unclear to what extent that’s already the case, and I’m looking for metrics that give some insight into that. My sense is that the misuse that requires more scale will likely rely more on OS, but misuse that is more in the infohazard realm (e.g. chembio) would best be done through APIs.
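
As a rough illustration of how metrics (a) and (c) could be computed from moderation logs, here is a minimal sketch. The record layout, the function names and the sample data are all hypothetical; no provider's actual logging format is assumed.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical records: (time misuse was first flagged, time the account was closed).
closures = [
    (datetime(2024, 9, 1, 8, 0), datetime(2024, 9, 1, 20, 0)),   # closed after 12 h
    (datetime(2024, 9, 2, 9, 0), datetime(2024, 9, 4, 9, 0)),    # closed after 48 h
    (datetime(2024, 9, 3, 10, 0), datetime(2024, 9, 3, 16, 0)),  # closed after 6 h
]

# Hypothetical records: (user id, time of an interaction judged to be a meaningful jailbreak).
jailbreak_events = [
    ("u1", datetime(2024, 8, 30)),
    ("u1", datetime(2024, 9, 2)),
    ("u2", datetime(2024, 9, 5)),
    ("u1", datetime(2024, 9, 9)),
]


def mean_hours_to_close(records):
    """Metric (a): average hours between flagging an account and closing it."""
    return mean((closed - flagged).total_seconds() / 3600 for flagged, closed in records)


def share_closed_within(records, limit=timedelta(days=1)):
    """Fraction of misusing accounts closed within `limit` (the 1-day bar from (a))."""
    return sum(1 for flagged, closed in records if closed - flagged <= limit) / len(records)


def jailbroken_users_per_month(events):
    """Metric (c): absolute number of distinct users with a jailbroken interaction, per month."""
    users_by_month = defaultdict(set)
    for user_id, when in events:
        users_by_month[(when.year, when.month)].add(user_id)
    return {month: len(users) for month, users in users_by_month.items()}


print(mean_hours_to_close(closures))
print(share_closed_within(closures))
print(jailbroken_users_per_month(jailbreak_events))
```

Reporting the share closed within a day alongside the mean is a design choice: a single slow closure can dominate the average while most accounts are still handled quickly.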

simeon_c 26 Aug 2024 20:24 UTC
8 points
10
on: Would catching your AIs trying to escape convince AI developers to slow down or undeploy?
This looks to be overwhelmingly the most likely in my opinion and I’m glad someone wrote this post. Thanks Buck

simeon_c 13 Jul 2024 15:12 UTC
16 points
13
in reply to: Neel Nanda’s comment on: Neel Nanda’s Shortform
Thanks for answering, that’s very useful.
My concern is that as far as I understand, a decent number of safety researchers are thinking that policy is the most important area, but because, as you mentioned, they aren’t policy experts and don’t really know what’s going on, they just assume that Anthropic policy work is way better than those actually working in policy judge it to be. I’ve heard from a surprisingly high number of people among the orgs that are doing the best AI policy work that Anthropic policy is mostly anti-helpful.
Somehow though, internal employees keep deferring to their policy team and don’t update on that part/take their beliefs seriously.
I’d generally bet Anthropic will push more for policies I personally support than any other lab, even if they may not push as much as I want them to.
If it’s true, it is probably true to an epsilon degree, and it might be wrong because of weird preferences of a non-safety industry actor. AFAIK, Anthropic has been pushing against all the AI regulation proposals to date. I’ve yet to hear a positive example.

simeon_c 12 Jul 2024 19:29 UTC
11 points
14
in reply to: Neel Nanda’s comment on: Neel Nanda’s Shortform
How aware were you (as an employee) & are you (now) of their policy work? In a world model where policy is the most important thing, it seems to me like it could very negatively affect Anthropic’s net impact.

simeon_c 30 Jun 2024 7:58 UTC
2 points
0
in reply to: Seth Herd’s comment on: simeon_c’s Shortform
This is the best alignment plan I’ve heard in a while.

simeon_c 29 Jun 2024 8:29 UTC
14 points
0
on: simeon_c’s Shortform
You are a LessWrong reader, want to push humanity’s wisdom and don’t know how to do so? Here’s a workflow:
1. Pick an important topic where the entire world is confused
2. Post plausible sounding takes with a confident tone on it
3. Wait for Gwern’s comment on your post
4. Problem solved
See an application of the workflow here: https://www.lesswrong.com/posts/epgCXiv3Yy3qgcsys/you-can-t-predict-a-game-of-pinball?commentId=wjLFhiWWacByqyu6a

simeon_c 16 Jun 2024 14:19 UTC
2 points
0
on: simeon_c’s Shortform
Playing catch-up is way easier than pushing the frontier of LLM research. One is about guessing which path others took, the other one is about carving a path among all the possible ideas that could work.
If China stopped having access to US LLM secrets and had to push the LLM frontier rather than playing catch-up, how much slower would it be at doing so?
My guess is at least 2x and probably more, but I’d be curious to get takes.

simeon_c 30 May 2024 21:04 UTC
43 points
52
on: Non-Disparagement Canaries for OpenAI
Great initiative! Thanks for leading the charge on this.

simeon_c 29 May 2024 19:19 UTC
2 points
0
in reply to: lemonhope’s comment on: AI companies aren’t really using external evaluators
Jack Clark: “Pre-deployment testing is a nice idea but very difficult to implement,” from https://www.politico.eu/article/rishi-sunak-ai-testing-tech-ai-safety-institute/

simeon_c 26 May 2024 7:13 UTC
2 points
0
in reply to: Ben Pace’s comment on: simeon_c’s Shortform
Thanks for the answer, it makes sense.

To be clear, I saw it thanks to Matt, who posted this tweet, so credit goes to him: https://x.com/SpacedOutMatt/status/1794360084174410104?t=uBR_TnwIGpjd-y7LqeLTMw&s=19

simeon_c 25 May 2024 21:39 UTC
24 points
0
on: simeon_c’s Shortform
Lighthaven City for 6.6M€? Worth a look by the Lightcone team.

https://x.com/zillowgonewild/status/1793726646425460738?t=zoFVs5LOYdSRdOXkKLGh4w&s=19

simeon_c 24 May 2024 20:03 UTC
20 points
15
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
Thanks for sharing. It’s both disturbing from a moral perspective and fascinating to read.

simeon_c 24 May 2024 19:50 UTC
21 points
17
on: AI companies aren’t really using external evaluators
Very important point that wasn’t on my radar. Thanks a lot for sharing.