Am I correct in interpreting that your definition of “found system” would apply to nearly all useful AI systems today, such as ChatGPT, as these are algorithms which run on weights found with optimization methods such as gradient descent? If so, it is still fairly onerous.
Thanks for writing this and proposing a plan. Coincidentally, I drafted a short take here yesterday explaining one complaint I currently have with the safety conditions of this plan. In short, I suspect the “No AIs improving other AIs” criterion isn’t worth including within a safety plan: it i) doesn’t address many additional threat models at the margin (or does so ineffectively) and ii) would be too unpopular to implement (or, alternatively, too weak to be useful).
I think there is a version of this plan with a lower safety tax, with more focus on reactive policy and the other three criteria, that I would be more excited about.
Another reason why layernorm is weird (and a shameless plug): the final layernorm also contributes to self-repair in language models
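To gesture at the mechanism, here is a toy sketch (made-up numbers, standard pre-unembedding LayerNorm with the learned scale/bias omitted; not the paper’s actual setup): because the final layernorm divides by the norm of the residual stream, ablating a component that contributes a lot of norm causes everything else to get scaled up, which typically keeps the post-ablation logit well above the naive direct-effect estimate.

```python
import numpy as np

def final_layernorm(x, eps=1e-5):
    # GPT-2-style final LayerNorm (learned scale/bias omitted for simplicity):
    # center the residual stream, then divide by its standard deviation.
    x = x - x.mean()
    return x / np.sqrt((x ** 2).mean() + eps)

rng = np.random.default_rng(0)
d_model = 256

# A "logit direction" we read off with the unembedding.
readout = rng.normal(size=d_model)
readout /= np.linalg.norm(readout)

# One component that writes strongly along the readout direction, plus everything else.
head_out = 3.0 * readout + rng.normal(size=d_model)
rest = 1.0 * readout + rng.normal(size=d_model)

logit_full = final_layernorm(head_out + rest) @ readout
# Ablate the head; the LN denominator shrinks, rescaling what remains.
logit_ablated = final_layernorm(rest) @ readout
# Naive direct-effect estimate that ignores the LN rescaling.
logit_naive = logit_full - final_layernorm(head_out) @ readout

print(f"full logit:      {logit_full:.2f}")
print(f"after ablation:  {logit_ablated:.2f}")
print(f"naive estimate:  {logit_naive:.2f}")
```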
Hmm, this transcript just seems like an example of blatant misalignment? I guess I have a definition of scheming that would imply deceptive alignment—for example, for me to classify Sydney as ‘obviously scheming’, I would need to see examples of Sydney 1) realizing it is in deployment and thus acting ‘misaligned’ or 2) realizing it is in training and thus acting ‘aligned’.
In what manner was Sydney ‘pretty obviously scheming’? Feels like the misalignment displayed by Sydney is fairly different than other forms of scheming I would be concerned about
(if this is a joke, whoops sorry)
I’m surprised by this reaction. It feels like the intersection between people who have a decent shot of getting hired at OpenAI to do safety research and people who are unaware of the safety-related events at OpenAI is quite small.
What Comes Next
Coding got another big leap, both for professionals and amateurs.
Claude is now clearly best. I thought for my own purposes Claude Opus was already best even after GPT-4o, but not for everyone, and it was close. Now it is not so close.
Claude’s market share has always been tiny. Will it start to rapidly expand? To what extent does the market care, when most people didn’t in the past even realize they were using GPT-3.5 instead of GPT-4? With Anthropic not doing major marketing? Presumably adaptation will be slow even if they remain on top, especially in the consumer market.
Yet with what is reportedly a big jump, we could see a lot of wrappers and apps start switching over rapidly. Developers have to be more on the ball.
How long should we expect Claude 3.5 Sonnet to remain on top?
I do not expect anyone except Google or OpenAI to pose a threat any time soon.
Was surprised by the one-shot Hanson lecture result. From it, I decided to see if Claude could identify your text, so I plugged in the above after asking “Who do you think the author behind this paragraph is?” Response:
Based on the style, content, and context of this paragraph, I believe the author is likely Zvi Mowshowitz.
Here’s my reasoning:
1. The writing style is analytical and forward-looking, characteristic of someone who frequently discusses AI developments and their implications.
2. The author mentions Claude specifically and compares it to GPT models, showing familiarity with different AI models and their capabilities.
3. The discussion of market share, adoption rates, and speculation about future developments is typical of someone who closely follows the AI industry.
4. The mention of “wrappers and apps” and consideration of developer behavior suggests someone with knowledge of the AI ecosystem.
5. Zvi Mowshowitz is known for writing about AI developments, particularly on his blog “Don’t Worry About the Vase” and other platforms. He often provides analysis of new AI models and their potential impacts.
6. The structure and tone of the writing is similar to Zvi’s typical style, with a mix of personal opinion and broader industry analysis.
However, I should note that while this is my best guess based on the available information, I can’t be absolutely certain. If you’re looking for a definitive answer, it would be best to verify the source directly.
Would you like me to elaborate on any part of this analysis?
Wow
It also seems to have led to at least one claim in a policy memo that advocates of AI safety are being silly because mechanistic interpretability was solved.
Small nitpick (I agree with almost everything else in the post and am glad you wrote it up): this feels like an unfair criticism. I assume you are referring specifically to the statement in their paper that:
Although advocates for AI safety guidelines often allude to the “black box” nature of AI models, where the logic behind their conclusions is not transparent, recent advancements in the AI sector have resolved this issue, thereby ensuring the integrity of open-source code models.
I think Anthropic’s interpretability team, while perhaps making dubious claims about the impact of their work on safety, has been clear that mechanistic interpretability is far from ‘solved.’ For instance, Chris Olah in the linked NYT article from today:
“There are lots of other challenges ahead of us, but the thing that seemed scariest no longer seems like a roadblock,” he said.
Also, in the paper’s section on Inability to Evaluate:
it’s unclear that they’re really getting at the fundamental thing we care about
I think they are overstating how far along, and how useful, mechanistic interpretability currently is. However, I don’t think this messaging comes anywhere close to ‘mechanistic interpretability solves AI Interpretability’; that error is on a16z, not Anthropic.
Less important, but the grant justification appears to take seriously the idea that making AGI open source is compatible with safety. I might be missing some key insight, but it seems trivially obvious why this is a terrible idea even if you’re only concerned with human misuse and not misalignment.
Hmmm, can you point to where you think the grant shows this? I think the following paragraph from the grant seems to indicate otherwise:
When OpenAI launched, it characterized the nature of the risks – and the most appropriate strategies for reducing them – in a way that we disagreed with. In particular, it emphasized the importance of distributing AI broadly;1 our current view is that this may turn out to be a promising strategy for reducing potential risks, but that the opposite may also turn out to be true (for example, if it ends up being important for institutions to keep some major breakthroughs secure to prevent misuse and/or to prevent accidents). Since then, OpenAI has put out more recent content consistent with the latter view,2 and we are no longer aware of any clear disagreements. However, it does seem that our starting assumptions and biases on this topic are likely to be different from those of OpenAI’s leadership, and we won’t be surprised if there are disagreements in the future.
I’m glad to hear you got exposure to the Alignment field in SERI MATS! I still think that your writing reads as though your ideas misunderstand core alignment problems, so my best feedback is to share drafts of and discuss your ideas with others familiar with the field. My guess is that it would be preferable for you to find people who are critical of your ideas and try to understand why, since it seems like they are representative of the kinds of people who are downvoting your posts.
(preface: writing and communicating is hard, and i’m glad you are trying to improve)
i sampled two:
this post was hard to follow, and didn’t seem to be very serious. it also reads as unfamiliar with the basics of the AI Alignment problem (the proposed changes to gpt-4 don’t concretely address many/any of the core Alignment concerns, for reasons addressed by other commenters)
this post makes multiple (self-proclaimed controversial) claims that seem wrong or are not obvious, but doesn’t try to justify them in depth.
overall, i’m getting the impression that 1) your ideas are wrong and you haven’t thought about them enough, and/or 2) you aren’t communicating them well enough. i think the former is more likely, but it could also be some combination of both. i think this means that:
you should try to become more familiar with the alignment field, and common themes surrounding proposed alignment solutions and their pitfalls
you should consider spending more time fleshing out your writing and/or getting more feedback (whether it be by talking to someone about your ideas, or sending out a draft idea for feedback)
Reverse engineering. Unclear if this is being pushed much anymore. 2022: Anthropic circuits, Interpretability In The Wild, Grokking mod arithmetic
FWIW, I was one of Neel’s MATS 4.1 scholars, and I would classify 3/4 of Neel’s scholars’ outputs as reverse engineering some component of LLMs (for completeness, this is the other one, which doesn’t fit nicely as ‘reverse engineering’ imo). I would also say that this is still an active direction of research (lots of ground to cover with MLP neurons, polysemantic heads, and more).
[Paper] All’s Fair In Love And Love: Copy Suppression in GPT-2 Small
Quick feedback, since nobody else has commented: I’m all for AI Safety appearing as “not just a bunch of crazy lunatics, but an actually sensible, open and welcoming community.”
But the spirit behind this post feels like it is just throwing in the towel, and I very much disapprove of that. I think this is why I and others downvoted it, too.
Ehh… feels like your base rate of 10% for LW users who are willing to pay for a subscription is too high, especially seeing how the ‘free’ version would still offer everything I (and presumably others) care about. Generalizing to other platforms, this feels closest to Twitter’s situation with Twitter Blue, whose rate appears to be far, far lower: if we are generous and say they have one million subscribers, then out of the 41.5 million monetizable daily active users they currently have, this would suggest a base rate of less than 3%.
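For concreteness, a back-of-the-envelope check of that number (both inputs are the assumptions above, i.e. a generous subscriber guess and the reported mDAU figure, not verified statistics):

```python
# Rough check of the Twitter Blue comparison.
subscribers = 1_000_000          # generous estimate of Twitter Blue subscribers
monetizable_dau = 41_500_000     # reported monetizable daily active users
rate = subscribers / monetizable_dau
print(f"implied paid-subscription rate: {rate:.1%}")  # ~2.4%, well under a 10% base rate
```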
Thanks for the writeup!
Small nitpick: typo in “this indeed does not seem like an attitude that leads to go outcomes”
I’m not sure if you’ve seen it or not, but here’s a relevant clip where he mentions that they aren’t training GPT-5. I don’t quite know how to update from it. It doesn’t seem likely that they paused from a desire to conduct more safety work, but I would also be surprised if somehow they are reaching some sort of performance limit from model size.
However, as Zvi mentions, Sam did say:
“I think we’re at the end of the era where it’s going to be these, like, giant, giant models...We’ll make them better in other ways”
The increased public attention towards AI Safety risk is probably a good thing. But, when stuff like this is getting lumped in with the rest of AI Safety, it feels like the public-facing slow-down-AI movement is going to be a grab-bag of AI Safety, AI Ethics, and AI… privacy(?). As such, I’m afraid that the public discourse will devolve into “Woah-there-Slow-AI” and “GOGOGOGO” tribal warfare; from the track record of American politics, this seems likely—maybe even inevitable?
More importantly, though, what I’m afraid of is that this will translate into adversarial relations between AI Capabilities organizations and AI Safety orgs (more generally, that capabilities teams will become less inclined to incorporate safety concerns in their products).
I’m not actually in an AI organization, so if someone is in one and has thoughts on this dynamic happening/not happening, I would love to hear them.
Sheesh. Wild conversation. While I felt Lex was often missing the points Eliezer was making, I’m glad he gave him the space and time to speak. Unfortunately, it felt like the conversation would keep building toward some critically important insight that Eliezer wanted Lex to understand, and then Lex would change the topic to something else, and Eliezer would have to start building toward a new insight. Regardless, I appreciate that Lex and Eliezer thoroughly engaged with each other; this will probably spark good dialogue and get more people interested in the field. I’m glad it happened.
For those who are time-constrained and wondering what is in it: Lex and Eliezer cover a whole bunch of high-level points related to AI not-kill-everyone-ism, delving into various thought experiments and concepts that make up Eliezer’s worldview. Nothing super novel if you’ve been following the field for some time.
I’m glad to see this. Some initial thoughts about the control safety case:
I don’t get how the “monitoring game” is different from the red-team game. Is it because the blue team is limited to only considering protocols that use just monitoring? Or do you not expect to play a full red-team simulation (this would be worrisome)?
Option 2 (honeypotting) feels doomed (in isolation). This primarily stems from my belief that making situations that are indistinguishable from deployment is probably hard—the model has control over state in very broad and hard-to-predict ways, and this can implicitly encode that it is in deployment (for instance, the model can see code that it itself has written and would’ve taken a long time to write, and realize it has likely been deployed for some time). You can still run honeypotting tests using this newer distribution, but then you have to hope that you are running honeypotting frequently enough to catch when the model is confident it is in deployment.
The elicitation problem, and the broader problem of justifying that the red-blue teaming has been sufficient, indeed feel quite hard. This is one reason why you might expect control cases to only work for human-level AI: your ‘elicitation protocol’ may just end up being ‘use attack strategies made primarily by humans’.
There are related problems in justifying how many affordances to give the red team. For instance, decisions about whether the red team should be able to give the model (or train into it) ‘coordination signals’ are likely important, but might not be easy to make. Past research has been maximally conservative, but this might not be realistic.
The limitation that you are measuring P(catastrophe | misaligned) rather than P(catastrophe) itself is fairly large, and I would be interested in discussion of how important others think this is. More broadly, I don’t feel great about our chances of bounding P(catastrophe), for reasons such as this.
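To make the gap explicit (this is my framing, not the post’s), the decomposition I have in mind is:

P(catastrophe) = P(catastrophe | misaligned) · P(misaligned) + P(catastrophe | not misaligned) · P(not misaligned)

A control safety case only bounds the first conditional term; turning that into a bound on P(catastrophe) still requires a handle on P(misaligned) and on the risk from models that aren’t misaligned in this sense, neither of which the control methodology provides on its own.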