Seems like Unicode officially added a “person being paperclipped” emoji:
Here’s how it looks in your browser: 🙂↕️
Whether they did this as a joke or to raise awareness of AI risk, I like it!
Hey,
On anti-trust laws, see this comment. I also hope to have more to share soon
I asked Claude how relevant this is to protecting something like an H100; here are the parts that seem most relevant from my limited understanding:
1. Reading (not modifying) data from antifuse memory in a Raspberry Pi RP2350 microcontroller
2. Using Focused Ion Beam (FIB) and passive voltage contrast to extract information
3D Transistor Structures: Modern 5nm chips use FinFET or GAAFET 3D structures rather than planar transistors. The critical parts are buried within the structure, making them fundamentally more difficult to access without destroying them.
Atomic-Scale Limitations: At 5nm, we’re approaching atomic limits (silicon atoms are ~0.2nm). The physics of matter at this scale creates fundamental boundaries that better equipment cannot overcome.
Ion Beam Physics: Even with perfect equipment, ion beams create interaction volumes and damage zones that become proportionally larger compared to the target features at smaller nodes.
Thanks! Is this true for a somewhat-modern chip that has at least some slight attempt at defense, or more like the chip on a raspberry pi?
(Could you link to the context?)
Patching security problems in big old organizations involves problems that go a lot beyond “looking at code and changing it”, especially if aiming for a “strong” solution like formal verification.
TL;DR: Political problems, code that makes no sense, problems that would be easy to fix even with a simple LLM that isn’t specialized in improving security.
The best public resource I know about this is Recoding America.
Some examples iirc:
Not having a clear primary key to identify people with.
Having a website (a form) that theoretically works but doesn’t run on any browser that people actually use.
Having a security component which is supposed to catch fake submissions of forms, but is way more likely to catch real submissions (it is imo net negative).
I also learned some surprising things from working on fixing/rewriting a major bank in Israel. I can’t share such juicy stories as Recoding America publicly, but here are some that I can:
“written in COBOL” is maybe ~1% of the problem and imo not an interesting pain point to focus on
Many systems are microservices (much harder to define the expected functionality of an async system)
Different parts of the bank are written using different technologies
An example problem you’d find in the code is “representing amounts of money using JavaScript floats” (toy sketch after this list)
It’s not complicated to fix, technically
But people might say “we’re used to using floats, why change it?”
Making changes might mean they’ll have to run manual tests, so they’ll be against it
Some teams just don’t like having others touch their code, and maybe some care about their job security
Another example of an easy thing to fix is “use enums”
I’ve seen more than one conversation where “agreeing to use an enum” was a big political debate
Sometimes, to understand a system, I’d do a combination of “looking at the code”, “looking at the docs (word documents with the design)”, and “talking to one or more of the people in charge of the system” (sometimes different people have different parts of the picture)
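To make the floats and enums examples concrete, here’s a minimal sketch (in Python rather than the bank’s JavaScript; the float issue is the same in both languages, and names like `TransferType` are made up for illustration):

```python
from decimal import Decimal
from enum import Enum

# 1. Floats silently lose precision on money:
print(0.10 + 0.20)                          # 0.30000000000000004 -- not what a ledger wants
print(Decimal("0.10") + Decimal("0.20"))    # Decimal('0.30') -- exact
price_cents = 1999                          # or just store integer cents: 19.99 -> 1999

# 2. Enums instead of magic strings: a typo fails loudly
#    instead of silently falling through a string comparison.
class TransferType(Enum):
    WIRE = "wire"
    ACH = "ach"

def fee_cents(kind: TransferType) -> int:
    return 2500 if kind is TransferType.WIRE else 0

print(fee_cents(TransferType.WIRE))         # 2500
```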
[written with the hope that orgs trying to patch security problems will do well]
I want the tool to proactively suggest things while working on the document, optimizing for “low friction for getting lots of comments from the LLM”. The tool you suggested does optimize for this property very well
This is very cool, thanks!
I’m tempted to add Claude support
It isn’t exactly what I’m going for. Example use cases I have in mind:
“Here’s a list of projects I’m considering working on, and I’m adding cruxes/considerations for each”
“Here’s my new alignment research agenda” (can an AI suggest places where this research is wrong? Seems like checking this would help the Control agenda?)
“Here’s a cost-effectiveness analysis of an org”
Things I’d suggest to an AI lab CISO if we had 5 minutes to talk
I think there are projects that can prepare the lab for moving to an air gapped network (protecting more than model weights) which would be useful to start early, would have minimal impact on developer productivity, and could be (to some extent) delegated[1]
Example categories of such projects:
Projects that take serial time but can be done without the final stage that actually hurts developer productivity
Toy example: Add extra ethernet cables to the building but don’t use them yet
Reduce uncertainty about the problems that will be caused by a future security measure
Toy example: Prepare for a (partially?) air gapped network by monitoring (with consent[2]) which domains employees use and finding alternatives to them (toy sketch after this list), e.g.:
Wikipedia --> Download it
Social media --> Buy some employees a personal use computer, see if they like it?
… each domain becomes a project to prioritize and delegate, hopefully
Projects that require “product market fit” with the engineers
Toy example: The lab wants a secure[3] way to access model weights[4]. They can try an MVP solution (github PRs?), get user feedback (“too much friction!”), and work on the next draft while the users go back to accessing the weights however they want.
Note how much more it would hurt productivity if we waited on this project until security became critical and had to force the engineers to use whatever solution we could come up with quickly. This is a common property of many projects I’d suggest.
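Regarding the domain-monitoring toy example above: here’s a minimal sketch of how that inventory might start, assuming a consented-to log export with one visited domain per line (the file name and format are my invention). Each popular domain then becomes a “find an offline alternative” project to prioritize:

```python
from collections import Counter

def top_domains(log_lines, n=20):
    """Rank domains by how often they appear in a consented-to proxy/DNS log,
    assuming one visited domain per line (hypothetical log format)."""
    counts = Counter(line.strip().lower() for line in log_lines if line.strip())
    return counts.most_common(n)

if __name__ == "__main__":
    with open("proxy_log.txt") as f:    # hypothetical export from the proxy
        for domain, hits in top_domains(f):
            print(f"{hits:>6}  {domain}")
```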
I’m assuming the CISO’s team has limited focus, but spending this focus on delegating projects is a good deal. I’m also assuming this is a problem they’re happy to solve with money.
I endorse communicating why you want to do this and getting employee agreement, not just randomly following them
e.g monitored
I’m aware this example is more focused on model weights, but it felt shorter to write than other product-market-fit examples. e.g I think “experiment with opening a new office for employees who like to WFH” is more realistic for an air gapped network but was longer for me to explain
I’m looking for an AI tool which feels like Google Docs but has an LLM proactively commenting/suggesting things.
(Is anyone else interested in something like this?)
This post helped me notice I have incoherent beliefs:
“If MAGMA self-destructs, the other labs would look at it with confusion/pity and keep going. That’s not a plan”
“MAGMA should self-destruct now even if it’s not leading!”
I think I’ve been avoiding thinking about this.
So what do I actually expect?
If OpenAI (currently in the lead) would say “our AI did something extremely dangerous, this isn’t something we know how to contain, we are shutting down and are calling other labs NOT to train over [amount of compute], and are not discussing the algorithm publicly because of fear the open source community will do this dangerous thing, and we need the government ASAP”, do I expect that to help?
Maybe?
Probably nation states will steal all the models+algorithm+Slack as quickly as they can, probably a huge open source movement will protest, but it still sounds possible (15%?) that the major important actors would listen to this, especially if it was accompanied by demos or so?
What if Anthropic or xAI or DeepSeek (not currently in the lead) would shut down now?
...I think they would be ignored.
Does that imply I should help advance the capabilities of the lab most likely to act as you suggest?
Does this imply I should become a major player myself, if I can? If so, should I write on my website that I’m open to a coordinated pause?
Should I give up on being a CooperateBot, given the other players have made it so overwhelmingly clear they are happy to defect?
This is painful to think about, and I’m not sure what’s the right thing to do here.
Open to ideas from anyone.
Anyway, great post, thanks
OpenAI already have this in their charter:
We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in the next two years.”
Similar opinion here, also noting they didn’t run red-teaming and persuasion evals on the actually-final-version:
https://x.com/teortaxesTex/status/1885401111659413590
I think Control has similar problems to RLHF, where both might fail to generalize out of distribution in similar ways.
This seems important because Control has a central goal of being a fallback for that kind of failure mode.
I want to share my thoughts, including something nice Control does about this problem that I think RLHF could easily incorporate (investigate failures instead of always training against them).
What do I mean by Control generalizing out of distribution:
Our Control experiment might involve giving a model 100 leetcode problems, and we hope to show that (even with help) the model can’t successfully insert a back door without being caught by our Control setup[1].
Will this generalize to other leetcode problems? Probably.
How about more realistic coding tasks that involve confusing legacy code? Who knows.
How about more open-ended research tasks, like “here’s Ryan’s pitch for a faking-alignment paper, fill in the details and run the experiment”? I, eh, doubt it? Remember—we’re dealing with an AI that might be trying to scheme and mess up our research on purpose.
If we tested whether the AI can leak sensitive data without being caught, will it generalize to other bad things it might try to do?
And so on
Control’s soft guarantees could be somewhat achieved using RLHF
I’m writing this both to emphasise the similarity and to suggest nice things we could do with RLHF:
We generate scenarios where an AI might scheme or not scheme.
We might use a similar Control setup (“adding back doors to code”?), but the monitoring (“control protocol”) doesn’t have to be cheap+fast+scalable enough to run in production.
Given e.g 100 scenarios, use RLHF to train on a fraction of them (e.g 20), and check if the RLHF generalized to the rest.
If it didn’t generalize, pretend you caught the AI red-handed (investigate what happened, publish it publicly, don’t just train against these scenarios)
In general, I’d be excited about labs publishing a naive metric like “what fraction of our RLHF results were needed in order to generalize to the rest” (toy sketch below), as well as examples of scenarios that the AI doesn’t seem to generalize to unless you train against them specifically.
For similar reasons, I’d be excited about “we got our Control protocol working on 20% of the scenarios and it successfully generalized to the other 80%” over “we fixed our Control protocol until it worked[2] on 100% of the scenarios we had”.
Oh, and for both RLHF and Control, I think we should be very interested in “what is the distribution they used”.
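To make that naive metric concrete, here’s a minimal Python sketch; `train_on` and `behaves_safely_on` are hypothetical placeholders for whatever RLHF training and safety evaluation a lab actually runs, and the 10% increments are arbitrary:

```python
import random

def generalization_fraction(scenarios, train_on, behaves_safely_on, seed=0):
    """Toy sketch of the naive metric above: what fraction of scenarios did we
    need to train against before behaviour generalized to the held-out rest?
    `train_on` and `behaves_safely_on` are hypothetical stand-ins, not an existing API."""
    rng = random.Random(seed)
    scenarios = scenarios[:]
    rng.shuffle(scenarios)
    for pct in range(10, 101, 10):               # train on 10%, 20%, ... of the scenarios
        cut = len(scenarios) * pct // 100
        train, held_out = scenarios[:cut], scenarios[cut:]
        policy = train_on(train)
        failures = [s for s in held_out if not behaves_safely_on(policy, s)]
        if not failures:
            return pct / 100                     # e.g. 0.2 -> "20% sufficed to generalize"
        # The failures are the interesting artifact: investigate and publish them
        # (as if the AI was caught red-handed), don't just fold them into training.
    return 1.0
```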
Replies I got to this:
“We will only use the Controlled AI to write pretty normal code, so the deployment setup is in-distribution”
Me: “So how would Control help Safety research more than Capabilities? Seems like labs will try to extract AI R&D work regardless.”
“Control will let us run an AI that is smart[3] enough to cause significant damage to humanity, and extracting coding-work from such an AI is valuable”
Me: “That scenario implies an AI that can scheme in many creative ways. The Control scenario above might have (metaphorically) found the “backdoored code” neuron. That doesn’t seem enough.”
My steelman of Redwood:
“Yes, the distribution of scheming examples is super important, here are 2 posts we wrote about it that you missed, and we’ll publish 5 more next week with 14 more distributions of tasks relevant for the kind of alignment research we plan to do with a Controlled AI” (seems like Redwood always has a post directly addressing my concerns..)
“We need to start somewhere, and it’s easy to start with backdoored code or with a simple scheming objective like making an HTTP request”
“It’s useful to focus on scheming (whether using Control or RLHF) because that leads to other great assumptions”
“People are already implementing those RLHF suggestions, here are 20 arxiv papers about it that you missed”
“Control gives us less-capabilities with more-failing-loudly, which is a tradeoff we like”
“We are trying to have Control give (partial) guarantees that are pretty uncorrelated with RLHF, just as another layer of safety, because it’s the best idea we have so far, not because we actually think this will prevent the AI from scheming in a somewhat general way”
An opinion from a former lawyer
[disclaimers: they’re not an anti trust lawyer and definitely don’t take responsibility for this opinion, nor do I. This all might maybe be wrong and we need to speak to an actual anti-trust lawyer to get certainty. I’m not going to put any more disclaimers here, I hope I’m not also misremembering something]
So,
Having someone from the U.S. government sign that they won’t enforce anti-trust laws isn’t enough (even if the president signs), because the (e.g.) president might change their mind, or the next president might enforce it retroactively. This is similar to the current situation with TikTok, where Trump said he wouldn’t enforce the law that prevents Google from having TikTok on their app store, but Google still didn’t put TikTok back, probably because they’re afraid that someone will change their mind and retroactively enforce the law.
I asked if the government (e.g president) could sign “we won’t enforce this, and if we change our mind we’ll give a 3 month notice”.
The former lawyer’s response was to consider whether, in a case where the president changed their mind immediately, this signature would hold up in court. He thinks it would not, but couldn’t remember an example of something similar happening (which seems relevant).
If the law changes (for example, to exclude this letter), that works
(but it’s hard to pass such changes through congress)
If the letter is conditional on the law changing, that seems ok
My interpretation of this:
It seems probably possible to find a solution where signing this letter is legal, but we’d have to consult with an anti-trust lawyer.
[reminder that this isn’t legal advice, isn’t confident, is maybe misremembered, and so on]
To the extent you think Anthropic is a good actor, you shouldn’t be pressuring good actors like them to take actions that will make them differentially less competitive than worse actors
I think an important part of how one becomes (and stays) a good actor is by being transparent about things like this.
Anthropic could at least avoid making substantially misleading statements
Yes. But also, I’m afraid that Anthropic might solve this problem by just making fewer statements (which seems bad). Still, yes.
Hey,
In the article, you measured the MONA setup against a pure RL agent.
I’m curious about measuring MONA against the overseer-AI picking the next move directly[1]: The overseer-AI probably won’t[2] reward hack more than the MONA setup, but it’s unclear to me if it would also have worse performance.
I could imagine discovering the myopic MONA agent converging on
Picking whatever the overseer-AI would pick (since those actions would seem “obviously correct” to the overseer AI and result in the most reward)
Picking actions that seem impressive but are useless (bad! we could get better performance, still with ~no reward hacking, by asking the overseer directly[1])
Picking clever useful actions that the overseer wouldn’t have thought of (win!)
I’d give the overseer-AI scaffolding to make it more comparable to the MONA setup, such as the ability to observe the consequences of its own actions and roll them back if they seem bad.
MONA is trying to use the overseer-AI to put an upper bound on how much reward hacking happens, if I understood correctly. If anyone runs this experiment, I’d still validate that assumption, since it seems easy (the paper already has a way to measure reward hacking).
nit: I wouldn’t use a prediction market as an overseer because markets are often uninterpretable to humans, which would miss some of the point[1].
“we show how to get agents whose long-term plans follow strategies that humans can predict”. But maybe no single human actually understands the strategy. Or maybe the traders are correctly guessing that the model’s steps will somehow lead to whatever is defined as a “good outcome”, even if they don’t understand how, which has similar problems to the RL reward from the future that you’re trying to avoid.
For a simple task like booking a restaurant, we could just ask the (frozen) overseer-AI to pick[1] actions, no?
The interesting application of MONA seems to be when the myopic RL agent is able to produce better suggestions than the overseer
Edit: I elaborated
Plus maybe let the overseer observe the result and say “oops” and roll back that action, if we can implement a rollback in this context
So FYI, the emoji I posted above was actually AI generated.
The real one looks like this (on my browser):
I still think the original one looks a lot like a paperclip, but I cheated and exaggerated this in the April Fool’s version