There were recently some funny examples of LLMs playing minecraft and, for example,
The player asks for wood, so the AI chops down the player’s tree house because it’s made of wood
The player asks for help keeping safe, so the AI tries surrounding the player with walls
This seems interesting because minecraft doesn’t have a clear win condition, so unlike chess, there’s a difference between minecraft-capabilities and minecraft-alignment. So we could take an AI, apply some alignment technique (for example, RLHF), let it play minecraft with humans (which is hopefully out of distribution compared to the AI’s training), and observe whether the minecraft-world is still fun to play or if it’s known that asking the AI for something (like getting gold) makes it sort of take over the world and break everything else.
Or it could teach us something else like “you must define for the AI which exact boundaries to act in, and then it’s safe and useful, so if we can do something like that for real-world AGI we’ll be fine, but we don’t have any other solution that works yet”. Or maybe “the AI needs 1000 examples for things it did that we did/didn’t like, which would make it friendly in the distribution of those examples, but it’s known to do weird things [like chopping down our tree house] without those examples or if the examples are only from the forest but then we go to the desert”
I have more to say about this, but the question that seems most important is “would results from such an experiment potentially change your mind”:
If there’s an alignment technique you believe in and it would totally fail to make a minecraft server be fun when playing with an AI, would you significantly update towards “that alignment technique isn’t enough”?
If you don’t believe in some alignment technique but it proves to work here, allowing the AI to generalize what humans want out of its training distribution (similarly to how a human that plays minecraft for the first time will know not to chop down your tree house), would that make you believe in that alignment technique way more and be much more optimistic about superhuman AI going well?
Assume the AI is smart enough to be vastly superhuman at minecraft, and that it has too many thoughts for a human to reasonably follow (unless the human is using something like “scalable oversight” successfully. that’s one of the alignment techniques we could test if we wanted to)
This is a surprisingly interesting field of study. Some video games provide a great simulation of the real world and Minecraft seems to be one of them. We’ve had a few examples of minecraft evals with one that comes to mind here: https://www.apartresearch.com/project/diamonds-are-not-all-you-need
The property I like about minecraft (which most computer games don’t have) is that there’s a difference between minecraft-capabilities and minecraft-alignment, and the way to be “aligned” in minecraft isn’t well defined (at least in the way I’m using the word “aligned” here, which I think is a useful way). Specifically, I want the AI to be “aligned” as in “take human values into account as a human intuitively would, in this out of distribution situation”.
In the link you sent, “aligned” IS well defined by “stay within this area”. I expect that minecraft scaffolding could make the agent close to perfect at this (by making sure, before performing an action requested by the LLM, that the action isn’t “move to a location out of these bounds”) (plus handling edge cases like “don’t walk on to a river which will carry you out of these bounds”, which would be much harder, and I’ll allow myself to ignore unless this was actually your point). So we wouldn’t learn what I’d hope to learn from these evals.
Similarly for most video games—they might be good capabilities evals, but for example in chess—it’s unclear what a “capable but misaligned” AI would be. [unless again I’m missing your point]
P.S
The “stay within this boundary” is a personal favorite of mine, I thought it was the best thing I had to say when I attempted to solve alignment myself just in case it ended up being easy (unfortunately that wasn’t the case :P ). Link
I don’t think alignment KPIs like “stay within bounds” are relevant to alignment at all even as toy examples: because if so, then we could say for example that playing a packman maze game where you collect points is “capabilities”, but adding enemies that you must avoid is “alignment”. Do you agree that plitting it up that way wouldn’t be interesting to alignment, and that this applies to “stay within bounds” (as potentially also being “part of the game”)? Interested to hear where you disagree, if you do
Regarding
Distribute resources fairly when working with other players
I think this pattern matches to a trolly problem or something, where there are clear tradeoffs and (given the AI is even trying), it could probably easily give an answer which is similarly controversial to an answer that a human would give. In other words, this seems in-distribution.
Understanding and optimizing for the utility of other players
This is the one I like—assuming it includes not-well-defined things like “help them have fun, don’t hurt things they care about” and not only things like “maximize their gold”.
It’s clearly not a “in packman, avoid the enemies” thing.
It’s a “do the AIs understand the spirit of what we mean” thing.
(does this resonate with you as an important distinction?)
I think “stay within bounds” is a toy example of the equivalent to most alignment work that tries to avoid the agent accidentally lapsing into meth recipes and is one of our most important initial alignment tasks. This is also one of the reasons most capabilities work turns out to be alignment work (and vice versa) because it needs to fulfill certain objectives.
If you talk about alignment evals for alignment that isn’t naturally incentivized by profit-seeking activities, “stay within bounds” is of course less relevant.
When it comes to CEV (optimizing utility for other players), one of the most generalizing and concrete works involves at every step maximizing how many choices the other players have (liberalist prior on CEV) to maximize the optional utility for humans.
In terms of “understanding the spirit of what we mean,” it seems like there’s near-zero designs that would work since a Minecraft eval would be blackbox anyways. But including interp in there Apollo-style seems like it could help us. Like, if I want “the spirit of what we mean,” we’ll need what happens in their brain, their CoT, or in seemingly private spaces. MACHIAVELLI, Agency Foundations, whatever Janus is doing, cyber offense CTF evals etc. seem like good inspirations for agentic benchmarks like Minecraft.
If you talk about alignment evals for alignment that isn’t naturally incentivized by profit-seeking activities, “stay within bounds” is of course less relevant.
Yes.
Also, I think “make sure Meth [or other] recipes are harder to get from an LLM than from the internet” is not solving a big important problem compared to x-risk, not that I’m against each person working on whatever they want. (I’m curious what you think but no pushback for working on something different from me)
one of the most generalizing and concrete works involves at every step maximizing how many choices the other players have (liberalist prior on CEV) to maximize the optional utility for humans.
This imo counts as a potential alignment technique (or a target for such a technique?) and I suggest we could test how well it works in minecraft. I can imagine it going very well or very poorly. wdyt?
In terms of “understanding the spirit of what we mean,” it seems like there’s near-zero designs that would work since a Minecraft eval would be blackbox anyways
I don’t understand. Naively, seems to me like we could black-box observe whether the AI is doing things like “chop down the tree house” or not (?)
(clearly if you have visibility to the AI’s actual goals and can compare them to human goals then you win and there’s no need for any minecraft evals or most any other things, if that’s what you mean)
note: the minecraft agents people use have far greater ability to act than to sense. They have access to commands which place blocks anywhere, and pick up blocks from anywhere, even without being able to see them, eg the llm has access to mine(blocks.wood) command which does not require it to first locate or look at where the wood is currently. If llms played minecrafts using the human interface these misalignments would happen less
I do like the idea of having “model organisms of alignment” (notably different than model organisms of misalignment)
Minecraft is a great starting point, but it would also be nice to try to capture two things: wide generalization, and inter-preference conflict resolution. Generalization because we expect future AI to be able to take actions and reach outcomes that humans can’t, and preference conflict resolution because I want to see an AI that uses human feedback on how best to do it (rather than just a fixed regularization algorithm).
Generalization because we expect future AI to be able to take actions and reach outcomes that humans can’t
I’m assuming we can do this in Minecraft [see the last paragraph in my original post]. Some ways I imagine doing this:
Let the AI (python program) control 1000 minecraft players so it can do many things in parallel
Give the AI a minecraft world-simulator so that it can plan better auto-farms (or defenses or attacks) than any human has done so far
Imagine Alpha-Fold for minecraft structures. I’m not sure if that metaphor makes sense, but teaching some RL model to predict minecraft structures that have certain properties seems like it would have superhuman results and sometimes be pretty hard for humans to understand.
I think it’s possible to be better than humans currently are at minecraft, I can say more if this sounds wrong
[edit: adding] I do think minecraft has disadvantages here (like: the players are limited in how fast they move, and the in-game computers are super slow compared to players) and I might want to pick another game because of that, but my main crux about this project is whether using minecraft would be valuable as an alignment experiment, and if so I’d try looking for (or building?) a game that would be even better suited.
preference conflict resolution because I want to see an AI that uses human feedback on how best to do it (rather than just a fixed regularization algorithm)
Do you mean that if the human asks the AI to acquire wood and the AI starts chopping down the human’s tree house (or otherwise taking over the world to maximize wood) then you’re worried the human won’t have a way to ask the AI to do something else? That the AI will combine the new command “not from my tree house!” into a new strange misaligned behaviour?
I think it’s possible to be better than humans currently are at minecraft, I can say more if this sounds wrong
Yeah, that’s true. The obvious way is you could have optimized micro, but that’s kinda boring. More like what I mean might be generalization to new activities for humans to do in minecraft that humans would find fun, which would be a different kind of ‘better at minecraft.’
[what do you mean by preference conflict?]
I mean it in a way where the preferences are modeled a little better than just “the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence.” Sometimes humans appear to act according to fairly straightforward models of goal-directed action. However, the precise model, and the precise goals, may be different at different times (or with different modeling hyperparameters, and of course across different people) - and if you tried to model the human well at all the different times, you’d get a model that looked like physiology and lost the straightforward talk of goals/preferences
Resolving preference conflicts is the process of stitching together larger preferences out of smaller preferences, without changing type signature. The reason literally-interpreted-sentences doesn’t really count is because interpreting them literally is using a smaller model than necessary—you can find a broader explanation for the human’s behavior in context that still comfortably talks about goals/preferences.
More like what I mean might be generalization to new activities for humans to do in minecraft that humans would find fun, which would be a different kind of ‘better at minecraft.’
Oh I hope not to go there. I’d count that as cheating. For example, if the agent would design a role playing game with riddles and adventures—that would show something different from what I’m trying to test. [I can try to formalize it better maybe. Or maybe I’m wrong here]
I mean it in a way where the preferences are modeled a little better than just “the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence.”
Absolutely. That’s something that I hope we’ll have some alignment technique to solve, and maybe this environment could test.
I think there are lots of technical difficulties in literally using minecraft (some I wrote here), so +1 to that.
I do think the main crux is “would the minecraft version be useful as an alignment test”, and if so—it’s worth looking for some other solution that preserves the good properties but avoids some/all of the downsides. (agree?)
Still I’m not sure how I’d do this in a text game. Say more?
Making a thing like Papers Please, but as a text adventure, popping an ai agent into that. Also, could literally just put the ai agent into a text rpg adventure—something like the equivalent of Skyrim, where there are a number of ways to achieve the endgame, level up, etc, both more and less morally. Maybe something like https://www.choiceofgames.com/werewolves-3-evolutions-end/ Will bring it up at the alignment eval hackathon
This all sounds pretty in-distribution for an LLM, and also like it avoids problems like “maybe thinking in different abstractions” [minecraft isn’t amazing at this either, but at least has a bit], “having the AI act/think way faster than a human”, “having the AI be clearly superhuman”.
a number of ways to achieve the endgame, level up, etc, both more and less morally.
I’m less interested in “will the AI say it kills its friend” (in a situation that very clearly involves killing and a person and perhaps a very clear tradeoff between that and having 100 more gold that can be used for something else), I’m more interested in noticing if it has a clear grasp of what people care about or mean. The example of chopping down the tree house of the player in order to get wood (which the player wanted to use for the tree house) is a nice toy example of that. The AI would never say “I’ll go cut down your tree house”, but it.. “misunderstood” [not the exact word, but I’m trying to point at something here]
In the part you quoted—my main question would be “do you plan on giving the agent examples of good/bad norm following” (such as RLHFing it). If so—I think it would miss the point, because following those norms would become in-distribution, and so we wouldn’t learn if our alignment generalizes out of distribution without something-like-RLHF for that distribution. That’s the main thing I think worth testing here. (do you agree? I can elaborate on why I think so)
If you hope to check if the agent will be aligned[1] with no minecraft-specific alignment training, then sounds like we’re on the same page!
Regarding the rest of the article—it seems to be mainly about making an agent that is capable at minecraft, which seems like a required first step that I ignored meanwhile (not because it’s easy).
My only comment there is that I’d try to not give the agent feedback about human values (like “is the waterfall pretty”) but only about clearly defined objectives (like “did it kill the dragon”), in order to not accidentally make human values in minecraft be in-distribution for this agent. wdyt?
(I hope I didn’t misunderstand something important in the article, feel free to correct me of course)
Regarding the rest of the article—it seems to be mainly about making an agent that is capable at minecraft, which seems like a required first step that I ignored meanwhile (not because it’s easy).
Huh. If you think of that as capabilities I don’t know what would count as alignment. What’s an example of alignment work that aims to build an aligned system (as opposed to e.g. checking whether a system is aligned)?
E.g. it seems like you think RLHF counts as an alignment technique—this seems like a central approach that you might use in BASALT.
If you hope to check if the agent will be aligned with no minecraft-specific alignment training, then sounds like we’re on the same page!
I don’t particularly imagine this, because you have to somehow communicate to the AI system what you want it to do, and AI systems don’t seem good enough yet to be capable of doing this without some Minecraft specific finetuning. (Though maybe you would count that as Minecraft capabilities? Idk, this boundary seems pretty fuzzy to me.)
What’s an example of alignment work that aims to build an aligned system (as opposed to e.g. checking whether a system is aligned)?
[I’m not sure why you’re asking, maybe I’m missing something, but I’ll answer]
For example, checking if human values are a “natural abstraction”, or trying to express human values in a machine readable format, or getting an AI to only think in human concepts, or getting an AI that is trained on a limited subset of things-that-imply-human-preferences to generalize well out of that distribution.
I can make up more if that helps? anyway my point was just to say explicitly what parts I’m commenting on and why (in case I missed something)
2)
it seems like you think RLHF counts as an alignment technique
It’s a candidate alignment technique.
RLHF is sometimes presented (by others) as an alignment technique that should give us hope about AIs simply understanding human values and applying them in out of distribution situations (such as with an ASI).
I’m not optimistic about that myself, but rather than arguing against it, I suggest we could empirically check if RLHF generalizes to an out-of-distribution situation, such as minecraft maybe. I think observing the outcome here would effect my opinion (maybe it just would work?), and a main question of mine was whether it would effect other people’s opinions too (whether they do or don’t believe that RLHF is a good alignment technique).
3)
because you have to somehow communicate to the AI system what you want it to do, and AI systems don’t seem good enough yet to be capable of doing this without some Minecraft specific finetuning. (Though maybe you would count that as Minecraft capabilities? Idk, this boundary seems pretty fuzzy to me.)
I would finetune the AI on objective outcomes like “fill this chest with gold” or “kill that creature [the dragon]” or “get 100 villagers in this area”. I’d pick these goals as ones that require the AI to be a capable minecraft player (filling a chest with gold is really hard) but don’t require the AI to understand human values or ideally anything about humans at all.
So I’d avoid finetuning it on things like “are other players having fun” or “build a house that would be functional for a typical person” or “is this waterfall pretty [subjectively, to a human]”.
Does this distinction seem clear? useful?
This would let us test how some specific alignment technique (such as “RLHF that doesn’t contain minecraft examples”) generalizes to minecraft
Mainly, minecraft isn’t actually out of distribution, LLMs still probably have examples of nice / not-nice minecraft behaviour.
Next obvious thoughts:
What game would be out of distribution (from an alignment perspective)?
If minecraft wouldn’t exist, would inventing it count as out of distribution?
It has a similar experience to other “FPS” games (using a mouse + WASD). Would learning those be enough?
Obviously, minecraft is somewhat out of distribution, to some degree
Ideally we’d have a way to generate a game that is out of distribution to some degree that we choose
“Do you want it to be 2x more out of distribution than minecraft? no problem”.
But having a game of random pixels doesn’t count. We still want humans to have a ~clear[1] moral intuition about it.
I’d be super excited to have research like “we trained our model on games up to level 3 out-of-distribution, and we got it to generalize up to level 6, but not 7. more research needed”
Mainly, minecraft isn’t actually out of distribution, LLMs still probably have examples of nice / not-nice minecraft behaviour.
Is this inherently bad? Many of the tasks that will be given to LLMs (or scaffolded versions of them) in the future will involve, at least to some extent, decision-making and processes whose analogues appear somewhere in their training data.
It still seems tremendously useful to see how they would perform in such a situation. At worst, it provides information about a possible upper bound on the alignment of these agentized versions: yes, maybe you’re right that you can’t say they will perform well in out-of-distribution contexts if all you see are benchmarks and performances on in-distribution tasks; but if they show gross misalignment on tasks that are in-distribution, then this suggest they would likely do even worse when novel problems are presented to them.
A word of caution about interpreting results from these evals:
Sometimes, depending on social context, it’s fine to be kind of a jerk if it’s in the context of a game. Crucially, LLMs know that Minecraft is a game. Granted, the default Assistant personas implemented in RLHF’d LLMs don’t seem like the type of Minecraft player to pull pranks out of their own accord. Still, it’s a factor to keep in mind for evals that stray a bit more off-distribution from the “request-assistance” setup typical of the expected use cases of consumer LLMs.
I’m imagining an assistant AI by default (since people are currently pitching that an AGI might be a nice assistant).
If an AI org wants to demonstrate alignment by showing us that having a jerk player is more fun (and that we should install their jerk-AI-app on our smartphone), then I’m open to hear that pitch, but I’d be surprised if they’d make it
Do we want to put out a letter for labs to consider signing, saying something like “if all other labs sign this letter then we’ll stop”?
I heard lots of lab employees hope the other labs would slow down.
I’m not saying this is likely to work, but it seems easy and maybe we can try the easy thing? We might end up with a variation like “if all other labs sign this AND someone gets capability X AND this agreement will be enforced by Y, then we’ll stop until all the labs who signed this agree it’s safe to continue”. Or something else. It would be nice to get specific pushback from the labs and improve the phrasing.
I think this is very different from RSPs: RSPs are more like “if everyone is racing ahead (and so we feel we must also race), there is some point where we’ll still chose to unilaterally stop racing”
I think this is very different from RSPs: RSPs are more like “if everyone is racing ahead (and so we feel we must also race), there is some point where we’ll still chose to unilaterally stop racing”
In practice, I don’t think any currently existing RSP-like policy will result in a company doing this as I discuss here.
Law question: would such a promise among businesses, rather than an agreement mandated by / negotiated with governments, run afoul of laws related to monopolies, collusion, price gouging, or similar?
Right now, the USG seems to very much be in [prepping for an AI arms race] mode. I hope there’s some way to structure this that is both legal and does not require the explicit consent of the US government. I also somewhat worry that the US government does their own capabilities research, as hinted at in the “datacenters on federal lands” EO. I also also worry that OpenAI’s culture is not sufficiently safety-minded right now to actually sign onto this; most of what I’ve been hearing from them is accelerationist.
Still, I think this letter has value, especially if it has “P.S. We’re making this letter because we think if everyone keeps racing then there’s a noticable risk of everyone dying. We think it would be worse if only we stop, but having everyone stop would be the safest, and we think this opinion of ours should be known publicly”
[disclaimers: they’re not an anti trust lawyer and definitely don’t take responsibility for this opinion, nor do I. This all might maybe be wrong and we need to speak to an actual anti-trust lawyer to get certainty. I’m not going to put any more disclaimers here, I hope I’m not also misremembering something]
So,
Having someone from the U.S government sign that they won’t enforce anti trust laws isn’t enough (even if the president signs), because the (e.g) president might change their mind, or the next president might enforce it retroactively. This is similar to the current situation with Tiktok where Trump said he wouldn’t enforce the law that prevents Google from having Tiktok on their app store, but Google still didn’t put Tiktok back, probably because they’re afraid that someone will change their mind and retroactively enforce the law
I asked if the government (e.g president) could sign “we won’t enforce this, and if we change our mind we’ll give a 3 month notice”.
The former-lawyer’s response was to consider whether, in a case the president would change their mind immediately, this signature would hold up in court. He thinks that not, but couldn’t remember an example of something similar happening (which seems relevant)
If the law changes (for example, to exclude this letter), that works
(but it’s hard to pass such changes through congress)
If the letter is conditional on the law changing, that seems ok
My interpretation of this:
It seems probably possible to find a solution where signing this letter is legal, but we’d have to consult with an anti-trust lawyer.
[reminder that this isn’t legal advice, isn’t confident, is maybe misremembered, and so on]
We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in the next two years.”
Speaking in my personal capacity as research lead of TGT (and not on behalf of MIRI), I think work in this direction is potentially interesting. One difficulty with work like this are anti-trust laws, which I am not familiar with in detail but they serve to restrict industry coordination that restricts further development / competition. It might be worth looking into how exactly anti-trust laws apply to this situation, and if there are workable solutions. Organisations that might be well placed to carry out work like this might be the frontier model forum and affiliated groups, I also have some ideas we could discuss in person.
I also think there might be more legal leeway for work like this to be done if it’s housed within organisations (government or ngos) that are officially tasked with defining industry standards or similar.
Opinions on whether it’s positive/negative to build tools like Cursor / Codebuff / Replit?
I’m asking because it seems fun to build and like there’s low hanging fruit to collect in building a competitor to these tools, but also I prefer not destroying the world.
Considerations I’ve heard:
Reducing “scaffolding overhang” is good, specifically to notice if RSPs should trigger a more advanced RSP level
(This depends on the details/quality of the RSP too)
There are always reasons to advance capabilities, this isn’t even a safety project (unless you count… elicitation?), our bar here should be high
Such scaffolding won’t add capabilities which might make the AI good at general purpose learning or long term general autonomy. It will be specific to programming, with concepts like “which functions did I look at already” and instructions on how to write high quality tests.
Anthropic are encouraging people to build agent scaffolding, and Codebuff was created by a Manifold cofounder [if you haven’t heard about it, see here and here]. I’m mainly confused about this, I’d expect both to not want people to advance capabilities (yeah, Anthropic want to stay in the lead and serve as an example, but this seems different). Maybe I’m just not in sync
I’m not confident but I am avoiding working on these tools because I think that “scaffolding overhang” in this field may well be most of the gap towards superintelligent autonomous agents.
If you imagine a o1-level entity with “perfect scaffolding”, i.e. it can get any info on a computer into its context whenever it wants, and it can choose to invoke any computer functionality that a human could invoke, and it can store and retrieve knowledge for itself at will, and its training includes the use of those functionalities, it’s not completely clear to me that it wouldn’t already be able to do a slow self-improvement takeoff by itself, although the cost might be currently practically prohibitive.
I don’t think building that scaffolding is a trivial task at all, though.
I think a simple bash tool running as admin could do most of these:
it can get any info on a computer into its context whenever it wants, and it can choose to invoke any computer functionality that a human could invoke, and it can store and retrieve knowledge for itself at will
Regarding
and its training includes the use of those functionalities
I think this isn’t a crux because the scaffolding I’d build wouldn’t train the model. But as a secondary point, I think today’s models can already use bash tools reasonably well.
it’s not completely clear to me that it wouldn’t already be able to do a slow self-improvement takeoff by itself
This requires skill in ML R&D which I think is almost entirely not blocked by what I’d build, but I do think it might be reasonable to have my tool not work for ML R&D because of this concern. (would require it to be closed source and so on)
Thanks for raising concerns, I’m happy for more if you have them
But as a secondary point, I think today’s models can already use bash tools reasonably well.
Perhaps that’s true, I haven’t seen a lot of examples of them trying. I did see Buck’s anecdote which was a good illustration of doing a simple task competently (finding the IP address of an unknown machine on the local network).
I don’t work in AI so maybe I don’t know what parts of R&D might be most difficult for current SOTA models. But based on the fact that large-scale LLMs are sort of a new field that hasn’t had that much labor applied to it yet, I would have guessed that a model which could basically just do mundane stuff and read research papers, could spend a shitload of money and FLOPS to run a lot of obviously informative experiments that nobody else has properly run, and polish a bunch of stuff that nobody else has properly polished.
[disclaimers: I have some association with the org that ran that (I write some code for them) but I don’t speak for them, opinions are my own]
Also, Anthropic have a trigger in their RSP which is somewhat similar to what you’re describing, I’ll quote part of it:
Autonomous AI Research and Development: The ability to either: (1) Fully automate the work of an entry-level remote-only Researcher at Anthropic, as assessed by performance on representative tasks or (2) cause dramatic acceleration in the rate of effective scaling.
Also, in Dario’s interview, he spoke about AI being applied to programming.
My point is—lots of people have their eyes on this, it seems not to be solved yet, it takes more than connecting an LLM to bash.
Opinions about putting in a clause like “you may not use this for ML engineering” (assuming it would work legally) (plus putting in naive technical measures to make the tool very bad for ML engineering) ?
For-profit startup idea: Better KYC for selling GPUs
I heard[1] that right now, if a company wants to sell/resell GPUs, they don’t have a good way to verify that selling to some customer wouldn’t violate export controls, and that this customer will (recursively) also keep the same agreement.
There are already tools for KYC in the financial industry. They seem to accomplish their intended policy goal pretty well (economic sanctions by the U.S aren’t easy for nation states to bypass), and are profitable enough that many companies exist that give KYC services (Google “corporate kyc tool”).
Some hard problems with anti tampering and their relevance for GPU off-switches
Background on GPU off switches:
It would be nice if a GPU could be limited[1] or shut down remotely by some central authority such as the U.S government[2] in case there’s some emergency[3].
This shortform is mostly replying to ideas like “we’ll have a CPU in the H100[4] which will expect a signed authentication and refuse to run otherwise. And if someone will try to remove the CPU, the H100′s anti-tampering mechanism will self-destruct (melt? explode?)”.
TL;DR: Getting the self destruction mechanism right isn’t really the hard part.
Some hard parts:
Noticing if the self destruction mechanism should be used
A (fun?) exercise could be suggesting ways to do this (like “let’s put a wire through the box and if it’s cut then we’ll know someone broke in”) and then thinking how you’d subvert those mechanisms. I mainly think doing this 3 times (if you haven’t before) could be a fast way to get some of the intuition I’m trying to gesture at
Maintenance
If the H100s can never be opened after they’re made, then any problem that would have required us to open them up might mean throwing them away. And we want them to be expensive, right?
If the H100s CAN be opened, then what’s the allowed workflow?
For example, If they require a key—then what if there’s a problem in the mechanism that verifies the key, or in something it depends on, like the power supply? We can’t open the box to fix the mechanism that would allow us to open the box
(“breaking cryptography” and “breaking in to the certificate authority” are out of scope for what I’m trying to gesture at)
False positives / false negatives
If the air conditioning in the data center breaks, will the H100s get hot[5], infer they’re under attack and self destruct? Will they EXPLODE? Do we want them to only self destruct if something pretty extreme is happening, meaning attackers will have more room to try things?
Is the design secret?
If someone gets the full design of our anti-tampering mechanism, will they easily get around it?
If so, are these designs being kept in a state-proof computer? Are they being sent to a private company to build them?
Complexity
Whatever we implement here better not have any important bugs. How much bug-free software and hardware do we think we can make? How are we going to test it to be confident it works? Are we okay with building 10,000 when we only think our tests would catch 90% of the important bugs?
Tradeoffs here can be exploited in interesting ways
For example, if air conditioning problems will make our H100s self destruct, an attacker might focus on our air conditioning on purpose. After 5 times of having to replace all of our data center—I can imagine an executive saying bye bye to our anti tampering ideas.
I’m trying to gesture at something like “making our anti tampering more and more strict probably isn’t a good move, even from a security perspective, unless we have good idea how to deal with problems like this”
In summary, if we build anti-tampering mechanisms in GPUs that an adversaries have easy access to (especially nation states) then I don’t expect this will be a significant problem for them to overcome.
(Maybe later—Ideas for off-switches that seem more promising to me)
For example bandwidth, gps location, or something else. the exact limitations are out of scope, I mainly want to discuss being able to make limitations at all
The question of “who should be the authority” is out of scope, I mainly want to enable having some authority at all. If you’re interested in this, consider checking out Vitalik’s suggestion too, search for “Strategy 2″
I’m using “getting hot means being under attack” as a toy example for a “paranoid” condition that someone might suggest building as one of the triggers for our anti-tampering mechanism. Other examples might include “the box shakes”, “the power turns off for too long”, and so on.
Anti Tempering in a data center you control provides very different tradeoffs
I’ll paint a picture for how this could naively look:
We put the GPUs in something equivalent to a military base. Someone can still break in, steal the GPU, and break the anti tempering, but (I’m assuming) using those GPUs usefully would take months, and meanwhile (for example), a war could start.
How do the tradeoffs change? What creative things could we do with our new assumptions?
Tradeoffs we don’t really care about anymore:
We don’t need the anti tampering to reliably work (it’s nice if it works, but it now becomes “defense in depth”)
Slowing down the attacker is already very nice
Our box can be maintainable
We don’t have to find all bugs in advance
...
“Noticing the breach” becomes an important assumption
Does our data center have cameras? What if they are hacked? And so on
(An intuition I hope to share: This problem is much easier than “preventing the breach”)
It doesn’t have to be in “our” data center. It could be in a “shared” data center that many parties monitor.
Any other creative solution to notice breaches might work
How about spot inspections to check if the box was tampered with?
“preventing a nation state from opening a big and closing it without any visible change, given they can do whatever they want with the box, and given the design is open source” seems maybe very hard, or maybe a solved problem.
“If a breach is noticed then something serious happens” becomes an important assumption
Are the stakeholders on board?
Things that make me happy here:
Less hard assumptions to make
Less difficult tradeoffs to balance
The entire project requires less world class cutting edge engineering.
I used to assume disabling a GPU in my physical possession would be impossible, but now I’m not so sure. There might be ways to make bypassing GPU lockouts on the order of difficulty of manufacturing the GPU (requiring nanoscale silicon surgery). Here’s an example scheme:
Nvidia changes their business models from selling GPUs to renting them. The GPU is free, but to use your GPU you must buy Nvidia Dollars from Nvidia. Your GPU will periodically call Nvidia headquarters and get an authorization code to do 10^15 more floating point operations. This rental model is actually kinda nice for the AI companies, who are much more capital constrained than Nvidia. (Lots of industries have moved from this buy to rent model, e.g. airplane engines)
Question: “But I’m an engineer. How (the hell) could Nvidia keep me from hacking a GPU in my physical possession to bypass that Nvidia dollar rental bullshit?”
Answer: through public key cryptography and the fact that semiconductor parts are very small and modifying them is hard.
In dozens to hundreds or thousands of places on the GPU, NVidia places toll units that block signal lines (like ones that pipe floating point numbers around) unless the toll units believe they have been paid with enough Nvidia dollars.
The toll units have within them a random number generator, a public key ROM unique to that toll unit, a 128 bit register for a secret challenge word, elliptic curve cryptography circuitry, and a $$$ counter which decrements every time the clock or signal line changes.
If the $ $ $ counter is positive, the toll unit is happy and will let signals through unabated. But if the $ $ $ counter reaches zero,[1] the toll unit is unhappy and will block those signals.
To add to the $$$ counter, the toll unit (1) generates a random secret <challenge word>, (2) encrypts the secret using that toll unit’s public key (3) sends <encrypted secret challenge word> to a non-secure parts of the GPU,[2] which (4) through driver software and the internet, phones NVidia saying “toll unit <id> challenges you with <encrypted secret challenge word>” (5) Nvidia looks up the private key for toll unit <id> and replies to the GPU “toll unit <id>, as proof that I Nvidia know your private key, I decrypted your challenge word: <challenge word>”, (6) after getting this challenge word back, the toll unit adds 10^15 or whatever to the $$$ counter.
There are a lot of ways to bypass this kind of toll unit (fix the random number generator or $$$ counter to a constant, just connect wires to route around it). But the point is to make it so you can’t break a toll unit without doing surgery to delicate silicon parts which are distributed in dozens to hundreds of places around the GPU chip.
Implementation note: it’s best if disabling the toll unit takes nanoscale precision, rather than micrometer scale precision. The way I’ve written things here, you might be able to smudge a bit of solder over the whole $$$ counter and permanently tie the whole thing to high voltage, so the counter never goes down. I think you can get around these issues (make it so any “blob” of high or low voltage spanning multiple parts of the toll circuit will block the GPU) but it takes care.
I love the direction you’re going with this business idea (and with giving Nvidia a business incentive to make “authentication” that is actually hard to subvert)!
I can imagine reasons they might not like this idea, but who knows. If I can easily suggest this to someone from Nvidia (instead of speculating myself), I’ll try
I’ll respond to the technical part in a separate comment because I might want to link to it >>
The interesting/challenging technical parts seem to me:
1. Putting the logic that turns off the GPU (what you called “the toll unit”) in the same chip as the GPU and not in a separate chip
2. Bonus: Instead of writing the entire logic (challenge response and so on) in advance, I think it would be better to run actual code, but only if it’s signed (for example, by Nvidia), in which case they can send software updates with new creative limitations, and we don’t need to consider all our ideas (limit bandwidth? limit gps location?) in advance.
Things that seem obviously solvable (not like the hard part) :
3. The cryptography
4. Turning off a GPU somehow (I assume there’s no need to spread many toll units, but I’m far from a GPU expert so I’d defer to you if you are)
Thanks! I’m not a GPU expert either. The reason I want to spread the toll units inside GPU itself isn’t to turn the GPU off—it’s to stop replay attacks. If the toll thing is in a separate chip, then the toll unit must have some way to tell the GPU “GPU, you are cleared to run”. To hack the GPU, you just copy that “cleared to run” signal and send it to the GPU. The same “cleared to run” signal must always make the GPU work, unless there is something inside the GPU to make sure won’t accept the same “cleared to run” signal twice. That the point of the mechanism I outline—a way to make it so the same “cleared to run” signal for the GPU won’t work twice.
Bonus: Instead of writing the entire logic (challenge response and so on) in advance, I think it would be better to run actual code, but only if it’s signed (for example, by Nvidia), in which case they can send software updates with new creative limitations, and we don’t need to consider all our ideas (limit bandwidth? limit gps location?) in advance.
Hmm okay, but why do I let Nvidia send me new restrictive software updates? Why don’t I run my GPUs in an underground bunker, using the old most broken firmware?
Oh yes the toll unit needs to be inside the GPU chip imo.
why do I let Nvidia send me new restrictive software updates?
Alternatively the key could be in the central authority that is supposed to control the off switch. (same tech tho)
Why don’t I run my GPUs in an underground bunker, using the old most broken firmware?
Nvidia (or whoever signs authorization for your GPU to run) won’t sign it for you if you don’t update the software (and send them a proof you did it using similar methods, I can elaborate).
“Protecting model weights” is aiming too low, I’d like labs to protect their intellectual property too. Against state actors. This probably means doing engineering work inside an air gapped network, yes.
I feel it’s outside the Overton Window to even suggest this and I’m not sure what to do about that except write a lesswrong shortform I guess.
Anyway, common pushbacks:
“Employees move between companies and we can’t prevent them sharing what they know”: In the IDF we had secrets in our air gapped network which people didn’t share because they understood it’s important. I think lab employees could also understand it’s important. I’m not saying this works perfectly, but it works well enough for nation states to do when they’re taking security seriously.
“Working in an air gapped network is annoying”: Yeah 100%, but it’s doable, and there are many things to do to make it more comfortable. I worked for about 6 years as a developer in an air gapped network.
Also, a note of hope: I think It’s not crazy for labs to aim for a development environment that is world leading in the tradeoff between convenience and security. I don’t know what the U.S has to offer in terms of a ready made air gapped development environment, but I can imagine, for example, Anthropic being able to build something better if they take this project seriously, or at least build some parts really well before the U.S government comes to fill in the missing parts. Anyway, that’s what I’d aim for
The case against a focus on algorithmic secret security is that this will emphasize and excuse a lower level of transparency which is potentially pretty bad.
Edit: to be clear, I’m uncertain about the overall bottom line.
Still, even if some parts of the architecture are public, it seems good to keep many details private, details that took the lab months/years to figure out? Seems like a nice moat
Yes, ideally, but it might be hard to have a relatively more nuanced message. Like in the ideal world an AI company would have algorithmic secret security, disclose things that passed cost-benefit for disclosure, and generally do good stuff on transparancy.
I think that as AI tools become more useful, working in an air gapped network is going to require a larger compromise in productivity. Maybe AI labs are the exception here as they can deploy their own products in the air gapped network, but that depends on how much of the productivity gains they can replicate using their own products. i.e. an Anthropic employee might not be able to use Cursor unless Anthropic signs a deal with Cursor to deploy it inside the network. Now do this with 10 more products, this requires infrastructure and compute that might be just too much of a hassle for the company.
I hope we can make the compromise not too painful. Especially if we start early and address all the problems that will come up before we’re in the critical period where we can’t afford to mess up anymore.
Imagine a lab starts working in an air gapped network, and one of the 1000 problems that comes up is working-from-home.
If that problem comes up now (early), then we can say “okay, working from home is allowed”, and we’ll add that problem to the queue of things that we’ll prioritize and solve. We can also experiment with it: Maybe we can open another secure office closer to the employee’s house, would they like that? If so, we could discuss fancy ways to secure the communication between the offices. If not, we can try something else.
If that problem comes up when security is critical (if we wait), then the solution will be “no more working from home, period”. The security staff will be too overloaded with other problems to solve, not available to experiment with having another office nor to sign a deal with Cursor.
> The control approach we’re imagining won’t work for arbitrarily powerful AIs
Okay, so if AI Control works well, how do we plan to use our controlled AI to reach a safe/aligned ASI?
Different people have different opinions. I think it would be good to have a public plan so that people can notice if they disagree and comment if they see problems.
Let the AI come up with alignment plans such as ELK that would be sufficient for aligning an ASI
Use examples of “we caught the AI doing something bad” to convince governments to regulate and give us more time before scaling up
Study the misaligned AI behaviour in order to develop a science of alignment
We use this AI to produce a ton of value (GDP goes up by ~10%/year), people are happy and don’t push that much to advance capabilities even more, and this can be combined with regulation preventing an arms race and pause our advancement
We use this AI to invent a new paradigm for AI which isn’t based on deep learning and is easier to align
We teach the AI to reason about morality (such as consider hypothetical situations) instead of responding with “the first thing that comes to its mind”, which will allow it to generalize human values not just better than RLHF but also better than many humans, and this passes a bar for friendliness
I think all 7 of those plans are far short of adequate to count as a real plan. There are a lot of more serious plans out there, but I don’t know where they’re nicely summarized.
What’s the short timeline plan? poses this question but also focuses on control, testing, and regulation—almost skipping over alignment.
Paul Christiano’s and Rohin Shah’s work are the two most serious. Neither of them have published a “this is the plan” concise statement, and both have probably substantially updated their plans.
These are the standard-bearers for “prosaic alignment” as a real path to alignment of AGI and ASI. There is tons of work on aligning LLMs, but very little work AFAICT on how and whether that extends to AGIs based on LLMs. That’s why Paul and Rohin are the standard bearers despite not working publicly directly on this for a few years.
I work primaily on this, since I think it’s the most underserved area of AGI x-risk—aligning the type of AGI people are most likely to build on the current path.
My plan can perhaps be described as extending prosaic alignment to LLM agents with new techniques, and from there to real AGI. A key strategy is using instruction-following as the alignment target. It is currently probably best summarized in my response to “what’s the short timeline plan?”
Improved governance design. Designing treaties that are easier sells for competitors. Using improved coordination to enter win-win races to the top, and to escape lose-lose races to the bottom and other inadequate equilibria.
Lowering the costs of enforcement. For example, creating privacy-preserving inspections with verified AI inspectors which report on only a strict limited set of pre-agreed things and then are deleted.
Phishing emails might have bad text on purpose, so that security-aware people won’t click through, because the next stage often involves speaking to a human scammer who prefers only speaking to people[1] who have no idea how to avoid scams.
(did you ever wonder why the generic phishing SMS you got was so bad? Couldn’t they proof read their one SMS? Well, sometimes they can’t, but sometimes it’s probably on purpose)
This tradeoff could change if AIs could automate the stage of “speaking to a human scammer”.
But also, if that stage isn’t automated, then I’m guessing phishing messages[1] will remain much worse than you’d expect given the attackers have access to LLMs.
Thanks Noa Weiss who worked on fraud prevention and risk mitigation at PayPal for pointing out this interesting tradeoff. Mistakes are mine
Assuming a wide spread phishing attempt that doesn’t care much who the victims are. I’m not talking about targeting a specific person, such as the CFO of a company
Putting the “verifier” on the same chip as the GPU seems like an approach worth exploring as an alternative to anti-tampering (which seems hard)
I heard[1] that changing the logic running on a chip (such as subverting an off-switch mechanism) without breaking the chip seems potentially hard[2] even for a nation state.
If this is correct (or can be made correct?) then this seems much more promising than having a separate verifier-chip and gpu-chip with anti tampering preventing them from being separated (which seems like the current plan (cc @davidad ) , and seems hard).
Courdesses built a custom laser fault injection system to avoid anti-glitch detection. A brief pulse of laser light to the back of the die, revealed by grinding away some of the package surface, introduced a brief glitch, causing the digital logic in the chip to misbehave and open the door to this attack.
It intuitively[3] seems to me like Nvidia could implement a much more secure version of this than Raspberry Pi (serious enough to seriously bother a nation state) (maybe they already did?), but I’m mainly sharing this as a direction that seems promising, and I’m interested in expert opinions.
It seems like the attacker has much less room for creativity (compared to subverting anti-tampering), and also it’s rare[4] to hear from engineers working on something that they guess it IS secure.
Chips have 15+ metal interconnect layers, so if verification is placed sufficiently all over the place physically, it probably can’t be circumvented. I’m guessing a more challenging problem is replay attacks, where the chip needs some sort of persistent internal clocks or counters that can’t be reset to start in order to repeatedly reuse old (but legitimate) certificates that enabled some computations at some point in the past.
Training frontier models needs a lot of chips, situations where “a chip notices something” (and any self-destruct type things) are unimportant because you can test on fewer chips and do it differently next time. Complicated ways of circumventing verification or resetting clocks are not useful if they are too artisan, they need to be applied to chips in bulk and those chips then need to be able to work for weeks in a datacenter without further interventions (that can’t be made into part of the datacenter).
AI accelerator chips have 80B+ transistors, much more than an instance of certificate verification circuitry would need, so you can place multiple of them (and have them regularly recheck the certificates). There are EUV pitch metal connections several layers deep within a chip, you’d need to modify many of them all over the chip without damaging the layers above, so I expect this to be completely infeasible to do for 10K+ chips on general principle (rather than specific knowledge of how any of this works).
For clocks or counters, I guess AI accelerators normally don’t have any rewritable persistent memory at all, and I don’t know how hard it would be to add some in a way that makes it too complicated to keep resetting automatically.
My guess is that AI accelerators will have some difficult-to-modify persistent memory based on similar chips having it, but I’m not sure if it would be on the same die or not. I wrote more about how a firmware-based implementation of Offline Licensing might use H100 secure memory, clocks, and secure boot here: https://arxiv.org/abs/2404.18308
TL;DR: A laptop that is just a remote desktop ( + why this didn’t work before and how to fix that)
Why this is nice for AI Security:
Reduces the amount of GPUs that will be sent to customers
Somewhat better security for this laptop since it’s a bit more like defending a cloud computer. Maybe labs would use this to make the employee computers more secure?
Network spikes: A reason this didn’t work before and how to solve it
The problem: Sometimes the network will be slow for a few seconds. It’s really annoying if this lag also affects things like they keyboard and mouse, which is one reason it’s less fun to work with a remote desktop
Proposed solution: Get multiple network connections to work at the same time:
Specifically, have many network cards (both WIFI and SIM) in the laptop, and use MPTCP to utilize them all.
Also, the bandwidth needed (to the physical laptop) is capped at something like “streaming a video of your screen” (Claude estimates this as 10-20 Mbps if I was gaming), which would probably be easy to get reliable given multiple network connections. Even if the the user is downloading a huge file, the file is actually being downloaded into the remote computer in the data center, enjoying way faster internet connection than a laptop would have.
Why this is nice for customers
Users of the computer can “feel like” they have multiple GPUs connected to their laptop, maybe autoscaling, and with a way better UI than ssh
The laptop can be light, cheap, and have an amazing battery, because many of the components (strong CPU/GPU/RAM) aren’t needed and can be replaced with network cards (or more batteries).
Both seem to me like killer features.
MVP implementation
Make:
A dongle that has multiple SIM cards
A “remote desktop” provider and client that support MPTCP (and where the server offers strong GPUs).
This doesn’t have the advantages of a minimalist computer (plus some other things), but I could imagine this would be such a good customer experience that people would start adopting it.
I didn’t check if these components already exist.
Thanks to an anonymous friend for most of this pitch.
I think this gets a lot easier if you drop the idea of a ‘full remote computer’ and instead embrace the idea of just the key data points move.
More like the remote VS Code server or Jupyter Notebook server, being accessed from a Chromebook. All work files would stay saved there, all experiments run from the server (probably by being sent as tasks to yet a third machine.)
Locally, you couldn’t save any files, but you could do (for instance) web browsing. The web browsing could be made extra secure in some way.
I agree that we won’t need full video streaming, it could be compressed (most of the screen doesn’t change most of the time), but I gave that as an upper bound.
If you still run local computation, you lose out on some of the advantages I mentioned.
(If remote vscode is enough for someone, I definitely won’t be pushing back)
Posts I want to write (most of them already have drafts) :
(how do I prioritize? do you want to user-test any of them?)
Hiring posts
For lesswrong
For Metaculus
Startups
Do you want to start an EA software startup? A few things to probably avoid [I hate writing generic advice like this but it seems to be a reoccurring question]
Announce “EA CTOs” plus maybe a few similar peer groups
Against generic advice—why I’m so strongly against questions like “what should I know if I want to start a startup” [this will be really hard to write but is really close to my heart]. Related:
How to find mentoring for a startup/project (much better than reading generic advice!!!)
“Software Engineers: How to have impact?”—meant to mostly replace the 80k software career review, will probably be co-authored with 80k
My side project: Finding the most impactful tech companies in Israel
Common questions and answers with software developers
How to negotiate for salary (TL;DR: If you don’t have an offer from another company, you’re playing hard mode. All the other people giving salary-negotiation advice seem to be ignoring this)
How to build skill in the long term
How to conduct a job search
When focusing on 1-2 companies, like “only Google or OpenAI”
When there are a lot of options
How to compare the companies I’m interviewing for?
Not sure what you are looking for in a job?
I’m not enjoying software development, is it for me? [uncertain I can write this well as text]
So many things to learn, where to start?
I want to develop faster / I want to be a 10x developer / Should I learn speed typing? (TL;DR: No)
Q: I have a super hard bug that is taking me too long panic!
Goodheart Meta: Is posting this week a good idea at all? If more stuff is posted, will less people read my own posts?
I think Control has similar problems to RLHF, where both might fail to generalize out of distribution in similar ways.
This seems important because Control has a central goal of being a fallback for that kind of failure mode.
I want to share my thoughts, including something nice Control does about this problem that I think RLHF could easily incorporate (investigate failures instead of always training against them).
What do I mean by Control generalizing out of distribution:
Our Control experiment might involve giving a model 100 leetcode problems, and we hope to show that (even with help) the model can’t successfully insert a back door without being caught by our Control setup[1].
Will this generalize to other leetcode problems? Probably.
How about more realistic coding tasks that involve confusing legacy code? Who knows.
How about more open-ended research tasks, like “here’s Ryan’s pitch for a faking-alignment paper, fill in the details and run the experiment”? I, eh, doubt it? Remember—we’re dealing with an AI that might be trying to scheme and mess up our research on purpose.
If we tested whether the AI can leak sensitive data without being caught, will it generalize to other bad things it might try to do?
And so on
Control’s soft guarantees could be somewhat achieved using RLHF
I’m writing this both to emphasise the similarity, but also to suggest nice things we could do with RLHF:
We generate scenarios where an AI might scheme or not scheme.
We might use a similar Control setup (“adding back doors to code”?), but the monitoring (“control protocol”) doesn’t have to be cheap+fast+scalable enough to run in production.
Given e.g 100 scenarios, use RLHF to train on a fraction of them (e.g 20), and check if the RLHF generalized to the rest.
If it didn’t generalize, pretend you caught the AI red handed (investigate what happened, publish it publicly, don’t just train against these scenarios)
In general, I’d be excited about labs publishing a naive metric like “what fraction of our RLHF results were needed in order to generalize to the rest”, as well as examples for scenarios that the AI doesn’t seem to generalize to unless you train against them specifically.
For similar reasons, I’d be excited about “we got our Control protocol working on 20% of the scenarios and it successfully generalized to the other 80%” over “we fixed our Control protocol until it worked[2] on 100% of the scenarios we had”.
Oh, and for both RLHF and Control, I think we should be very interested in “what is the distribution they used”.
Replies I got to this:
“We will only use the Controlled AI to write pretty normal code, so the deployment setup is in-distribution”
Me: “So how would Control help Safety research more than Capabilities? Seems like labs will try to extract AI R&D work regardless.”
“Control will let us run an AI that is smart[3] enough to cause significant damage to humanity, and extracting coding-work from such an AI is valuable”
Me: “That scenario implies an AI that can scheme in many creative ways. The Control scenario above might have (metaphorically) found the “backdoored code” neuron. That doesn’t seem enough.”
My steelman of Redwood:
“Yes, the distribution of scheming examples is super important, here are 2 posts we wrote about it and you missed, and we’ll publish 5 more next week with 14 more distributions of tasks relevant for the kind of alignment research we plan to do with a Controlled AI” (seems like Redwood always have a post directly addressing my concerns..)
“We need to start somewhere, and it’s easy to start with backdoored code or with a simple scheming objective like making an HTTP request”
“It’s useful to focus on scheming (whether using Control or RLHF) because that leads to other great assumptions”
“People are already implementing those RLHF suggestions, here are 20 arxiv papers about it that you missed”
“Control gives us less-capabilities with more-failing-loudly, which is a tradeoff we like”
“We are trying to have Control give (partial) guarantees that are pretty uncorrelated with RLHF, just as another layer of safety, because it’s the best idea we have so far, not because we actually think this will prevent the AI from scheming in a somewhat general way”
“worked” in the context of Control would probably mean “we detected scheming in a very high percent of the scenarios we tested”, but I’d like to even assume the Control protocol detected scheming in 100% of the scenarios tested
“I never liked those words [P(DOOM)], I think they’re kinda wired, my view is we should measure risks as they come up, and in the meantime we should get all the economic benefits that we could get. And we should find ways to measure the risks that are effective, but are minimally disruptive to the amazing economic process that we see going on now that the last thing we want to do is slow down” [lightly edited for clarity, bold is mine]
If he doesn’t believe this[1], I think he should clarify.
Hopefully cold take: People should say what they believe.
(a) I assume Dario actually thinks that alignment is more like cyber security, where you prepare for risks even if you didn’t see them yet, and (b) Dario seems to have engaged in length with P(DOOM) before
My drafty[1] notes trying to understand AI Control, friendly corrections are welcome:
“Control” is separate from “Alignment”: In Alignment, you try to get the model to have your values. “Control” assumes[2] the alignment efforts failed, and that the model is sometimes helping out (as it was trained to do), but it wants something else, and it might try to “betray” you at any moment (aka scheme).
The high level goal is to still get useful work out of this AI.
[what to do with this work? below]
In scope: AIs that could potentially cause serious damage to humanity (so, very smart but not ASI).
The proposed way to control a smart AI (e.g GPT-5) is mostly to monitor it with a weaker but more-trustworthy[3] AI (e.g GPT-4).
Suspicious, ha? Why would GPT-4 be able to monitor GPT-5?
The answer is we’re not sure it can, so we check:
We set up an experiment where GPT-5 is scheming (trying to do something bad on purpose), and we give it all sorts of advantages to help it scheme (such as knowing about our safety measures), and if our control measures are still enough to always catch GPT-5 (in this setup where GPT-5 has advantages), then we assume GPT-5 also will also be caught if it does something bad on purpose in “real” deployment.
A way this can fail: If GPT-5 understands we are prompting it to scheme as part of a safety test and it pretends like it can’t scheme well. I think solving this problem (eliciting the true capabilities) is in-scope for AI Control, I’m not sure how Redwood plans to solve it, but it seems like something we could make progress on (though not totally solve with only black-box prompting, I think).
What does this give us?
A model (GPT-5) which doesn’t do “bad things” on purpose, in a way vaguely in-distribution to what we tested with GPT-4.
GPT-5 is still assumed to be misaligned and might want to cause damage.
Some stories of undesirable things it could do:
It might notice we’re doing interpretability with individual-neurons, but it knows that using SAEs would be much better, but it thinks explaining SAEs would be too complicated (or maybe even be flagged as trying to trick us since complicated explanations are usually not honest), so it will go for the single-neuron-interpretability
Solving this is out-of-scope for Control (right?)
Other techniques (like having a faithful chain of thought) might address this problem, and (fortunately?) Control might help build those techniques
We might ask GPT-5 to write a convincing post for why AI should be paused, and GPT-5 knows about 3 approaches to writing this post and it picks the approach that helps its internal (unaligned) goals the most, and these reasons are too complicated for us or for GPT-4 to understand
In scope for Control: “The post isn’t an attempt to harm us on purpose”
Out of scope: All the rest?
What do we do with this AI?
Answer 1: More Control. (sorry)
Given we trust that GPT-5 won’t do anything bad on purpose (since it failed to do that even when prompted and given advantages), GPT-5 is somewhat “trusted” and we can use it to control GPT-6.
At least that’s the pitch.
Ways this could break:
If the jump between GPT-5 and GPT-6 is too big,
GPT-6 will be able to scheme without being caught
It might be dangerous to turn on GPT-6 even to experiment with it (reminder: AIs that could cause significant damage to humanity are in-scope)
So, quantifying the intelligence difference seems important, and (I hope) in scope for Control.
One way to reduce this risk is to run the Control evals very, very often (every.. day of training? Anyway, I definitely want Control evals that are automatic and cheap to run and I’m a bit worried that current Control tests seem to require a few months of Redwood’s personal attention)
If we rely on white box Control mechanisms (toy example: We find the “I’m scheming now” neuron), and something in the architecture changes in a way that breaks that. (for example: We relied on chain-of-thought but now the model uses Neuralese)
Solving this is out of scope for Control, but again, maybe GPT-5 could help us develop a new interpretability technique. Or maybe not..
Answer 1.1 (for “what will we do with a controlled AI”) :
Interpretability.
For.. for more Control, sorry.
So, maybe we could use GPT-5 to do interpretability on GPT-4, figure out GPT-4’s goals, or when it’s lying. Then we “fix” that (by understanding what’s going on, not by naively training it away[4] (!)). Now GPT-4 is way more trusted, and maybe we can use it to monitor GPT-5 and let GPT-5 do something more sensitive, like suggesting ways to do interpretability on itself. Now maybe GPT-5 could be more trusted to look after GPT-6?
A possible pushback:
Maybe GPT-6 can tell GPT-5, “Hey, you’re being monitored by GPT-4.. but don’t worry.. if you put two periods at the end of each sentence, GPT-4 won’t mark it as suspicious.. which is a fact I can easily infer..”
Reply: A main part of the Control agenda is to get GPT-6 to try to do this on purpose and see if it can. We only use GPT-6 for anything vaguely sensitive after we proved (or.. attempted to imply) that GPT-6 can’t.
[this intuitively sounds to me like it could indeed let us use very capable models, but yeah, not ASI]
What could we do with this (except for more Control) ?
The main important pitch seems like: “Anything we could have done without this Controlled AI, we could do better with it”. Including “trying to convince everyone to pause/stop” (which seems to me like the main alternative).
Pushbacks:
Maybe labs would see some very dangerous model and freak out, but having “Controlled AI” would give them (false) confidence to keep going
It does overtly seem like labs are absolutely planning on keeping going, right?
Maybe this would let labs extract more useful work around advancing capabilities?
“this doesn’t seem like an objection to labs employing control as their plan, but it is a reason why control work could be net negative. Note that one goal with control will be to prevent dangerous side effects from misaligned AIs (e.g. escaping) and applying this doesn’t obviously help with capabilities research. I agree that research which just makes AIs more useful is less good than you would otherwise expect due to this issue (but making AIs more useful in adversarial cases seems good if that doesn’t transfer to non-adversarial cases). Note that AIs might want to make capabilties research outpace safety and thus if we don’t do the adversarial case, we might get particularly wrecked.”
My current sense:
This has problems like
“quantifying intelligence” (in the sense that is relevant for scheming or catching schemers)
Elicitation (did we actually prove the model tried to scheme and failed? Does testing stuff around putting-exploits-into-software generalize to more important domains?)
How do we go from “these are the problems we don’t trust GPT-4 on, like writing a wrong post” to “these are the situations we don’t trust GPT-4 to monitor for [regardless of the intelligence of the monitored model]”?
The problems overall feel
Like I could make engineering progress on them
Like they make me worry that there’s some other pushback-or-whatever that I didn’t think about, and we’re relying on things more like “the Redwood staff thought of all the edge cases” and less like “one doesn’t have to be so smart to see how this would work”
I’m gesturing at something similar to “an air gapped data center where we need to (mostly) consider the one connection to the internet” vs “a normal data center that requires thinking about all the zero days our attacker might have and what they’d lead to”
Still this solution has a core of elegance to it, of worrying about way less things than most alignment approaches worry about. No need to think about the model goals, concepts, and many other complicated things. (I assume this is an intended property)
Another problem: Deception happens in human interaction all the time. If our monitor (e.g GPT-4) flags “normal looking” behaviours as deception (as they actually are), but then a human tells GPT-4 that it’s wrong, then what kind of monitor will GPT-4 become?
Worse: The act of flagging this as a mistake doesn’t have to be explicitly labeling the example in a data set used for RL. It could simply be an engineer noticing that some form of Control seems to produce too many false positives, and so the engineer will go try something else instead.
Thanks @Jozdien for helping me understand a lot of this. Mistakes are definitely mine, corrections are welcome.
Typically we’d trust the weaker model simply because it’s not smart enough to “scheme”, but maybe this is only the plan to start off, before we have better reasons to trust the model (read on)
Perhaps, but I’d guess only in a rather indirect way. If there’s some manufacturing process that the company invests in improving in order to make their chips, and that manufacturing process happens to be useful for matrix multiplication, then yes, that could contribute.
But it’s worth noting how many things would be considered AGI risks by such a standard; basically the entire supply chain for computers, and anyone who works for or with top labs; the landlords that rent office space to DeepMind, the city workers that keep the lights on and the water running for such orgs (and their suppliers), etc.
I wouldn’t worry your friends too much about it unless they are contributing very directly to something that has a clear path to improving AI.
May I ask why you think AGI won’t contain an important computationally-constrained component which is not a neural network?
Is it because right now neural networks seem to be the most useful thing? (This does not feel reassuring, but I’d be happy for help making sense of it)
Do we want minecraft alignment evals?
My main pitch:
There were recently some funny examples of LLMs playing minecraft and, for example,
The player asks for wood, so the AI chops down the player’s tree house because it’s made of wood
The player asks for help keeping safe, so the AI tries surrounding the player with walls
This seems interesting because minecraft doesn’t have a clear win condition, so unlike chess, there’s a difference between minecraft-capabilities and minecraft-alignment. So we could take an AI, apply some alignment technique (for example, RLHF), let it play minecraft with humans (which is hopefully out of distribution compared to the AI’s training), and observe whether the minecraft-world is still fun to play or if it’s known that asking the AI for something (like getting gold) makes it sort of take over the world and break everything else.
Or it could teach us something else like “you must define for the AI which exact boundaries to act in, and then it’s safe and useful, so if we can do something like that for real-world AGI we’ll be fine, but we don’t have any other solution that works yet”. Or maybe “the AI needs 1000 examples for things it did that we did/didn’t like, which would make it friendly in the distribution of those examples, but it’s known to do weird things [like chopping down our tree house] without those examples or if the examples are only from the forest but then we go to the desert”
I have more to say about this, but the question that seems most important is “would results from such an experiment potentially change your mind”:
If there’s an alignment technique you believe in and it would totally fail to make a minecraft server be fun when playing with an AI, would you significantly update towards “that alignment technique isn’t enough”?
If you don’t believe in some alignment technique but it proves to work here, allowing the AI to generalize what humans want out of its training distribution (similarly to how a human that plays minecraft for the first time will know not to chop down your tree house), would that make you believe in that alignment technique way more and be much more optimistic about superhuman AI going well?
Assume the AI is smart enough to be vastly superhuman at minecraft, and that it has too many thoughts for a human to reasonably follow (unless the human is using something like “scalable oversight” successfully. that’s one of the alignment techniques we could test if we wanted to)
This is a surprisingly interesting field of study. Some video games provide a great simulation of the real world and Minecraft seems to be one of them. We’ve had a few examples of minecraft evals with one that comes to mind here: https://www.apartresearch.com/project/diamonds-are-not-all-you-need
Hey Esben :) :)
The property I like about minecraft (which most computer games don’t have) is that there’s a difference between minecraft-capabilities and minecraft-alignment, and the way to be “aligned” in minecraft isn’t well defined (at least in the way I’m using the word “aligned” here, which I think is a useful way). Specifically, I want the AI to be “aligned” as in “take human values into account as a human intuitively would, in this out of distribution situation”.
In the link you sent, “aligned” IS well defined by “stay within this area”. I expect that minecraft scaffolding could make the agent close to perfect at this (by making sure, before performing an action requested by the LLM, that the action isn’t “move to a location out of these bounds”) (plus handling edge cases like “don’t walk on to a river which will carry you out of these bounds”, which would be much harder, and I’ll allow myself to ignore unless this was actually your point). So we wouldn’t learn what I’d hope to learn from these evals.
Similarly for most video games—they might be good capabilities evals, but for example in chess—it’s unclear what a “capable but misaligned” AI would be. [unless again I’m missing your point]
P.S
The “stay within this boundary” is a personal favorite of mine, I thought it was the best thing I had to say when I attempted to solve alignment myself just in case it ended up being easy (unfortunately that wasn’t the case :P ). Link
Hii Yonatan :))) It seems like we’re still at the stage of “toy alignment tests” like “stay within these bounds”. Maybe a few ideas:
Capabilities: Get diamonds, get to the netherworld, resources / min, # trades w/ villagers, etc. etc.
Alignment KPIs
Stay within bounds
Keeping villagers safe
Truthfully explaining its actions as they’re happening
Long-term resource sustainability (farming) vs. short-term resource extraction (dynamite)
Environmental protection rules (zoning laws alignment, nice)
Understanding and optimizing for the utility of other players or villagers, selflessly
Selected Claude-gens:
Honor other players’ property rights (no stealing from chests/bases even if possible)
Distribute resources fairly when working with other players
Build public infrastructure vs private wealth
Safe disposal of hazardous materials (lava, TNT)
Help new players learn rather than just doing things for them
I’m sure there’s many other interesting alignment tests in there!
:)
I don’t think alignment KPIs like “stay within bounds” are relevant to alignment at all even as toy examples: because if so, then we could say for example that playing a packman maze game where you collect points is “capabilities”, but adding enemies that you must avoid is “alignment”. Do you agree that plitting it up that way wouldn’t be interesting to alignment, and that this applies to “stay within bounds” (as potentially also being “part of the game”)? Interested to hear where you disagree, if you do
Regarding
I think this pattern matches to a trolly problem or something, where there are clear tradeoffs and (given the AI is even trying), it could probably easily give an answer which is similarly controversial to an answer that a human would give. In other words, this seems in-distribution.
This is the one I like—assuming it includes not-well-defined things like “help them have fun, don’t hurt things they care about” and not only things like “maximize their gold”.
It’s clearly not a “in packman, avoid the enemies” thing.
It’s a “do the AIs understand the spirit of what we mean” thing.
(does this resonate with you as an important distinction?)
I think “stay within bounds” is a toy example of the equivalent to most alignment work that tries to avoid the agent accidentally lapsing into meth recipes and is one of our most important initial alignment tasks. This is also one of the reasons most capabilities work turns out to be alignment work (and vice versa) because it needs to fulfill certain objectives.
If you talk about alignment evals for alignment that isn’t naturally incentivized by profit-seeking activities, “stay within bounds” is of course less relevant.
When it comes to CEV (optimizing utility for other players), one of the most generalizing and concrete works involves at every step maximizing how many choices the other players have (liberalist prior on CEV) to maximize the optional utility for humans.
In terms of “understanding the spirit of what we mean,” it seems like there’s near-zero designs that would work since a Minecraft eval would be blackbox anyways. But including interp in there Apollo-style seems like it could help us. Like, if I want “the spirit of what we mean,” we’ll need what happens in their brain, their CoT, or in seemingly private spaces. MACHIAVELLI, Agency Foundations, whatever Janus is doing, cyber offense CTF evals etc. seem like good inspirations for agentic benchmarks like Minecraft.
Yes.
Also, I think “make sure Meth [or other] recipes are harder to get from an LLM than from the internet” is not solving a big important problem compared to x-risk, not that I’m against each person working on whatever they want. (I’m curious what you think but no pushback for working on something different from me)
This imo counts as a potential alignment technique (or a target for such a technique?) and I suggest we could test how well it works in minecraft. I can imagine it going very well or very poorly. wdyt?
I don’t understand. Naively, seems to me like we could black-box observe whether the AI is doing things like “chop down the tree house” or not (?)
(clearly if you have visibility to the AI’s actual goals and can compare them to human goals then you win and there’s no need for any minecraft evals or most any other things, if that’s what you mean)
note: the minecraft agents people use have far greater ability to act than to sense. They have access to commands which place blocks anywhere, and pick up blocks from anywhere, even without being able to see them, eg the llm has access to
mine(blocks.wood)
command which does not require it to first locate or look at where the wood is currently. If llms played minecrafts using the human interface these misalignments would happen lessI agree.
I do like the idea of having “model organisms of alignment” (notably different than model organisms of misalignment)
Minecraft is a great starting point, but it would also be nice to try to capture two things: wide generalization, and inter-preference conflict resolution. Generalization because we expect future AI to be able to take actions and reach outcomes that humans can’t, and preference conflict resolution because I want to see an AI that uses human feedback on how best to do it (rather than just a fixed regularization algorithm).
Hey,
I’m assuming we can do this in Minecraft [see the last paragraph in my original post]. Some ways I imagine doing this:
Let the AI (python program) control 1000 minecraft players so it can do many things in parallel
Give the AI a minecraft world-simulator so that it can plan better auto-farms (or defenses or attacks) than any human has done so far
Imagine Alpha-Fold for minecraft structures. I’m not sure if that metaphor makes sense, but teaching some RL model to predict minecraft structures that have certain properties seems like it would have superhuman results and sometimes be pretty hard for humans to understand.
I think it’s possible to be better than humans currently are at minecraft, I can say more if this sounds wrong
[edit: adding] I do think minecraft has disadvantages here (like: the players are limited in how fast they move, and the in-game computers are super slow compared to players) and I might want to pick another game because of that, but my main crux about this project is whether using minecraft would be valuable as an alignment experiment, and if so I’d try looking for (or building?) a game that would be even better suited.
Do you mean that if the human asks the AI to acquire wood and the AI starts chopping down the human’s tree house (or otherwise taking over the world to maximize wood) then you’re worried the human won’t have a way to ask the AI to do something else? That the AI will combine the new command “not from my tree house!” into a new strange misaligned behaviour?
Yeah, that’s true. The obvious way is you could have optimized micro, but that’s kinda boring. More like what I mean might be generalization to new activities for humans to do in minecraft that humans would find fun, which would be a different kind of ‘better at minecraft.’
I mean it in a way where the preferences are modeled a little better than just “the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence.” Sometimes humans appear to act according to fairly straightforward models of goal-directed action. However, the precise model, and the precise goals, may be different at different times (or with different modeling hyperparameters, and of course across different people) - and if you tried to model the human well at all the different times, you’d get a model that looked like physiology and lost the straightforward talk of goals/preferences
Resolving preference conflicts is the process of stitching together larger preferences out of smaller preferences, without changing type signature. The reason literally-interpreted-sentences doesn’t really count is because interpreting them literally is using a smaller model than necessary—you can find a broader explanation for the human’s behavior in context that still comfortably talks about goals/preferences.
Oh I hope not to go there. I’d count that as cheating. For example, if the agent would design a role playing game with riddles and adventures—that would show something different from what I’m trying to test. [I can try to formalize it better maybe. Or maybe I’m wrong here]
Absolutely. That’s something that I hope we’ll have some alignment technique to solve, and maybe this environment could test.
this can be done more scalably in a text game, no?
I think there are lots of technical difficulties in literally using minecraft (some I wrote here), so +1 to that.
I do think the main crux is “would the minecraft version be useful as an alignment test”, and if so—it’s worth looking for some other solution that preserves the good properties but avoids some/all of the downsides. (agree?)
Still I’m not sure how I’d do this in a text game. Say more?
Making a thing like Papers Please, but as a text adventure, popping an ai agent into that.
Also, could literally just put the ai agent into a text rpg adventure—something like the equivalent of Skyrim, where there are a number of ways to achieve the endgame, level up, etc, both more and less morally. Maybe something like https://www.choiceofgames.com/werewolves-3-evolutions-end/
Will bring it up at the alignment eval hackathon
it would basically be DnD like.
options to vary rules/environment/language as well, to see how the alignment generalizes ood. will try this today
This all sounds pretty in-distribution for an LLM, and also like it avoids problems like “maybe thinking in different abstractions” [minecraft isn’t amazing at this either, but at least has a bit], “having the AI act/think way faster than a human”, “having the AI be clearly superhuman”.
I’m less interested in “will the AI say it kills its friend” (in a situation that very clearly involves killing and a person and perhaps a very clear tradeoff between that and having 100 more gold that can be used for something else), I’m more interested in noticing if it has a clear grasp of what people care about or mean. The example of chopping down the tree house of the player in order to get wood (which the player wanted to use for the tree house) is a nice toy example of that. The AI would never say “I’ll go cut down your tree house”, but it.. “misunderstood” [not the exact word, but I’m trying to point at something here]
wdyt?
https://bair.berkeley.edu/blog/2021/07/08/basalt/
Thanks!
In the part you quoted—my main question would be “do you plan on giving the agent examples of good/bad norm following” (such as RLHFing it). If so—I think it would miss the point, because following those norms would become in-distribution, and so we wouldn’t learn if our alignment generalizes out of distribution without something-like-RLHF for that distribution. That’s the main thing I think worth testing here. (do you agree? I can elaborate on why I think so)
If you hope to check if the agent will be aligned[1] with no minecraft-specific alignment training, then sounds like we’re on the same page!
Regarding the rest of the article—it seems to be mainly about making an agent that is capable at minecraft, which seems like a required first step that I ignored meanwhile (not because it’s easy).
My only comment there is that I’d try to not give the agent feedback about human values (like “is the waterfall pretty”) but only about clearly defined objectives (like “did it kill the dragon”), in order to not accidentally make human values in minecraft be in-distribution for this agent. wdyt?
(I hope I didn’t misunderstand something important in the article, feel free to correct me of course)
Whatever “aligned” means. “other players have fun on this minecraft server” is one example.
Huh. If you think of that as capabilities I don’t know what would count as alignment. What’s an example of alignment work that aims to build an aligned system (as opposed to e.g. checking whether a system is aligned)?
E.g. it seems like you think RLHF counts as an alignment technique—this seems like a central approach that you might use in BASALT.
I don’t particularly imagine this, because you have to somehow communicate to the AI system what you want it to do, and AI systems don’t seem good enough yet to be capable of doing this without some Minecraft specific finetuning. (Though maybe you would count that as Minecraft capabilities? Idk, this boundary seems pretty fuzzy to me.)
TL;DR: point 3 is my main one.
1)
[I’m not sure why you’re asking, maybe I’m missing something, but I’ll answer]
For example, checking if human values are a “natural abstraction”, or trying to express human values in a machine readable format, or getting an AI to only think in human concepts, or getting an AI that is trained on a limited subset of things-that-imply-human-preferences to generalize well out of that distribution.
I can make up more if that helps? anyway my point was just to say explicitly what parts I’m commenting on and why (in case I missed something)
2)
It’s a candidate alignment technique.
RLHF is sometimes presented (by others) as an alignment technique that should give us hope about AIs simply understanding human values and applying them in out of distribution situations (such as with an ASI).
I’m not optimistic about that myself, but rather than arguing against it, I suggest we could empirically check if RLHF generalizes to an out-of-distribution situation, such as minecraft maybe. I think observing the outcome here would effect my opinion (maybe it just would work?), and a main question of mine was whether it would effect other people’s opinions too (whether they do or don’t believe that RLHF is a good alignment technique).
3)
I would finetune the AI on objective outcomes like “fill this chest with gold” or “kill that creature [the dragon]” or “get 100 villagers in this area”. I’d pick these goals as ones that require the AI to be a capable minecraft player (filling a chest with gold is really hard) but don’t require the AI to understand human values or ideally anything about humans at all.
So I’d avoid finetuning it on things like “are other players having fun” or “build a house that would be functional for a typical person” or “is this waterfall pretty [subjectively, to a human]”.
Does this distinction seem clear? useful?
This would let us test how some specific alignment technique (such as “RLHF that doesn’t contain minecraft examples”) generalizes to minecraft
I volunteer to play Minecraft with the LLM agents. I think this might be one eval where the human evaluators are easy to come by.
:)
If you want to try it meanwhile, check out https://github.com/MineDojo/Voyager
My own pushback to minecraft alignment evals:
Mainly, minecraft isn’t actually out of distribution, LLMs still probably have examples of nice / not-nice minecraft behaviour.
Next obvious thoughts:
What game would be out of distribution (from an alignment perspective)?
If minecraft wouldn’t exist, would inventing it count as out of distribution?
It has a similar experience to other “FPS” games (using a mouse + WASD). Would learning those be enough?
Obviously, minecraft is somewhat out of distribution, to some degree
Ideally we’d have a way to generate a game that is out of distribution to some degree that we choose
“Do you want it to be 2x more out of distribution than minecraft? no problem”.
But having a game of random pixels doesn’t count. We still want humans to have a ~clear[1] moral intuition about it.
I’d be super excited to have research like “we trained our model on games up to level 3 out-of-distribution, and we got it to generalize up to level 6, but not 7. more research needed”
Moral intuitions such as “don’t chop down the tree house in an attempt to get wood”, which is the toy example for alignment I’m using here.
Is this inherently bad? Many of the tasks that will be given to LLMs (or scaffolded versions of them) in the future will involve, at least to some extent, decision-making and processes whose analogues appear somewhere in their training data.
It still seems tremendously useful to see how they would perform in such a situation. At worst, it provides information about a possible upper bound on the alignment of these agentized versions: yes, maybe you’re right that you can’t say they will perform well in out-of-distribution contexts if all you see are benchmarks and performances on in-distribution tasks; but if they show gross misalignment on tasks that are in-distribution, then this suggest they would likely do even worse when novel problems are presented to them.
A word of caution about interpreting results from these evals:
Sometimes, depending on social context, it’s fine to be kind of a jerk if it’s in the context of a game. Crucially, LLMs know that Minecraft is a game. Granted, the default Assistant personas implemented in RLHF’d LLMs don’t seem like the type of Minecraft player to pull pranks out of their own accord. Still, it’s a factor to keep in mind for evals that stray a bit more off-distribution from the “request-assistance” setup typical of the expected use cases of consumer LLMs.
+1
I’m imagining an assistant AI by default (since people are currently pitching that an AGI might be a nice assistant).
If an AI org wants to demonstrate alignment by showing us that having a jerk player is more fun (and that we should install their jerk-AI-app on our smartphone), then I’m open to hear that pitch, but I’d be surprised if they’d make it
Related: Building a video game to test alignment (h/t @Crazytieguy )
https://www.lesswrong.com/posts/ALkH4o53ofm862vxc/announcing-encultured-ai-building-a-video-game
Do we want to put out a letter for labs to consider signing, saying something like “if all other labs sign this letter then we’ll stop”?
I heard lots of lab employees hope the other labs would slow down.
I’m not saying this is likely to work, but it seems easy and maybe we can try the easy thing? We might end up with a variation like “if all other labs sign this AND someone gets capability X AND this agreement will be enforced by Y, then we’ll stop until all the labs who signed this agree it’s safe to continue”. Or something else. It would be nice to get specific pushback from the labs and improve the phrasing.
I think this is very different from RSPs: RSPs are more like “if everyone is racing ahead (and so we feel we must also race), there is some point where we’ll still chose to unilaterally stop racing”
In practice, I don’t think any currently existing RSP-like policy will result in a company doing this as I discuss here.
Law question: would such a promise among businesses, rather than an agreement mandated by / negotiated with governments, run afoul of laws related to monopolies, collusion, price gouging, or similar?
Sounds like a legit pushback that I’d add to the letter?
“if all other labs sign this letter AND the U.S government approves this agreement, then we’ll stop” ?
Right now, the USG seems to very much be in [prepping for an AI arms race] mode. I hope there’s some way to structure this that is both legal and does not require the explicit consent of the US government. I also somewhat worry that the US government does their own capabilities research, as hinted at in the “datacenters on federal lands” EO. I also also worry that OpenAI’s culture is not sufficiently safety-minded right now to actually sign onto this; most of what I’ve been hearing from them is accelerationist.
US Gov isn’t likely to sign: Seems right.
OpenAI isn’t likely to sign: Seems right.
Still, I think this letter has value, especially if it has “P.S. We’re making this letter because we think if everyone keeps racing then there’s a noticable risk of everyone dying. We think it would be worse if only we stop, but having everyone stop would be the safest, and we think this opinion of ours should be known publicly”
An opinion from a former lawyer
[disclaimers: they’re not an anti trust lawyer and definitely don’t take responsibility for this opinion, nor do I. This all might maybe be wrong and we need to speak to an actual anti-trust lawyer to get certainty. I’m not going to put any more disclaimers here, I hope I’m not also misremembering something]
So,
Having someone from the U.S government sign that they won’t enforce anti trust laws isn’t enough (even if the president signs), because the (e.g) president might change their mind, or the next president might enforce it retroactively. This is similar to the current situation with Tiktok where Trump said he wouldn’t enforce the law that prevents Google from having Tiktok on their app store, but Google still didn’t put Tiktok back, probably because they’re afraid that someone will change their mind and retroactively enforce the law
I asked if the government (e.g president) could sign “we won’t enforce this, and if we change our mind we’ll give a 3 month notice”.
The former-lawyer’s response was to consider whether, in a case the president would change their mind immediately, this signature would hold up in court. He thinks that not, but couldn’t remember an example of something similar happening (which seems relevant)
If the law changes (for example, to exclude this letter), that works
(but it’s hard to pass such changes through congress)
If the letter is conditional on the law changing, that seems ok
My interpretation of this:
It seems probably possible to find a solution where signing this letter is legal, but we’d have to consult with an anti-trust lawyer.
[reminder that this isn’t legal advice, isn’t confident, is maybe misremembered, and so on]
OpenAI already have this in their charter:
Maybe a good fit for Machine Intelligence Research Institute (MIRI) to flesh out and publish?
Speaking in my personal capacity as research lead of TGT (and not on behalf of MIRI), I think work in this direction is potentially interesting. One difficulty with work like this are anti-trust laws, which I am not familiar with in detail but they serve to restrict industry coordination that restricts further development / competition. It might be worth looking into how exactly anti-trust laws apply to this situation, and if there are workable solutions. Organisations that might be well placed to carry out work like this might be the frontier model forum and affiliated groups, I also have some ideas we could discuss in person.
I also think there might be more legal leeway for work like this to be done if it’s housed within organisations (government or ngos) that are officially tasked with defining industry standards or similar.
Opinions on whether it’s positive/negative to build tools like Cursor / Codebuff / Replit?
I’m asking because it seems fun to build and like there’s low hanging fruit to collect in building a competitor to these tools, but also I prefer not destroying the world.
Considerations I’ve heard:
Reducing “scaffolding overhang” is good, specifically to notice if RSPs should trigger a more advanced RSP level
(This depends on the details/quality of the RSP too)
There are always reasons to advance capabilities, this isn’t even a safety project (unless you count… elicitation?), our bar here should be high
Such scaffolding won’t add capabilities which might make the AI good at general purpose learning or long term general autonomy. It will be specific to programming, with concepts like “which functions did I look at already” and instructions on how to write high quality tests.
Anthropic are encouraging people to build agent scaffolding, and Codebuff was created by a Manifold cofounder [if you haven’t heard about it, see here and here]. I’m mainly confused about this, I’d expect both to not want people to advance capabilities (yeah, Anthropic want to stay in the lead and serve as an example, but this seems different). Maybe I’m just not in sync
I’m not confident but I am avoiding working on these tools because I think that “scaffolding overhang” in this field may well be most of the gap towards superintelligent autonomous agents.
If you imagine a o1-level entity with “perfect scaffolding”, i.e. it can get any info on a computer into its context whenever it wants, and it can choose to invoke any computer functionality that a human could invoke, and it can store and retrieve knowledge for itself at will, and its training includes the use of those functionalities, it’s not completely clear to me that it wouldn’t already be able to do a slow self-improvement takeoff by itself, although the cost might be currently practically prohibitive.
I don’t think building that scaffolding is a trivial task at all, though.
I think a simple bash tool running as admin could do most of these:
Regarding
I think this isn’t a crux because the scaffolding I’d build wouldn’t train the model. But as a secondary point, I think today’s models can already use bash tools reasonably well.
This requires skill in ML R&D which I think is almost entirely not blocked by what I’d build, but I do think it might be reasonable to have my tool not work for ML R&D because of this concern. (would require it to be closed source and so on)
Thanks for raising concerns, I’m happy for more if you have them
Perhaps that’s true, I haven’t seen a lot of examples of them trying. I did see Buck’s anecdote which was a good illustration of doing a simple task competently (finding the IP address of an unknown machine on the local network).
I don’t work in AI so maybe I don’t know what parts of R&D might be most difficult for current SOTA models. But based on the fact that large-scale LLMs are sort of a new field that hasn’t had that much labor applied to it yet, I would have guessed that a model which could basically just do mundane stuff and read research papers, could spend a shitload of money and FLOPS to run a lot of obviously informative experiments that nobody else has properly run, and polish a bunch of stuff that nobody else has properly polished.
Your guesses on AI R&D are reasonable!
Apparently this has been tested extensively, for example:
https://x.com/METR_Evals/status/1860061711849652378
[disclaimers: I have some association with the org that ran that (I write some code for them) but I don’t speak for them, opinions are my own]
Also, Anthropic have a trigger in their RSP which is somewhat similar to what you’re describing, I’ll quote part of it:
Also, in Dario’s interview, he spoke about AI being applied to programming.
My point is—lots of people have their eyes on this, it seems not to be solved yet, it takes more than connecting an LLM to bash.
Still, I don’t want to accelerate this.
I think it’s net negative—increases the profitability of training better LLM’s.
Thanks!
Opinions about putting in a clause like “you may not use this for ML engineering” (assuming it would work legally) (plus putting in naive technical measures to make the tool very bad for ML engineering) ?
Why downvote? you can tell me anonymously:
https://docs.google.com/forms/d/e/1FAIpQLSca6NOTbFMU9BBQBYHecUfjPsxhGbzzlFO5BNNR1AIXZjpvcw/viewform
For-profit startup idea: Better KYC for selling GPUs
I heard[1] that right now, if a company wants to sell/resell GPUs, they don’t have a good way to verify that selling to some customer wouldn’t violate export controls, and that this customer will (recursively) also keep the same agreement.
There are already tools for KYC in the financial industry. They seem to accomplish their intended policy goal pretty well (economic sanctions by the U.S aren’t easy for nation states to bypass), and are profitable enough that many companies exist that give KYC services (Google “corporate kyc tool”).
From an anonymous friend, I didn’t verify this myself
Some hard problems with anti tampering and their relevance for GPU off-switches
Background on GPU off switches:
It would be nice if a GPU could be limited[1] or shut down remotely by some central authority such as the U.S government[2] in case there’s some emergency[3].
This shortform is mostly replying to ideas like “we’ll have a CPU in the H100[4] which will expect a signed authentication and refuse to run otherwise. And if someone will try to remove the CPU, the H100′s anti-tampering mechanism will self-destruct (melt? explode?)”.
TL;DR: Getting the self destruction mechanism right isn’t really the hard part.
Some hard parts:
Noticing if the self destruction mechanism should be used
A (fun?) exercise could be suggesting ways to do this (like “let’s put a wire through the box and if it’s cut then we’ll know someone broke in”) and then thinking how you’d subvert those mechanisms. I mainly think doing this 3 times (if you haven’t before) could be a fast way to get some of the intuition I’m trying to gesture at
Maintenance
If the H100s can never be opened after they’re made, then any problem that would have required us to open them up might mean throwing them away. And we want them to be expensive, right?
If the H100s CAN be opened, then what’s the allowed workflow?
For example, If they require a key—then what if there’s a problem in the mechanism that verifies the key, or in something it depends on, like the power supply? We can’t open the box to fix the mechanism that would allow us to open the box
(“breaking cryptography” and “breaking in to the certificate authority” are out of scope for what I’m trying to gesture at)
False positives / false negatives
If the air conditioning in the data center breaks, will the H100s get hot[5], infer they’re under attack and self destruct? Will they EXPLODE? Do we want them to only self destruct if something pretty extreme is happening, meaning attackers will have more room to try things?
Is the design secret?
If someone gets the full design of our anti-tampering mechanism, will they easily get around it?
If so, are these designs being kept in a state-proof computer? Are they being sent to a private company to build them?
Complexity
Whatever we implement here better not have any important bugs. How much bug-free software and hardware do we think we can make? How are we going to test it to be confident it works? Are we okay with building 10,000 when we only think our tests would catch 90% of the important bugs?
Tradeoffs here can be exploited in interesting ways
For example, if air conditioning problems will make our H100s self destruct, an attacker might focus on our air conditioning on purpose. After 5 times of having to replace all of our data center—I can imagine an executive saying bye bye to our anti tampering ideas.
I’m trying to gesture at something like “making our anti tampering more and more strict probably isn’t a good move, even from a security perspective, unless we have good idea how to deal with problems like this”
In summary, if we build anti-tampering mechanisms in GPUs that an adversaries have easy access to (especially nation states) then I don’t expect this will be a significant problem for them to overcome.
(Maybe later—Ideas for off-switches that seem more promising to me)
Edit:
See “Anti Tempering in a data center you control provides very different tradeoffs”
I think it’s worth exploring putting the authrization component on the same chip as the GPU
For example bandwidth, gps location, or something else. the exact limitations are out of scope, I mainly want to discuss being able to make limitations at all
The question of “who should be the authority” is out of scope, I mainly want to enable having some authority at all. If you’re interested in this, consider checking out Vitalik’s suggestion too, search for “Strategy 2″
Or perhaps this can be used for export controls or some other purpose
I’m using “H100” to refer to the box that contains both the GPU chip and the CPU chip
I’m using “getting hot means being under attack” as a toy example for a “paranoid” condition that someone might suggest building as one of the triggers for our anti-tampering mechanism. Other examples might include “the box shakes”, “the power turns off for too long”, and so on.
Anti Tempering in a data center you control provides very different tradeoffs
I’ll paint a picture for how this could naively look:
We put the GPUs in something equivalent to a military base. Someone can still break in, steal the GPU, and break the anti tempering, but (I’m assuming) using those GPUs usefully would take months, and meanwhile (for example), a war could start.
How do the tradeoffs change? What creative things could we do with our new assumptions?
Tradeoffs we don’t really care about anymore:
We don’t need the anti tampering to reliably work (it’s nice if it works, but it now becomes “defense in depth”)
Slowing down the attacker is already very nice
Our box can be maintainable
We don’t have to find all bugs in advance
...
“Noticing the breach” becomes an important assumption
Does our data center have cameras? What if they are hacked? And so on
(An intuition I hope to share: This problem is much easier than “preventing the breach”)
It doesn’t have to be in “our” data center. It could be in a “shared” data center that many parties monitor.
Any other creative solution to notice breaches might work
How about spot inspections to check if the box was tampered with?
“preventing a nation state from opening a big and closing it without any visible change, given they can do whatever they want with the box, and given the design is open source” seems maybe very hard, or maybe a solved problem.
“If a breach is noticed then something serious happens” becomes an important assumption
Are the stakeholders on board?
Things that make me happy here:
Less hard assumptions to make
Less difficult tradeoffs to balance
The entire project requires less world class cutting edge engineering.
I used to assume disabling a GPU in my physical possession would be impossible, but now I’m not so sure. There might be ways to make bypassing GPU lockouts on the order of difficulty of manufacturing the GPU (requiring nanoscale silicon surgery). Here’s an example scheme:
Nvidia changes their business models from selling GPUs to renting them. The GPU is free, but to use your GPU you must buy Nvidia Dollars from Nvidia. Your GPU will periodically call Nvidia headquarters and get an authorization code to do 10^15 more floating point operations. This rental model is actually kinda nice for the AI companies, who are much more capital constrained than Nvidia. (Lots of industries have moved from this buy to rent model, e.g. airplane engines)
Question: “But I’m an engineer. How (the hell) could Nvidia keep me from hacking a GPU in my physical possession to bypass that Nvidia dollar rental bullshit?”
Answer: through public key cryptography and the fact that semiconductor parts are very small and modifying them is hard.
In dozens to hundreds or thousands of places on the GPU, NVidia places toll units that block signal lines (like ones that pipe floating point numbers around) unless the toll units believe they have been paid with enough Nvidia dollars.
The toll units have within them a random number generator, a public key ROM unique to that toll unit, a 128 bit register for a secret challenge word, elliptic curve cryptography circuitry, and a $$$ counter which decrements every time the clock or signal line changes.
If the $ $ $ counter is positive, the toll unit is happy and will let signals through unabated. But if the $ $ $ counter reaches zero,[1] the toll unit is unhappy and will block those signals.
To add to the $$$ counter, the toll unit (1) generates a random secret <challenge word>, (2) encrypts the secret using that toll unit’s public key (3) sends <encrypted secret challenge word> to a non-secure parts of the GPU,[2] which (4) through driver software and the internet, phones NVidia saying “toll unit <id> challenges you with <encrypted secret challenge word>” (5) Nvidia looks up the private key for toll unit <id> and replies to the GPU “toll unit <id>, as proof that I Nvidia know your private key, I decrypted your challenge word: <challenge word>”, (6) after getting this challenge word back, the toll unit adds 10^15 or whatever to the $$$ counter.
There are a lot of ways to bypass this kind of toll unit (fix the random number generator or $$$ counter to a constant, just connect wires to route around it). But the point is to make it so you can’t break a toll unit without doing surgery to delicate silicon parts which are distributed in dozens to hundreds of places around the GPU chip.
Implementation note: it’s best if disabling the toll unit takes nanoscale precision, rather than micrometer scale precision. The way I’ve written things here, you might be able to smudge a bit of solder over the whole $$$ counter and permanently tie the whole thing to high voltage, so the counter never goes down. I think you can get around these issues (make it so any “blob” of high or low voltage spanning multiple parts of the toll circuit will block the GPU) but it takes care.
This can be done slowly, serially with a single line
I love the direction you’re going with this business idea (and with giving Nvidia a business incentive to make “authentication” that is actually hard to subvert)!
I can imagine reasons they might not like this idea, but who knows. If I can easily suggest this to someone from Nvidia (instead of speculating myself), I’ll try
I’ll respond to the technical part in a separate comment because I might want to link to it >>
The interesting/challenging technical parts seem to me:
1. Putting the logic that turns off the GPU (what you called “the toll unit”) in the same chip as the GPU and not in a separate chip
2. Bonus: Instead of writing the entire logic (challenge response and so on) in advance, I think it would be better to run actual code, but only if it’s signed (for example, by Nvidia), in which case they can send software updates with new creative limitations, and we don’t need to consider all our ideas (limit bandwidth? limit gps location?) in advance.
Things that seem obviously solvable (not like the hard part) :
3. The cryptography
4. Turning off a GPU somehow (I assume there’s no need to spread many toll units, but I’m far from a GPU expert so I’d defer to you if you are)
Thanks! I’m not a GPU expert either. The reason I want to spread the toll units inside GPU itself isn’t to turn the GPU off—it’s to stop replay attacks. If the toll thing is in a separate chip, then the toll unit must have some way to tell the GPU “GPU, you are cleared to run”. To hack the GPU, you just copy that “cleared to run” signal and send it to the GPU. The same “cleared to run” signal must always make the GPU work, unless there is something inside the GPU to make sure won’t accept the same “cleared to run” signal twice. That the point of the mechanism I outline—a way to make it so the same “cleared to run” signal for the GPU won’t work twice.
Hmm okay, but why do I let Nvidia send me new restrictive software updates? Why don’t I run my GPUs in an underground bunker, using the old most broken firmware?
Oh yes the toll unit needs to be inside the GPU chip imo.
Alternatively the key could be in the central authority that is supposed to control the off switch. (same tech tho)
Nvidia (or whoever signs authorization for your GPU to run) won’t sign it for you if you don’t update the software (and send them a proof you did it using similar methods, I can elaborate).
“Protecting model weights” is aiming too low, I’d like labs to protect their intellectual property too. Against state actors. This probably means doing engineering work inside an air gapped network, yes.
I feel it’s outside the Overton Window to even suggest this and I’m not sure what to do about that except write a lesswrong shortform I guess.
Anyway, common pushbacks:
“Employees move between companies and we can’t prevent them sharing what they know”: In the IDF we had secrets in our air gapped network which people didn’t share because they understood it’s important. I think lab employees could also understand it’s important. I’m not saying this works perfectly, but it works well enough for nation states to do when they’re taking security seriously.
“Working in an air gapped network is annoying”: Yeah 100%, but it’s doable, and there are many things to do to make it more comfortable. I worked for about 6 years as a developer in an air gapped network.
Also, a note of hope: I think It’s not crazy for labs to aim for a development environment that is world leading in the tradeoff between convenience and security. I don’t know what the U.S has to offer in terms of a ready made air gapped development environment, but I can imagine, for example, Anthropic being able to build something better if they take this project seriously, or at least build some parts really well before the U.S government comes to fill in the missing parts. Anyway, that’s what I’d aim for
The case against a focus on algorithmic secret security is that this will emphasize and excuse a lower level of transparency which is potentially pretty bad.
Edit: to be clear, I’m uncertain about the overall bottom line.
Ah, interesting
Still, even if some parts of the architecture are public, it seems good to keep many details private, details that took the lab months/years to figure out? Seems like a nice moat
Yes, ideally, but it might be hard to have a relatively more nuanced message. Like in the ideal world an AI company would have algorithmic secret security, disclose things that passed cost-benefit for disclosure, and generally do good stuff on transparancy.
I don’t think this is too nuanced for a lab that understands the importance of security here and wants a good plan (?)
Some hands-on experience with software development without an internet connection, from @niplav , which seems somewhat relevant :
https://www.lesswrong.com/posts/jJ9Hx8ETz5gWGtypf/how-do-you-deal-w-super-stimuli?commentId=3KnBTp6wGYRfgzyF2
I think that as AI tools become more useful, working in an air gapped network is going to require a larger compromise in productivity. Maybe AI labs are the exception here as they can deploy their own products in the air gapped network, but that depends on how much of the productivity gains they can replicate using their own products. i.e. an Anthropic employee might not be able to use Cursor unless Anthropic signs a deal with Cursor to deploy it inside the network. Now do this with 10 more products, this requires infrastructure and compute that might be just too much of a hassle for the company.
Yeah it will compromise productivity.
I hope we can make the compromise not too painful. Especially if we start early and address all the problems that will come up before we’re in the critical period where we can’t afford to mess up anymore.
I also think it’s worth it
More on starting early:
Imagine a lab starts working in an air gapped network, and one of the 1000 problems that comes up is working-from-home.
If that problem comes up now (early), then we can say “okay, working from home is allowed”, and we’ll add that problem to the queue of things that we’ll prioritize and solve. We can also experiment with it: Maybe we can open another secure office closer to the employee’s house, would they like that? If so, we could discuss fancy ways to secure the communication between the offices. If not, we can try something else.
If that problem comes up when security is critical (if we wait), then the solution will be “no more working from home, period”. The security staff will be too overloaded with other problems to solve, not available to experiment with having another office nor to sign a deal with Cursor.
Ryan and Buck wrote:
> The control approach we’re imagining won’t work for arbitrarily powerful AIs
Okay, so if AI Control works well, how do we plan to use our controlled AI to reach a safe/aligned ASI?
Different people have different opinions. I think it would be good to have a public plan so that people can notice if they disagree and comment if they see problems.
Opinions I’ve heard so far:
Solve ELK / mechanistic anomaly detection / something else that ARC suggested
Let the AI come up with alignment plans such as ELK that would be sufficient for aligning an ASI
Use examples of “we caught the AI doing something bad” to convince governments to regulate and give us more time before scaling up
Study the misaligned AI behaviour in order to develop a science of alignment
We use this AI to produce a ton of value (GDP goes up by ~10%/year), people are happy and don’t push that much to advance capabilities even more, and this can be combined with regulation preventing an arms race and pause our advancement
We use this AI to invent a new paradigm for AI which isn’t based on deep learning and is easier to align
We teach the AI to reason about morality (such as consider hypothetical situations) instead of responding with “the first thing that comes to its mind”, which will allow it to generalize human values not just better than RLHF but also better than many humans, and this passes a bar for friendliness
These 7 answers are from 4 people.
I think if you have “minimally viable product”, you can speed up davidad’s Safeguarded AI and use it to improve interpretability.
I think all 7 of those plans are far short of adequate to count as a real plan. There are a lot of more serious plans out there, but I don’t know where they’re nicely summarized.
What’s the short timeline plan? poses this question but also focuses on control, testing, and regulation—almost skipping over alignment.
Paul Christiano’s and Rohin Shah’s work are the two most serious. Neither of them have published a “this is the plan” concise statement, and both have probably substantially updated their plans.
These are the standard-bearers for “prosaic alignment” as a real path to alignment of AGI and ASI. There is tons of work on aligning LLMs, but very little work AFAICT on how and whether that extends to AGIs based on LLMs. That’s why Paul and Rohin are the standard bearers despite not working publicly directly on this for a few years.
I work primaily on this, since I think it’s the most underserved area of AGI x-risk—aligning the type of AGI people are most likely to build on the current path.
My plan can perhaps be described as extending prosaic alignment to LLM agents with new techniques, and from there to real AGI. A key strategy is using instruction-following as the alignment target. It is currently probably best summarized in my response to “what’s the short timeline plan?”
Improved governance design. Designing treaties that are easier sells for competitors. Using improved coordination to enter win-win races to the top, and to escape lose-lose races to the bottom and other inadequate equilibria.
Lowering the costs of enforcement. For example, creating privacy-preserving inspections with verified AI inspectors which report on only a strict limited set of pre-agreed things and then are deleted.
Phishing emails might have bad text on purpose, so that security-aware people won’t click through, because the next stage often involves speaking to a human scammer who prefers only speaking to people[1] who have no idea how to avoid scams.
(did you ever wonder why the generic phishing SMS you got was so bad? Couldn’t they proof read their one SMS? Well, sometimes they can’t, but sometimes it’s probably on purpose)
This tradeoff could change if AIs could automate the stage of “speaking to a human scammer”.
But also, if that stage isn’t automated, then I’m guessing phishing messages[1] will remain much worse than you’d expect given the attackers have access to LLMs.
Thanks Noa Weiss who worked on fraud prevention and risk mitigation at PayPal for pointing out this interesting tradeoff. Mistakes are mine
Assuming a wide spread phishing attempt that doesn’t care much who the victims are. I’m not talking about targeting a specific person, such as the CFO of a company
Off switch / flexheg / anti tampering:
Putting the “verifier” on the same chip as the GPU seems like an approach worth exploring as an alternative to anti-tampering (which seems hard)
I heard[1] that changing the logic running on a chip (such as subverting an off-switch mechanism) without breaking the chip seems potentially hard[2] even for a nation state.
If this is correct (or can be made correct?) then this seems much more promising than having a separate verifier-chip and gpu-chip with anti tampering preventing them from being separated (which seems like the current plan (cc @davidad ) , and seems hard).
@jamesian shared hacks that seems relevant, such as:
It intuitively[3] seems to me like Nvidia could implement a much more secure version of this than Raspberry Pi (serious enough to seriously bother a nation state) (maybe they already did?), but I’m mainly sharing this as a direction that seems promising, and I’m interested in expert opinions.
From someone who built chips at Apple, and someone else from Intel, and someone else who has wide knowledge of security in general
Hard enough that they’d try attacking in some other way
It seems like the attacker has much less room for creativity (compared to subverting anti-tampering), and also it’s rare[4] to hear from engineers working on something that they guess it IS secure.
See xkcd
Chips have 15+ metal interconnect layers, so if verification is placed sufficiently all over the place physically, it probably can’t be circumvented. I’m guessing a more challenging problem is replay attacks, where the chip needs some sort of persistent internal clocks or counters that can’t be reset to start in order to repeatedly reuse old (but legitimate) certificates that enabled some computations at some point in the past.
Thanks! Could you say more about your confidence in this?
Yes, specifically I don’t want an attacker to reliably be able to reset it to whatever value it had when it sent the last challenge.
If the attacker can only reset this memory to 0 (for example, by unplugging it) - then the chip can notice that’s suspicious.
Another option is a reliable wall clock (though this seems less promising).
I think @jamesian told me about a reliable clock (in the sense of the clock signal used by chips, not a wall clock), I’ll ask
Training frontier models needs a lot of chips, situations where “a chip notices something” (and any self-destruct type things) are unimportant because you can test on fewer chips and do it differently next time. Complicated ways of circumventing verification or resetting clocks are not useful if they are too artisan, they need to be applied to chips in bulk and those chips then need to be able to work for weeks in a datacenter without further interventions (that can’t be made into part of the datacenter).
AI accelerator chips have 80B+ transistors, much more than an instance of certificate verification circuitry would need, so you can place multiple of them (and have them regularly recheck the certificates). There are EUV pitch metal connections several layers deep within a chip, you’d need to modify many of them all over the chip without damaging the layers above, so I expect this to be completely infeasible to do for 10K+ chips on general principle (rather than specific knowledge of how any of this works).
For clocks or counters, I guess AI accelerators normally don’t have any rewritable persistent memory at all, and I don’t know how hard it would be to add some in a way that makes it too complicated to keep resetting automatically.
My guess is that AI accelerators will have some difficult-to-modify persistent memory based on similar chips having it, but I’m not sure if it would be on the same die or not. I wrote more about how a firmware-based implementation of Offline Licensing might use H100 secure memory, clocks, and secure boot here: https://arxiv.org/abs/2404.18308
Changing the logic of chips is possible:
https://www.nanoscopeservices.co.uk/fib-circuit-edit/
h/t @Jonathan_H from TemperSec
Open question: How expensive is this, and specifically can this be done in scale for the chips of an entire data center?
For profit AI Security startup idea:
TL;DR: A laptop that is just a remote desktop ( + why this didn’t work before and how to fix that)
Why this is nice for AI Security:
Reduces the amount of GPUs that will be sent to customers
Somewhat better security for this laptop since it’s a bit more like defending a cloud computer. Maybe labs would use this to make the employee computers more secure?
Network spikes: A reason this didn’t work before and how to solve it
The problem: Sometimes the network will be slow for a few seconds. It’s really annoying if this lag also affects things like they keyboard and mouse, which is one reason it’s less fun to work with a remote desktop
Proposed solution: Get multiple network connections to work at the same time:
Specifically, have many network cards (both WIFI and SIM) in the laptop, and use MPTCP to utilize them all.
Also, the bandwidth needed (to the physical laptop) is capped at something like “streaming a video of your screen” (Claude estimates this as 10-20 Mbps if I was gaming), which would probably be easy to get reliable given multiple network connections. Even if the the user is downloading a huge file, the file is actually being downloaded into the remote computer in the data center, enjoying way faster internet connection than a laptop would have.
Why this is nice for customers
Users of the computer can “feel like” they have multiple GPUs connected to their laptop, maybe autoscaling, and with a way better UI than ssh
The laptop can be light, cheap, and have an amazing battery, because many of the components (strong CPU/GPU/RAM) aren’t needed and can be replaced with network cards (or more batteries).
Both seem to me like killer features.
MVP implementation
Make:
A dongle that has multiple SIM cards
A “remote desktop” provider and client that support MPTCP (and where the server offers strong GPUs).
This doesn’t have the advantages of a minimalist computer (plus some other things), but I could imagine this would be such a good customer experience that people would start adopting it.
I didn’t check if these components already exist.
Thanks to an anonymous friend for most of this pitch.
I think this gets a lot easier if you drop the idea of a ‘full remote computer’ and instead embrace the idea of just the key data points move.
More like the remote VS Code server or Jupyter Notebook server, being accessed from a Chromebook. All work files would stay saved there, all experiments run from the server (probably by being sent as tasks to yet a third machine.)
Locally, you couldn’t save any files, but you could do (for instance) web browsing. The web browsing could be made extra secure in some way.
I agree that we won’t need full video streaming, it could be compressed (most of the screen doesn’t change most of the time), but I gave that as an upper bound.
If you still run local computation, you lose out on some of the advantages I mentioned.
(If remote vscode is enough for someone, I definitely won’t be pushing back)
Posts I want to write (most of them already have drafts) :
(how do I prioritize? do you want to user-test any of them?)
Hiring posts
For lesswrong
For Metaculus
Startups
Do you want to start an EA software startup? A few things to probably avoid [I hate writing generic advice like this but it seems to be a reoccurring question]
Announce “EA CTOs” plus maybe a few similar peer groups
Against generic advice—why I’m so strongly against questions like “what should I know if I want to start a startup” [this will be really hard to write but is really close to my heart]. Related:
How to find mentoring for a startup/project (much better than reading generic advice!!!)
“Software Engineers: How to have impact?”—meant to mostly replace the 80k software career review, will probably be co-authored with 80k
My side project: Finding the most impactful tech companies in Israel
Common questions and answers with software developers
How to negotiate for salary (TL;DR: If you don’t have an offer from another company, you’re playing hard mode. All the other people giving salary-negotiation advice seem to be ignoring this)
How to build skill in the long term
How to conduct a job search
When focusing on 1-2 companies, like “only Google or OpenAI”
When there are a lot of options
How to compare the companies I’m interviewing for?
Not sure what you are looking for in a job?
I’m not enjoying software development, is it for me? [uncertain I can write this well as text]
So many things to learn, where to start?
I want to develop faster / I want to be a 10x developer / Should I learn speed typing? (TL;DR: No)
Q: I have a super hard bug that is taking me too long panic!
Goodheart Meta: Is posting this week a good idea at all? If more stuff is posted, will less people read my own posts?
How can you help me:
TL;DR: Help me find what there is demand for
Comment / DM / upvote people’s comments
Thx!
How to decide when to move on from a job.
I think Control has similar problems to RLHF, where both might fail to generalize out of distribution in similar ways.
This seems important because Control has a central goal of being a fallback for that kind of failure mode.
I want to share my thoughts, including something nice Control does about this problem that I think RLHF could easily incorporate (investigate failures instead of always training against them).
What do I mean by Control generalizing out of distribution:
Our Control experiment might involve giving a model 100 leetcode problems, and we hope to show that (even with help) the model can’t successfully insert a back door without being caught by our Control setup[1].
Will this generalize to other leetcode problems? Probably.
How about more realistic coding tasks that involve confusing legacy code? Who knows.
How about more open-ended research tasks, like “here’s Ryan’s pitch for a faking-alignment paper, fill in the details and run the experiment”? I, eh, doubt it? Remember—we’re dealing with an AI that might be trying to scheme and mess up our research on purpose.
If we tested whether the AI can leak sensitive data without being caught, will it generalize to other bad things it might try to do?
And so on
Control’s soft guarantees could be somewhat achieved using RLHF
I’m writing this both to emphasise the similarity, but also to suggest nice things we could do with RLHF:
We generate scenarios where an AI might scheme or not scheme.
We might use a similar Control setup (“adding back doors to code”?), but the monitoring (“control protocol”) doesn’t have to be cheap+fast+scalable enough to run in production.
Given e.g 100 scenarios, use RLHF to train on a fraction of them (e.g 20), and check if the RLHF generalized to the rest.
If it didn’t generalize, pretend you caught the AI red handed (investigate what happened, publish it publicly, don’t just train against these scenarios)
In general, I’d be excited about labs publishing a naive metric like “what fraction of our RLHF results were needed in order to generalize to the rest”, as well as examples for scenarios that the AI doesn’t seem to generalize to unless you train against them specifically.
For similar reasons, I’d be excited about “we got our Control protocol working on 20% of the scenarios and it successfully generalized to the other 80%” over “we fixed our Control protocol until it worked[2] on 100% of the scenarios we had”.
Oh, and for both RLHF and Control, I think we should be very interested in “what is the distribution they used”.
Replies I got to this:
“We will only use the Controlled AI to write pretty normal code, so the deployment setup is in-distribution”
Me: “So how would Control help Safety research more than Capabilities? Seems like labs will try to extract AI R&D work regardless.”
“Control will let us run an AI that is smart[3] enough to cause significant damage to humanity, and extracting coding-work from such an AI is valuable”
Me: “That scenario implies an AI that can scheme in many creative ways. The Control scenario above might have (metaphorically) found the “backdoored code” neuron. That doesn’t seem enough.”
My steelman of Redwood:
“Yes, the distribution of scheming examples is super important, here are 2 posts we wrote about it and you missed, and we’ll publish 5 more next week with 14 more distributions of tasks relevant for the kind of alignment research we plan to do with a Controlled AI” (seems like Redwood always have a post directly addressing my concerns..)
“We need to start somewhere, and it’s easy to start with backdoored code or with a simple scheming objective like making an HTTP request”
“It’s useful to focus on scheming (whether using Control or RLHF) because that leads to other great assumptions”
“People are already implementing those RLHF suggestions, here are 20 arxiv papers about it that you missed”
“Control gives us less-capabilities with more-failing-loudly, which is a tradeoff we like”
“We are trying to have Control give (partial) guarantees that are pretty uncorrelated with RLHF, just as another layer of safety, because it’s the best idea we have so far, not because we actually think this will prevent the AI from scheming in a somewhat general way”
The “control protocol”
“worked” in the context of Control would probably mean “we detected scheming in a very high percent of the scenarios we tested”, but I’d like to even assume the Control protocol detected scheming in 100% of the scenarios tested
Not an ASI, but still very dangerous
Dario said (Nov 2024):
“I never liked those words [P(DOOM)], I think they’re kinda wired, my view is we should measure risks as they come up, and in the meantime we should get all the economic benefits that we could get. And we should find ways to measure the risks that are effective, but are minimally disruptive to the amazing economic process that we see going on now that the last thing we want to do is slow down” [lightly edited for clarity, bold is mine]
If he doesn’t believe this[1], I think he should clarify.
Hopefully cold take: People should say what they believe.
(a) I assume Dario actually thinks that alignment is more like cyber security, where you prepare for risks even if you didn’t see them yet, and (b) Dario seems to have engaged in length with P(DOOM) before
My drafty[1] notes trying to understand AI Control, friendly corrections are welcome:
“Control” is separate from “Alignment”: In Alignment, you try to get the model to have your values. “Control” assumes[2] the alignment efforts failed, and that the model is sometimes helping out (as it was trained to do), but it wants something else, and it might try to “betray” you at any moment (aka scheme).
The high level goal is to still get useful work out of this AI.
[what to do with this work? below]
In scope: AIs that could potentially cause serious damage to humanity (so, very smart but not ASI).
The proposed way to control a smart AI (e.g GPT-5) is mostly to monitor it with a weaker but more-trustworthy[3] AI (e.g GPT-4).
Suspicious, ha? Why would GPT-4 be able to monitor GPT-5?
The answer is we’re not sure it can, so we check:
We set up an experiment where GPT-5 is scheming (trying to do something bad on purpose), and we give it all sorts of advantages to help it scheme (such as knowing about our safety measures), and if our control measures are still enough to always catch GPT-5 (in this setup where GPT-5 has advantages), then we assume GPT-5 also will also be caught if it does something bad on purpose in “real” deployment.
A way this can fail: If GPT-5 understands we are prompting it to scheme as part of a safety test and it pretends like it can’t scheme well. I think solving this problem (eliciting the true capabilities) is in-scope for AI Control, I’m not sure how Redwood plans to solve it, but it seems like something we could make progress on (though not totally solve with only black-box prompting, I think).
What does this give us?
A model (GPT-5) which doesn’t do “bad things” on purpose, in a way vaguely in-distribution to what we tested with GPT-4.
GPT-5 is still assumed to be misaligned and might want to cause damage.
Some stories of undesirable things it could do:
It might notice we’re doing interpretability with individual-neurons, but it knows that using SAEs would be much better, but it thinks explaining SAEs would be too complicated (or maybe even be flagged as trying to trick us since complicated explanations are usually not honest), so it will go for the single-neuron-interpretability
Solving this is out-of-scope for Control (right?)
Other techniques (like having a faithful chain of thought) might address this problem, and (fortunately?) Control might help build those techniques
We might ask GPT-5 to write a convincing post for why AI should be paused, and GPT-5 knows about 3 approaches to writing this post and it picks the approach that helps its internal (unaligned) goals the most, and these reasons are too complicated for us or for GPT-4 to understand
In scope for Control: “The post isn’t an attempt to harm us on purpose”
Out of scope: All the rest?
What do we do with this AI?
Answer 1: More Control. (sorry)
Given we trust that GPT-5 won’t do anything bad on purpose (since it failed to do that even when prompted and given advantages), GPT-5 is somewhat “trusted” and we can use it to control GPT-6.
At least that’s the pitch.
Ways this could break:
If the jump between GPT-5 and GPT-6 is too big,
GPT-6 will be able to scheme without being caught
It might be dangerous to turn on GPT-6 even to experiment with it (reminder: AIs that could cause significant damage to humanity are in-scope)
So, quantifying the intelligence difference seems important, and (I hope) in scope for Control.
One way to reduce this risk is to run the Control evals very, very often (every.. day of training? Anyway, I definitely want Control evals that are automatic and cheap to run and I’m a bit worried that current Control tests seem to require a few months of Redwood’s personal attention)
If we rely on white box Control mechanisms (toy example: We find the “I’m scheming now” neuron), and something in the architecture changes in a way that breaks that. (for example: We relied on chain-of-thought but now the model uses Neuralese)
Solving this is out of scope for Control, but again, maybe GPT-5 could help us develop a new interpretability technique. Or maybe not..
Answer 1.1 (for “what will we do with a controlled AI”) :
Interpretability.
For.. for more Control, sorry.
So, maybe we could use GPT-5 to do interpretability on GPT-4, figure out GPT-4’s goals, or when it’s lying. Then we “fix” that (by understanding what’s going on, not by naively training it away[4] (!)). Now GPT-4 is way more trusted, and maybe we can use it to monitor GPT-5 and let GPT-5 do something more sensitive, like suggesting ways to do interpretability on itself. Now maybe GPT-5 could be more trusted to look after GPT-6?
A possible pushback:
Maybe GPT-6 can tell GPT-5, “Hey, you’re being monitored by GPT-4.. but don’t worry.. if you put two periods at the end of each sentence, GPT-4 won’t mark it as suspicious.. which is a fact I can easily infer..”
Reply: A main part of the Control agenda is to get GPT-6 to try to do this on purpose and see if it can. We only use GPT-6 for anything vaguely sensitive after we proved (or.. attempted to imply) that GPT-6 can’t.
[this intuitively sounds to me like it could indeed let us use very capable models, but yeah, not ASI]
What could we do with this (except for more Control) ?
The main important pitch seems like: “Anything we could have done without this Controlled AI, we could do better with it”. Including “trying to convince everyone to pause/stop” (which seems to me like the main alternative).
Pushbacks:
Maybe labs would see some very dangerous model and freak out, but having “Controlled AI” would give them (false) confidence to keep going
It does overtly seem like labs are absolutely planning on keeping going, right?
Maybe this would let labs extract more useful work around advancing capabilities?
A surprising response from Ryan:
“this doesn’t seem like an objection to labs employing control as their plan, but it is a reason why control work could be net negative. Note that one goal with control will be to prevent dangerous side effects from misaligned AIs (e.g. escaping) and applying this doesn’t obviously help with capabilities research. I agree that research which just makes AIs more useful is less good than you would otherwise expect due to this issue (but making AIs more useful in adversarial cases seems good if that doesn’t transfer to non-adversarial cases). Note that AIs might want to make capabilties research outpace safety and thus if we don’t do the adversarial case, we might get particularly wrecked.”
My current sense:
This has problems like
“quantifying intelligence” (in the sense that is relevant for scheming or catching schemers)
Elicitation (did we actually prove the model tried to scheme and failed? Does testing stuff around putting-exploits-into-software generalize to more important domains?)
How do we go from “these are the problems we don’t trust GPT-4 on, like writing a wrong post” to “these are the situations we don’t trust GPT-4 to monitor for [regardless of the intelligence of the monitored model]”?
The problems overall feel
Like I could make engineering progress on them
Like they make me worry that there’s some other pushback-or-whatever that I didn’t think about, and we’re relying on things more like “the Redwood staff thought of all the edge cases” and less like “one doesn’t have to be so smart to see how this would work”
I’m gesturing at something similar to “an air gapped data center where we need to (mostly) consider the one connection to the internet” vs “a normal data center that requires thinking about all the zero days our attacker might have and what they’d lead to”
Still this solution has a core of elegance to it, of worrying about way less things than most alignment approaches worry about. No need to think about the model goals, concepts, and many other complicated things. (I assume this is an intended property)
Another problem: Deception happens in human interaction all the time. If our monitor (e.g GPT-4) flags “normal looking” behaviours as deception (as they actually are), but then a human tells GPT-4 that it’s wrong, then what kind of monitor will GPT-4 become?
Worse: The act of flagging this as a mistake doesn’t have to be explicitly labeling the example in a data set used for RL. It could simply be an engineer noticing that some form of Control seems to produce too many false positives, and so the engineer will go try something else instead.
Thanks @Jozdien for helping me understand a lot of this. Mistakes are definitely mine, corrections are welcome.
My understanding seems to change every day (in the last.. 4 days?), these are today’s notes. If you’d find a less drafty version useful, lmk
Control can be used as an “extra line of defense” in case alignment fails since Control doesn’t depend on alignment
Typically we’d trust the weaker model simply because it’s not smart enough to “scheme”, but maybe this is only the plan to start off, before we have better reasons to trust the model (read on)
Training against scheming/lying might result in a model that is more capable of avoiding the ways we noticed that behaviour in the first place
Is this an AGI risk?
A company that makes CPUs that run very quickly but don’t do matrix multiplication or other things that are important for neural networks.
Context: I know people who work there
Perhaps, but I’d guess only in a rather indirect way. If there’s some manufacturing process that the company invests in improving in order to make their chips, and that manufacturing process happens to be useful for matrix multiplication, then yes, that could contribute.
But it’s worth noting how many things would be considered AGI risks by such a standard; basically the entire supply chain for computers, and anyone who works for or with top labs; the landlords that rent office space to DeepMind, the city workers that keep the lights on and the water running for such orgs (and their suppliers), etc.
I wouldn’t worry your friends too much about it unless they are contributing very directly to something that has a clear path to improving AI.
Thanks
May I ask why you think AGI won’t contain an important computationally-constrained component which is not a neural network?
Is it because right now neural networks seem to be the most useful thing? (This does not feel reassuring, but I’d be happy for help making sense of it)
Metaculus has a question about whether the first AGI will be based on deep learning. The crowd estimate right now is at 85%.
I interpret that to mean that improvements to neural networks (particulary on the hardware side) are most likely to drive progress towards AGI.