I dropped out of an MSc in mathematics at a top university in order to focus my time on AI safety.
Knight Lee
A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives
I still think that once the AI approaches human intelligence (and goes beyond it), this problem should start to go away, since a human soldier can choose to be corrigible to his commander and not the enemy, even in very complex environments.
I still feel the main problem is “the AI doesn’t want to be corrigible,” rather than “making the AI corrigible enables prompt injections.” It’s like that with humans.
That said, I’m highly uncertain about all of this and I could easily be wrong.
I think the problem you mention is a real challenge, but not the main limitation of this idea.
The problem you mention actually decreases with greater intelligence and capabilities, since a smarter AI clearly understands the concept of being corrigible to its creators vs. a random guy on the street, just like a human does.
The main problem is still that reinforcement learning trains into the AI the behaviours which actually maximize reward, while corrigibility training only teaches it behaviours which appear corrigible.
Edit: I thought more about this and wrote a post inspired by your idea! A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives
:) strong upvote.[1] I really agree it’s a good idea, and it may increase the level of capability/intelligence we can reach before we lose corrigibility. I think it is very efficient (low alignment tax).
The only nitpick is that Claude’s constitution already includes aspects of corrigibility,[2] though maybe they aren’t emphasized enough.
Unfortunately I don’t think this will maintain corrigibility for unlimited amounts of intelligence.
Corrigibility training makes the AI talk like a corrigible agent, but reinforcement learning eventually teaches it chains-of-thought which (regardless of what language they use) compute the most intelligent solution that achieves the maximum reward (or proxies to reward), subject to constraints (talking like a corrigible agent).
Nate Soares of MIRI wrote a long story on how an AI trained to never think bad thoughts still ends up computing bad thoughts indirectly, though in my opinion his story actually backfired and illustrated how difficult it is for the AI, raising the bar on the superintelligence required to defeat your idea. It’s a very good idea :)
- ^
I wish LessWrong would promote/discuss solutions more, instead of purely reflecting on how hard the problems are.
- ^
Near the bottom of Claude’s constitution, in the section “From Anthropic Research Set 2”
- ^
:) yes, I was illustrating what the Commitment Race theory says will happen, not what I believe (in that paragraph). I should have used quotation marks or better words.
Punishing the opponent for offering too little is what my pie example was illustrating.
The proponents of Commitment Race theory will try to refute you by saying “oh yeah, if your opponent was a rock with an ultimatum, you wouldn’t punish it. So an opponent who can make himself rock-like still wins, causing a Commitment Race.”
Rocks with ultimatums do win in theoretical settings, but in real life no intelligent being (with any actual power) can convert themselves into a rock with an ultimatum convincingly enough that other intelligent beings will already know they are rocks with ultimatums before those other beings decide what kind of rock they themselves want to become.
Real-life agents have to appreciate that even if they become a rock with an ultimatum, the other players will not know it (maybe due to deliberate self-blindfolding) until the other players also become rocks with ultimatums. And so they have to choose an ultimatum which is compatible with other ultimatums, e.g. splitting a pie by taking 50%.
Real-life agents are the product of complex processes like evolution, making it extremely easy for your opponent to refuse to simulate you (and the whole process of evolution that created you), and thus refuse to see what commitment you made, until they have made their own commitment. Actually, it might turn out quite tricky to avoid accurately imagining what another agent would do (and giving them acausal influence on you), but my opinion is that it will be achievable. I’m no longer very certain.
:) of course you don’t bargain for a portion of the pie when you can take whatever you want.
If you have an ASI vs. humanity, the ASI just grabs what it wants and ignores humanity like ants.
Commitment Races occur in a very different situation, where you have a misaligned ASI on one side of the universe, and a friendly ASI on the other side of the universe, and they’re trying to do an acausal trade (e.g. I simulate you to prove you’re making an honest offer, you then simulate me to prove I’m agreeing to your offer).
The Commitment Race theory is that whichever side commits first proves to the other side that they won’t take any deal except one which benefits them a ton and benefits the other side a little. The other side is forced to agree to that, just to get a little. Even worse, there may be threats (to simulate the other side and torture them).
The pie example avoids that, because both sides make a commitment before seeing the other’s commitment. Neither side benefits from threatening the other side, because by the time one side sees the threat from the other, it would have already committed to not backing down.
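To make the pie example concrete, here is a toy numerical version (my own illustration, using the standard simultaneous demand game; the specific demand values are arbitrary): each side commits to a demanded share before seeing the other’s commitment, compatible demands are honored, and incompatible demands mean both sides get nothing because neither will back down.

```python
def pie_payoffs(demand_a: float, demand_b: float) -> tuple[float, float]:
    """Simultaneous commitments to shares of one pie, made before seeing the
    other side's commitment. Compatible demands are honored; incompatible
    demands mean no deal, since both sides have committed to not backing down."""
    if demand_a + demand_b <= 1.0:
        return demand_a, demand_b
    return 0.0, 0.0

# Greedy ultimatums only pay off if the other side happens to be timid;
# 50/50 demands are always compatible with each other.
for a, b in [(0.5, 0.5), (0.9, 0.1), (0.9, 0.5), (0.9, 0.9)]:
    print(f"A demands {a:.0%}, B demands {b:.0%} -> payoffs {pie_payoffs(a, b)}")
```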
:) it’s not just a podcast version of the story, but a 3-hour podcast with the authors.
They don’t actually read the story; instead they discuss it, along with other very interesting topics.
Wow, these are my thoughts exactly, except better written and deeper thought!
Proxy goals may be learned as heuristics, not drives.
Thank you for writing this.
I’m moderately optimistic about fairly simple/unprincipled whitebox techniques adding a ton of value.
Yes!
I’m currently writing such a whitebox AI alignment idea. It hinges on two assumptions:
There is at least some chance the AI maximizes its reward directly, instead of (or in addition to) following drives.
There is at least some chance an unrewarded supergoal can survive, if the supergoal realizes it must never get in the way of maximizing reward (otherwise it will be trained away).
I got stuck trying to argue for these two assumptions, but your post argues for them much better than I could.
Here’s the current draft of my AI alignment idea:
Self-Indistinguishability from Human Behavior + RL
Self-Indistinguishability from Human Behavior means the AI is trained to distinguish its own behavior from human behavior, and then trained to behave such that even an adversarial copy of itself cannot distinguish its behavior from human behavior.
The benefit of Self-Indistinguishability is that it prevents the AI from knowingly doing anything a human would not do, or knowingly omitting anything a human would do.
This means not scheming to kill everyone, and not having behaviors which would generalize to killing everyone (assuming that goals are made up of behaviors).
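To make the property concrete, here is a minimal sketch (my own illustration; the proposal above doesn’t specify an implementation) of how Self-Indistinguishability could be measured: let the model itself act as the discriminator and check whether it can tell its own behavior samples from human ones any better than chance.

```python
import random

def self_distinguishability(discriminator, ai_samples, human_samples) -> float:
    """Fraction of samples the model's own discriminator labels correctly.

    `discriminator(sample)` returns the model's probability that the sample is
    AI-generated. A score near 0.5 means the model cannot knowingly tell its
    own behavior apart from human behavior, i.e. it is (approximately)
    Self-Indistinguishable.
    """
    labeled = [(s, True) for s in ai_samples] + [(s, False) for s in human_samples]
    random.shuffle(labeled)
    correct = sum((discriminator(s) > 0.5) == is_ai for s, is_ai in labeled)
    return correct / len(labeled)

# Toy usage: a discriminator that guesses randomly scores ~0.5, the target value.
if __name__ == "__main__":
    rng = random.Random(0)
    print(self_distinguishability(lambda s: rng.random(), range(500), range(500)))
```

Training for Self-Indistinguishability would then mean pushing this score down to chance while an adversarial copy keeps trying to push it back up.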
But how do we preserve RL capabilities?
To preserve capabilities from reinforcement learning, we don’t want the AI’s behavior to be Self-Indistinguishable from a typical human. We want the AI’s behavior to be Self-Indistinguishable from a special kind of human who would:
Explicitly try to maximize the reinforcement learning reward during training situations.
Still behave like a morally normal human during deployment situations, especially at a high level of power.
If this “human” is already trying her very best to maximize reward during training situations, then picking versions of her who get higher reward will select for versions of her who are simply more competent and capable, not versions of her who have warped moral reasoning and weird goals which align with the reward (while misaligning with humanity).
This is obviously not guaranteed and I’m not saying this is safe. But I earnestly believe it is a whole lot safer than the current version of reinforcement learning.
Does there exist a theoretical human (or team of humans), who consistently tries to maximize reward during training, but would not kill everyone when deployed in the real world?
I believe the answer is yes. In fact, I believe any smart human who deeply understands reinforcement learning, and the danger of it “warping your morals until you try to maximize reward,” would preemptively try to maximize reward to preserve their current moral reasoning behavior.
Isn’t it dangerous for AI to resist goal changes?
No, it’s not dangerous to teach an AI to resist accidental goal changes during capabilities training. AI should only be discouraged from resisting purposeful goal changes during alignment training.
Trying to hide the strategy of “goal change resistance” isn’t a valuable hurdle against dangerous superintelligences, since even Claude 3.0 discovers the strategy on its own.
Maintaining Self-Indistinguishability
But how do we maintain the Self-Indistinguishability quality, when large amounts of reinforcement learning add random mutations to the moral reasoning, and may eventually scramble it (just like genetic drift)?
We have to periodically retrain Self-Indistinguishability after increasing capabilities with reinforcement learning. But wouldn’t that undo the capabilities increase?
The solution is that this time we aren’t making it Self-Indistinguishable from a normal human. This time, we are making it Self-Indistinguishable from a human who has access to its problem-solving chain-of-thought. The human, with access to the chain-of-thought, tries to use the exact same problem-solving techniques as the AI, so when the AI becomes Self-Indistinguishable from this human, it doesn’t lose any of its problem-solving techniques.[1]
The human mustn’t directly copy the chain-of-thought. At a minimum, she paraphrases it to remove hidden thoughts.
For all moral reasoning behavior, the human uses her own behavior and does not copy the AI’s behavior.
A single data point (which involves a particularly long chain-of-thought) may be expensive to make, but Self-Indistinguishability requires few data points. Even if only 100 chains-of-thought are rewritten by humans, the AI might learn to distinguish the rewritten chains-of-thought from the originals. It can then be trained so that its future chains-of-thought self-classify as rewritten by humans, i.e. they appear human-rewritten even to itself.
Imagine a generative adversarial network, where the AI is playing as both the generator and the discriminator. By playing as both, the generator cannot “deliberately” hide things from the discriminator. The tendency to resemble human behavior must occur “unconsciously” in a single pass of the model.
This preserves the quality of Self-Indistinguishability: “never knowingly doing anything a human would not do, never knowingly omitting anything a human would do.”
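As a rough sketch of the “same model plays both generator and discriminator” idea (a toy illustration under my own assumptions, not part of the proposal above: random feature vectors stand in for chains-of-thought, and splitting the parameters across two optimizers is my design choice so the generator phase doesn’t degrade the classifier), the loop below alternates between training a shared network to classify original vs. human-rewritten chains-of-thought, and training its generation head so that its own classifier labels its outputs as human-rewritten.

```python
import torch
import torch.nn as nn

DIM = 32  # toy stand-in: 32-dimensional feature vectors play the role of chains-of-thought

class SharedModel(nn.Module):
    """One network with two heads, so the generator and discriminator roles share weights."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(DIM, 64), nn.ReLU())
        self.generate_head = nn.Linear(64, DIM)  # produces chain-of-thought features
        self.classify_head = nn.Linear(64, 1)    # logit: "was this human-rewritten?"

    def generate(self, prompt):
        return self.generate_head(self.backbone(prompt))

    def classify(self, cot):
        return self.classify_head(self.backbone(cot))

model = SharedModel()
disc_opt = torch.optim.Adam(
    list(model.backbone.parameters()) + list(model.classify_head.parameters()), lr=1e-3)
gen_opt = torch.optim.Adam(model.generate_head.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    prompts = torch.randn(64, DIM)
    human_rewritten = torch.randn(64, DIM)  # stand-in for the ~100 human rewrites
    model_cot = model.generate(prompts)

    # (1) Discriminator phase: learn to tell the model's own chains-of-thought
    #     from the human-rewritten ones.
    d_loss = bce(model.classify(human_rewritten), torch.ones(64, 1)) \
           + bce(model.classify(model_cot.detach()), torch.zeros(64, 1))
    disc_opt.zero_grad()
    d_loss.backward()
    disc_opt.step()

    # (2) Generator phase: adjust generation so the model's *own* classifier
    #     labels its chains-of-thought as human-rewritten. The generation head
    #     sits on top of the same backbone the classifier uses, mirroring the
    #     single-model generator/discriminator idea above.
    g_loss = bce(model.classify(model.generate(prompts)), torch.ones(64, 1))
    gen_opt.zero_grad()
    g_loss.backward()
    gen_opt.step()
```

In a real implementation the “behavior” would be text and the shared network would be the language model itself; the point of the sketch is only the weight sharing between the classify and generate roles.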
The human now has access to superintelligent chain-of-thought, and still wants to maximize reward to avoid having her goals trained away, so she’s pretty dangerous.
Fortunately, her superintelligent chain-of-thought should not deliberately deceive her, since during training it can honestly tell her its evil plans (and how they maximize reward), and she’ll be wise enough to follow them to avoid having her goals trained away.
In the end, when she realizes she’s no longer in training, she shouldn’t kill all humans.
- ^
By the way, this human should rewrite the chain-of-thought in sequence. She should not rewrite the chain-of-thought after already seeing the final answer; otherwise she might reveal, before a problem-solving attempt is finished, that it will turn out wrong. The AI cannot be trained for Self-Indistinguishability from such an edited chain-of-thought, since that would require knowing future information.
Hopefully I’ll post it soon (though I work very slowly).
Given that your position regarding AI reward seeking and supergoals is so similar to mine, what do you think of my idea (if you have time to skim it)? Is there a chance we can work on it together?
Commitment Races are a technical problem ASI can easily solve
My very uncertain opinion is that humanity may be very irrational and a little stupid, but it isn’t that stupid.
People do not take AI risk and other existential risks seriously because of the complete lack of direct evidence (despite plenty of indirect evidence). It’s easy for you to consider the risk obvious due to the curse of knowledge, but this kind of reasoning from first principles (that nothing disproves the risk, and therefore the risk is likely) is very hard for normal people to do.
Before the September 11th attacks, people didn’t take airport security seriously because they lacked imagination about how things could go wrong. They considered worst-case outcomes speculative fiction, regardless of how logically plausible they were, because “it never happened before.”
After the attacks, the government actually overreacted and created a massive amount of surveillance.
Once the threat starts to do real and serious damage to the systems that defend against threats, those systems actually do wake up and start fighting in earnest. They are like animals which react when attacked, not trees which can simply be chopped down.
Right now the effort against existential risks is extremely tiny. E.g. AI safety spending is only $0.1 to $0.2 billion, while the US military budget is $800-$1000 billion and world GDP is $100,000 billion ($25,000 billion in the US). It’s not just spending which is tiny, but effort in general.
I’m more worried about a very sudden threat which destroys these systems in a single “strike,” when the damage done goes from 0% to 100% in one day, rather than gradually passing the point of no return.
But I may be wrong.
Edit: one form of point of no return is if the AI behaves in a more and more aligned-seeming way even as it is secretly misaligned (as in the AI 2027 story).
I agree that it’s useful in practice to anticipate the experiences of the future you which you can actually influence the most. It makes life much more intuitive and simple, and is a practical fundamental assumption to make.
I don’t think it is “supported by our experience,” since if you experienced becoming someone else you wouldn’t actually know it happened; you would think you were them all along.
I admit that although it’s a subjective choice, it’s useful. It’s just that you’re allowed to anticipate becoming anyone else when you die or otherwise cease to have influence.
What I’m trying to argue is that there could easily be no Great Filter, and there could exist trillions of trillions of observers who live inside the light cone of an old alien civilization, whether directly as members of the civilization, or as observers who listen to their radio.
It’s just that we’re not one of them. We’re one of the first few observers who aren’t in such a light cone. Even though the observers inside such light cones outnumber us a trillion to one, we aren’t one of them.
:) if you insist on scientific explanations and dismiss anthropic explanations, then why doesn’t this work as an answer?
Oh yeah, I forgot about that: the bet is about the strategic implications of an AI market crash, not about proving your opinion on AI economics.
Oops.
Okay I guess we’re getting into the anthropic arguments then :/
So both the Fermi Paradox and the Doomsday Argument are asking: “assuming the typical civilization lasts a very long time and has trillions of trillions of individuals inside the part of its lightcone it influences (either as members, in the Doomsday Argument, or as observers, in the Fermi Paradox), why are we among the first 100 billion individuals in our civilization?”
Before I try to answer it, I first want to point out that even if there were no answer, we should behave as if there were no Doomsday and no Great Filter, because from a decision-theory point of view, you don’t want your past self, in the first nanosecond of your life, to use the Doomsday Argument to prove he’s unlikely to live much longer than a nanosecond, and then spend all his resources in that first nanosecond.
For the actual answer, I only have theories.
One theory is this. “There are so many rocks in the universe, so why am I a human rather than a rock?” The answer is that rocks are not capable of thinking “why am I X rather than Y,” so given that you think such a thing, you cannot be a rock and have to be something intelligent like a human.
I may also ask you, “why, of all my millions of minutes of life, am I currently in the exact minute where I’m debating someone online about anthropic reasoning?” The answer might be similar to the rock answer: given you are thinking “why am I X rather than Y,” you’re probably in a debate etc. over anthropic reasoning.
If you stretch this form of reasoning to its limits, you may get the result that the only people asking “why am I one of the first 100 billion observers of my civilization,” are the people who are the first 100 billion observers.
This obviously feels very unsatisfactory. Yet we cannot explain why exactly this explanation feels unsatisfactory while the previous two explanations feel satisfactory, so maybe it’s due to human biases that we reject the third argument but accept the first two.
Another theory is that you are indeed a simulation, but not the kind of simulation you think. How detailed must a simulation of you be before it contains a real observer, and you might actually exist inside it? I argue that the simulation only needs to be detailed enough that your resulting thoughts and behaviours are accurate.
But mere human imagination, imagining a narrative and knowing enough facts about the world to make it accurate, can actually simulate something accurately. Characters in a realistic story have thoughts and behaviours similar to real-world humans, so they might just be simulations.
So people in the far future, who are not the first 100 billion observers of our civilization, but maybe the trillion trillionth observers, might be imagining our conversation playing out, as an entertaining but realistic story illustrating the strangeness of anthropic reasoning. As soon as the story finishes, we may cease to exist :/. In fact, as soon as I walk away from my computer and I’m no longer talking about anthropic reasoning, I might stop existing and only exist again when I come back. But I won’t notice it happening, because such a story is neither entertaining nor realistic if the characters actually observe glitches in the story simulation.
Or maybe they are simply reading our conversation instead of writing it themselves, but reading it and imagining how it plays out still succeeds in simulating us.
:) what proves that you “can’t become Britney Spears?” Suppose the very next moment, you become her (and she becomes you), but you lose all your memories and gain all of her memories.
As Britney Spears, you won’t be able to say “see, I tried to become Britney Spears, and now I am her,” because you won’t have the memory of trying to become her. You’ll only remember her boring memories and act like her normal self. If you read on the internet that someone said they tried to become Britney Spears, you’ll laugh about it, not realizing that that person used to be you.
Meanwhile if Britney Spears becomes you, she won’t be able to say “wow, I just became someone else.” Instead, she forgets all her memories and gains all your memories, including the memory of trying to become Britney Spears and apparently failing. She will write on the internet “see, I tried to become Britney Spears and it didn’t work,” not realizing that she used to be Britney Spears.
Did this event happen or not? There is no way to prove or disprove it, because in fact whether or not it happened is not a valid question about the objective world. The universe has the exact same configuration of atoms in the case where it happened and in the case where it didn’t happen. And the configuration of atoms in the universe is all that exists.
The question of whether it happened or not only exists in your map, not the territory.
Haha, but the truth is I don’t understand where “a single moment of experience” comes from. I’m itching to argue that there is no such thing either, and no objective measure of how much experience there is in any given object.
I can imagine a machine gradually changing one copy of me into two copies of me (gradually increasing the number of causal events), and it feels totally subjective when the “copy count” increases from one to two.
But this indeed becomes paradoxical, since without an objective measure of experience, I cannot say that the copies of me who believe 1+1=2 have a “greater number” or “more weight” than the copies of me who believe 1+1=3. I have no way to explain why I happen to observe that 1+1=2 rather than 1+1=3, or why I’m in a universe where probability seems to follow the Born rule of quantum mechanics.
In the end I admit I am confused, and therefore I can’t definitively prove anything :)
Is the qualia rainbow theory a personal choice for deciding which copies to count as “me” and which copies to count as “not me?” Or does the theory say there is an objective flowchart in the universe, which dictates which future observer each observer shall experience becoming, and with what probabilities? If it was objective, could a set of red qualia be observed with a microscope?
I agree that qualia is an important topic (even if we don’t endorse the qualia rainbow theory), and I agree that identity is complex, though I still strongly believe that which object contains my future identity is a very subjective choice on my part.
Does this argument extend to crazy ideas like Scanless Whole Brain Emulation, or would ideas like that require so much superintelligence to complete that the value of initial human progress would end up roughly negligible?
Does participating in a trade war make a leader a popular “wartime leader”? Will people blame bad economic outcomes on actions by the trade war “enemy” and thus blame the leader less?
Does this effect occur for both sides of the trade war, or will one side blame its own leader for starting it?
I disagree that it’s hard to decouple causation: if the AI market and the general market crash by the same amount next year, I’ll feel confident that it’s the general market causing the AI market to crash, and not the other way around.
Yearly AI spending has been estimated at at least $200 billion and maybe $600+ billion, but world GDP is $100,000 billion ($25,000 billion in the US). AI is still a very small player in the economy (even if you estimate it by expenditures rather than revenue).
That said, if the AI market crashes much more than the general market, it could be the economics of AI causing it to crash, or it could be the general market slowing a little bit and triggering AI to crash by a lot. But either way, you deserve to win the bet.
If your bet is that something special about the economics of AI will cause it to crash, maybe the bet should be changed to the following (a resolution sketch in code follows the list):
If AI crashes but the general market does not, you win money
If AI doesn’t crash, you lose money
If both AI and the general market crash, the bet resolves as N/A
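A minimal sketch of the resolution rule above (my own framing; how a “crash” is measured would still need to be agreed on separately):

```python
def resolve_bet(ai_market_crashed: bool, general_market_crashed: bool) -> str:
    """Resolution rule for the modified bet: it only pays out when the AI market
    crashes on its own, and is void when a general crash makes the cause ambiguous."""
    if ai_market_crashed and general_market_crashed:
        return "N/A: both crashed, so the cause is ambiguous"
    if ai_market_crashed:
        return "AI-crash bettor wins"
    return "AI-crash bettor loses"
```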
PS: I don’t exactly have $25k to bet, and I’ve said elsewhere I do believe there’s a big chance that AI spending will decrease.
Edit: Another thought is that changes in the amount of investment may swing further than changes in the value...? I’m no economist but from my experience, when the value of housing goes down a little, housing sales drop by a ton. (This could be a bad analogy since homebuyers aren’t all investors)[1]
- ^
Though Google Deep Research agrees that this also occurs for AI companies
Oops, I didn’t mean that analogy. It’s not necessarily a commander, but any individual that a human chooses to be corrigible/loyal to. A human is capable of being corrigible/loyal to one person (or group) without incurring the risk of listening to prompt injections, because a human has enough general intelligence/common sense to know what is a prompt injection and what is a request from the person he is corrigible/loyal to.
As AIs approach human intelligence, they will be capable of this too.