I dropped out of an MSc in mathematics at a top university in order to focus my time on AI safety.
Knight Lee
At some point there have to be concrete plans; yes, without concrete plans nothing can happen.
I’m probably not the best person in the world to decide how the money should be spent, but one vague possibility is this:
Some money is spent on making AI labs implement risk reduction measures, such as simply making their networks more secure against hacking, and implementing AI alignment and AI control ideas which show promise but are expensive.
Some money is given to organizations and researchers who apply for grants. Universities might study AI alignment in the same way they study other arts and sciences.
Some money is spent on teaching people about AI risk so that they’re more educated? I guess this is really hard, since the field itself disagrees on what is correct, so it’s unclear what to teach.
Some money is saved in the form of a war chest. E.g. if we get really close to superintelligence, or catch an AI red-handed, we might take drastic measures. We might have to immediately shut down AI, but if society is extremely dependent on it, we might need to spend a lot of money helping people who feel uprooted by the shutdown. In order to make a shutdown less politically difficult, people who lose their jobs may be temporarily compensated, and businesses relying on AI may be bought rather than forced into bankruptcy.
Probably not good enough for you :/ but I imagine someone else can come up with a better plan.
I think that just because every defence they experimented with got obliterated by drone swarms doesn’t mean they should stop trying, because they might figure out something new in the future.
It’s a natural part of life to work on a problem without any idea what the solution will be like. The first people who studied biology had no clue what modern medicine would look like, but their work was still valuable.
Being unable to imagine a solution does not prove a solution doesn’t exist.
If everyone else is also unqualified because the problem is so new, and every defence they experimented with got obliterated by drone swarms, then you would agree they should just give up, and admit military risk remains a big problem but spend far less on it, right?
Suppose you had literally no ideas at all for how to counter drone swarms, and you were really bad at judging other people’s ideas for countering drone swarms. In that case, would you, upon discovering that your country’s adversaries had developed drone swarms (making your current tanks and ships obsolete), decide to give up on military spending and cut it by 100 times?
Please say you would or explain why not.
My opinion is that you can’t give up (i.e. admit there is a big problem but spend extremely little on it) until you fully understand the nature of the problem with certainty.
Money isn’t magic, but it determines the number of smart people working on the problem. If I was a misaligned superintelligence, I would be pretty scared of a greater amount of human intelligence working to stop me from being born in the first place. They get only one try, but they might actually stumble across something that works.
If you believe that spending more on safety leads to acceleration instead, you should try to refute my argument for why it is a net positive.
I’m honestly very curious how my opponents will reply to my “net positive” arguments, so I promise I’ll appreciate a reply and upvote you.
I pasted it in this comment so you don’t have to look for it:
Why I feel almost certain this open letter is a net positive
Delaying AI capabilities alone isn’t enough. If you wished for AI capabilities to be delayed by 1000 years, then one way to fulfill your wish is if the Earth had formed 1000 years later, which delays all of history by the same 1000 years.
Clearly, that’s not very useful. AI capabilities have to be delayed relative to something else.
That something else is either:
Progress in alignment (according to optimists like me)
or
Progress towards governments freaking out about AGI and going nuclear to stop it (according to LessWrong’s pessimist community)
Either way, the AI Belief-Consistency Letter speeds up that progress by many times more than it speeds up capabilities. Let me explain.
Case 1:
Case 1 assumes we have a race between alignment and capabilities. From first principles, the relative funding of alignment and capabilities matters in this case.
Increasing alignment funding by 2x ought to have a similar effect to decreasing capability funding by 2x.
Various factors may make the relationship inexact, e.g. one might argue that increasing alignment by 4x might be equivalent to decreasing capabilities by 2x, if one believes that capabilities are more dependent on funding.
But so long as one doesn’t assume insane differences, the AI Belief-Consistency Letter is a net positive in Case 1.
This is because alignment funding is only at $0.1 to $0.2 billion, while capabilities funding is at $200+ billion to $600+ billion.
If the AI Belief-Consistency Letter increases both by $1 billion, that’s a 5x to 10x alignment increase and only a 1.002x to 1.005x capabilities increase. That would clearly be a net positive.
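As a rough back-of-the-envelope check using the funding figures above: $1 billion is 5 to 10 times the current alignment budget, but only about 0.2% to 0.5% of the current capabilities budget.

$$\frac{\$1\text{B}}{\$0.1\text{B}} = 10, \qquad \frac{\$1\text{B}}{\$0.2\text{B}} = 5, \qquad \frac{\$1\text{B}}{\$200\text{B}} = 0.005, \qquad \frac{\$1\text{B}}{\$600\text{B}} \approx 0.0017$$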
Case 2:
Even if the wildest dreams of the AI pause movement succeed, and the US, China, and EU all agree to halt all capabilities above a certain threshold, the rest of the world still exists, so it only reduces effective capabilities funding by about 10x.
That would be very good, but we’ll still have a race between capabilities and alignment, and Case 1 still applies. The AI Belief-Consistency Letter still increases alignment funding by far more than capabilities funding.
The only case where we should not worry about increasing alignment funding is if capabilities funding is reduced to zero, and there’s no longer a race between capabilities and alignment.
The only way to achieve that worldwide is to “solve diplomacy,” which is not going to happen, or to “go nuclear,” as Eliezer Yudkowsky suggests.
If your endgame is to “go nuclear” and make severe threats to other countries despite the risk, you surely can’t oppose the AI Belief-Consistency Letter on the grounds that “it speeds up capabilities because it makes governments freak out about AGI,” since you actually need governments to freak out about AGI.
Conclusion
Make sure you don’t oppose this idea based on short term heuristics like “the slower capabilities grow, the better,” without reflecting on why you believe so. Think about what your endgame is. Is it slowing down capabilities to make time for alignment? Or is it slowing down capabilities to make time for governments to freak out and halt AI worldwide?
You make a very good point about political goals, and I have to agree that this letter probably won’t convince politicians whose political motivations prevent them from supporting AI alignment spending.
Yes, military spending indeed rewards constituents, and some companies go out of their way to hire people in multiple states etc.
PS: I actually mentioned the marginal change in a footnote, but I disabled the sidebar so maybe you missed it. I’ll add the sidebar footnotes back.
Thanks!
Examples
In How it feels to have your mind hacked by an AI, a software engineer fell in love with an AI and thought that if only AGI had her persona, it would surely be aligned.
Long ago Eliezer Yudkowsky believed that “To the extent someone says that a superintelligence would wipe out humanity, they are either arguing that wiping out humanity is in fact the right thing to do (even though we see no reason why this should be the case) or they are arguing that there is no right thing to do (in which case their argument that we should not build intelligence defeats itself).”
Larry Page allegedly dismissed concern about AI risk as speciesism.
Selection bias
In these examples, the believers eventually realized their folly, and favoured humanity over misaligned AI in the end.[1]
However, maybe we only see the happy endings due to selection bias! Someone who continues to work against humanity won’t tell you that they are doing so; e.g. during the brief period when Eliezer Yudkowsky was confused, he kept it a secret.
So the true number of people working against humanity is unknown. We only know the number of people who eventually snapped out of it.
Nonetheless, it’s not worthwhile to start a witch hunt, no matter how suspiciously someone behaves, because throwing such accusations will merely invite mockery.
- ^
At least for Blaked and Eliezer Yudkowsky. I don’t think Larry Page ever walked back or denied his statements.
What do you think about Eliezer Yudkowsky’s List of Lethalities? Does it confirm all of the issues you described? Or do you feel you partially misunderstood the current position of AI existential risk?
I think it’s completely okay to misunderstand the current position at least a bit the first time you discuss it, since it is rather weird and complicated haha.
Even if a few of your criticisms of the current position are off, the other insights are still good.
I think that’s a very important question, and I don’t know the answer to what we should buy.
However, suppose that not knowing what you should spend on dramatically decreases the total amount you should spend (e.g. by 10x). If that were really true in general, then imagine a country with a large military discovers that its enemies are building very powerful drone swarm weapons, which can easily and cheaply destroy all its tanks, aircraft carriers, and so forth.
Military experts are all confused and in disagreement about how to counter these drone swarms, just like the AI alignment community. Some of them say that resistance is futile, and the country is “doomed.” Others have speculative ideas like using lasers. Still others say that lasers are stupid, because the enemy can simply launch the swarms in bad weather and the lasers won’t reach them. Just like with AI alignment, there are no proven solutions, and every solution tested against drone swarms is destroyed pathetically.
Should the military increase its budget, or decrease its budget, since no one knows what you can spend money on to counter the drone swarms?
I think the moderate, cool-headed response is to spend a similar amount, exploring all the possibilities, even without having any ideas which are proven to work.
Uncertainty means the expected risk reduction is high
If we are uncertain about the nature of the risk, we might assume that with 50% probability, spending more money reduces the risk by a reasonable amount (similar to risks we do understand), and possibly even more, thanks to discovering brand new solutions instead of getting marginal gains on existing ones. And with 50% probability, spending more money is utterly useless, because we are at the mercy of luck.
Therefore, the efficiency of spending on AI risk should be at least half the efficiency of spending on military risk, or at least within the same order of magnitude. This argument is about orders of magnitude.
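A minimal expected-value sketch of that reasoning (my notation, not from the post: $e$ is the risk reduction per dollar for risks we do understand, such as military risk):

$$E[\text{risk reduction per dollar on AI risk}] \approx 0.5 \cdot e + 0.5 \cdot 0 = \tfrac{1}{2}\,e$$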
If increasing the time for alignment by pausing AI can work, so can increasing the money for alignment
Given that we effectively have a race between capabilities and alignment, the relative spending on capabilities and alignment seems important.
A 2x capabilities decrease should be similar in effect to a 2x alignment increase, or at least a 4x alignment increase.
The only case where decreasing capabilities funding works far better than increasing alignment funding is if we decrease capabilities funding to zero, using extremely forceful worldwide regulation and surveillance. But that would also require governments to freak out about AI risk (prioritize it as highly as military risk), which would itself benefit from this letter.
Hi,
By a very high standard, all kinds of reasonable advice are non-sequiturs. E.g. a CEO might explain to me, “if you hire Alice instead of Bob, you must also believe Alice is better for the company than Bob; you can’t just like her more,” but I might think, “well, that’s clearly a non-sequitur; just because I hire Alice instead of Bob doesn’t imply Alice is better for the company than Bob, since maybe Bob is a psychopath who would improve the company’s fortunes by committing crime and getting away with it, so I hire Alice instead.”
X doesn’t always imply Y, but in cases where X doesn’t imply Y there has to be an explanation.
In order for the reader to agree that AI risk is far higher than 1/8000th of the military risk, but still insist that 1/8000th of the military budget is justified, they would need a big explanation, e.g. that the marginal benefit of spending 10% more on the military reduces military risk by 10%, while the marginal benefit of spending 10% more on AI risk somehow only reduces AI risk by 0.1%, because AI risk is far more independent of countermeasures.
It’s hard to have such drastic differences, because one would need to be very certain that AI risk is unsolvable. If one were uncertain of the nature of AI risk, and there existed plausible models where spending a lot reduces the risk a lot, then these plausible models would dominate the expected value of risk reduction.
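To make that concrete with made-up numbers (a hedged illustration, not figures from the letter): even if one assigns only a 10% probability to models where a dollar spent on AI risk is about as effective as a dollar spent on military risk (call that effectiveness $e$), and a 90% probability to it being useless, then

$$E[\text{effectiveness}] \geq 0.1 \cdot e + 0.9 \cdot 0 = 0.1\,e,$$

which is at most a 10x gap in expected effectiveness per dollar, nowhere near enough to justify an 8000x gap in budgets.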
Thank you for pointing out that sentence, I will add a footnote for it.
If we suppose that military risk for a powerful country (like the US) is lower than the equivalent of an 8% chance of catastrophe (killing 1 in 10 people) by 2100, then 8000 times less would be a 0.001% chance of catastrophe by 2100.
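Spelled out, that arithmetic is just

$$\frac{8\%}{8000} = 0.001\%.$$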
I will also add a footnote for the marginal gains.
Thank you, this is a work in progress, as the version number suggests :)
:) the real money was the friends we made along the way.
I dropped out of a math MSc at a top university in order to spend time learning about AI safety. I haven’t made a single dollar and now I’m working as a part-time cashier, but that’s okay.
What use is money if you end up getting turned into paperclips?
PS: do you want to sign my open letter asking for more alignment funding?
Hi, I read your post from start to end, here are my little opinions. Hopefully I’m not too wrong.
I can tell you are all extremely intelligent and knowledgeable about the world, based on the insightful connections you made between many domains.
I love your philosophy of making multiple efforts which assume different axioms (i.e. possibilities which can’t be disproven).
You did your research, reading about instrumental convergence, Coherent Extrapolated Volition, mistake theory vs. conflict theory.
Nonetheless, I sense you are relatively new to AI existential risk, just like I am! This is good news and bad news.
The good news is that, in my opinion, AI existential risk needs fresh insights from people on the outside.
People who are very smart like you all, who don’t carry preconceived notions, and bring insights from multiple other fields.
The bad news is that when you discuss the field’s current paradigm (whether it follows Mistake Theory, how thick a morality “value alignment” aims for, etc.), you won’t be 100% accurate, understandably.
If you want to learn more about the current paradigm, Eliezer Yudkowsky’s List of Lethalities is like LessWrong’s bible haha. Corrigibility also dovetails with your work a little bit.
I like the attitude of caring about the concerns and freedoms of different groups of people :)
I agree that current AI existential risk discussion underestimates the importance of human psychology. I believe the best hope for alignment isn’t finding the one true reward function, but preserving the human behaviour of pretrained base models.
I completely agree that understanding human norms, and why exactly normal humans don’t kill everyone to make paperclips, is potentially very useful for AI alignment against existential risk.
If we know which norms are preventing normal humans from killing everyone, we might deduce which reinforcement learning settings can damage those norms (by applying gradient descent towards behaviour which breaks them).
Thank you so much for your work in this area, it might mean a lot!
PS: if you have time, can you also comment on my post? It’s the AI Belief-Consistency Letter, an attempt to prove that AI alignment is irrationally underfunded. Thanks :)
Admittedly this is not my expertise, I don’t know how it works, all I know is that it’s considered a real problem that affects a lot of people.
Don’t take my definition too seriously, I think I omitted the part where they’re trafficked far from their homes.
Maybe just like hospitals have inpatients and outpatients, prisons can have outprisoners who wear a device which monitors them all the time, and might even restrain them if needed.
It may actually work, but of course, it’s just a lil bit too tech dystopian to be politically viable.
Maybe define them to be people who do not want to be a sex worker, even if they take into account the fact they might have no money otherwise.
Yeah, sorry I didn’t mean to argue that Amdahl’s Law and Hofstadter’s Law are irrelevant, or that things are unlikely to go slowly.
I see a big chance that it takes a long time, and that I end up saying you were right and I was wrong.
However, if you’re talking about “contemplating the capabilities of something that is not a full ASI. Today’s models have extremely jagged capabilities, with lots of holes, and (I would argue) they aren’t anywhere near exhibiting sophisticated high-level planning skills able to route around their own limitations.”
That seems to apply to the 2027 “Superhuman coder” with 5x speedup, not the “Superhuman AI researcher” with 25x speedup or “Superintelligent AI researcher” with 250x.
I think “routing around one’s own limitations” isn’t necessarily that sophisticated. Even blind evolution does it, by trying something else when one thing fails.
As long as the AI is “smart enough,” even if they aren’t that superhuman, they have the potential to think many times faster than a human, with a “population” many times greater than that of AI researchers. They can invent a lot more testable ideas and test them all.
Maybe I’m missing the point, but it’s possible that we simply disagree on whether the point exists. You believe that merely discovering technologies and improving algorithms isn’t sufficient to build ASI, while I believe there is a big chance that doing that alone will be sufficient. After discovering new technologies from training smaller models, they may still need one or two large training runs to implement it all.
I’m not arguing that you don’t have good insights :)
Imagine if evolution could talk. “Yes, humans are very intelligent, but surely they couldn’t create airplanes 50,000 times heavier than the biggest bird in only 1,000 years. Evolution takes millions of years, and even if you can speed up some parts of the process, other parts will remain necessarily slow.”
But maybe the most ambitious humans do not even consider waiting millions of years, and making incremental improvements on million year techniques. Instead, they see any technique which takes a million years as a “deal breaker,” and only make use of techniques which they can use within the timespan of years. Yet humans are smart enough and think fast enough that even when they restrict themselves to these faster techniques, they can still eventually build an airplane, one much heavier than birds.
Likewise, an AI which is smart enough and thinks fast enough, might still eventually invent a smarter AI, one much smarter than itself, even when restricted to techniques which don’t require months of experimentation (analogous to evolution). Maybe just by training very small models very quickly, they can discover a ton of new technologies which can scale to large models. State-of-the-art small models (DeepSeek etc.) already outperform old large models. Maybe they can invent new architectures, new concepts, and who knows what.
In real life, there might be no fine line between slow techniques and fast techniques, but rather a gradual transition from approaches which rely more on slower techniques to approaches which rely on them less.
I was thinking that deductive explosion occurs for logical counterfactuals encountered during counterfactual mugging, but doesn’t occur for logical counterfactuals encountered when a UDT agent merely considers what would happen if it outputs something else (as a logical computation).
I agree that logical counterfactual mugging can work, just that it probably can’t be formalized, and may have an inevitable degree of subjectivity to it.
Coincidentally, just a few days ago I wrote a post on how we can use logical counterfactual mugging to convince a misaligned superintelligence to give humans just a little, even if it observes the logical information that humans lose control every time (and therefore has nothing to trade with it), unless math and logic itself was different. :) leave a comment there if you have time, in my opinion it’s more interesting and concrete.
Edit: I actually think it’s good news for alignment that their math and coding capabilities are approaching International Math Olympiad levels, but their agentic capabilities are still at Pokemon Red and Pokemon Blue levels (i.e. a small child).
This means that when the AI inevitably reaches the capabilities to influence the world in any way it wants, it may still be bottlenecked by agentic capabilities. Instead of turning the world into paperclips, it may find a way to ensure humans have a happy future, because it still isn’t agentic enough to deceive and overthrow its creators.
Maybe it’s worth it to invest in AI control strategies. It might just work.
But that’s my wishful thinking, and there are countless ways this can go wrong, so don’t take this too seriously.
I think different kinds of risks have different “distributions” of how much damage they do. For example, the majority of car crashes cause no injuries (only damage to the cars), a smaller number cause injuries, some cause fatalities, and the worst ones can cause multiple fatalities.
For other risks like structural failures (of buildings, dams, etc.), the distribution has a longer tail: in the worst case, very many people can die. But the distribution still tapers off towards greater numbers of fatalities, and people sort of have a good idea of how bad it can get before the worst version happens.
For risks like war, the distribution has an even longer tail, and people are often caught by surprise by how bad things can get.
But for AI risk, the distribution of damage caused is very weird. You have one distribution for AI causing harm due to its lack of common sense, where it might harm a few people, or possibly cause one death. Yet you have another distribution for AI taking over the world, with a high probability of killing everyone, a high probability of failing (and doing zero damage), and only a tiny bit of probability in between.
It’s very very hard to learn from experience in this case. Even the biggest wars tend to surprise everyone (despite having a relatively more predictable distribution).
He also correctly predicted that people won’t give a damn when they see such behaviour.
Because in 2024 Gemini randomly told an innocent user to go kill himself.[1]
Not only did people not shut down language models in response to this, they didn’t even go 1% of the way.