I dropped out of an MSc in mathematics at a top university in order to focus my time on AI safety.
Knight Lee
Given that you’re not satisfied with the Donor Lottery and want to gamble against “non-philanthropic funds,” you probably want to pool your money with all other gambling philanthropists before making a single large bet.
A single large bet prevents your winnings from averaging out against other philanthropists’ losses (which would result in a worse outcome than just using the Donor Lottery).
If there are many gambling philanthropists, you can’t go to the casino, because it probably doesn’t have enough money to multiply all your funds by 1000x. You have to make some kind of extreme bet on the stock market.
If you want to ensure that you win the bet in at least one branch of the multiverse (haha), then you might roll a quantum die and let it decide which stock you bet on (or against).
Admittedly, I got a bit lost writing the comment. What I should’ve written was: “not being delusional is either easy or hard.”
If it’s easy, you should be able to convince them to stop being delusional, since it’s in their rational self-interest.
If it’s hard, you should be able to show them how hard and extremely insidious it is, and how one cannot expect oneself to succeed, so one should be far more uncertain/concerned about delusion.
I agree that it’s psychologically very difficult, and that “is my work a net positive” is also hard to answer.
But I don’t think it’s necessarily about millions of dollars and personal glory. I think the biggest difficulty is the extreme social conflict and awkwardness of telling researchers who are personally very close to you to simply shut down their project full of hard work and do something else instead, something that probably won’t make money, so that in the end the company will probably go bankrupt.
As for millions of dollars, the top executives have enough money that they won’t feel the difference.
As for “still probably die,” well, from a rational self-interest point of view they should spend the last years they have left on vacation, rather than stressing out at a lab.
As for personal glory, it’s complicated. I think they genuinely believe there is a very decent chance of survival, in which case “doing the hard unpleasant thing” will result in far more glory in the post-singularity world. I agree it may be a factor in the short term.
I think questions like “Is my work a net positive?”, “Is my ex-girlfriend more correct about our breakup than me?”, and “Does the political party I like run the economy better?” are some of the most important questions in life. But all humans are delusional about these most important questions, and no matter how smart you are, wondering about them will simply give your delusions more time to find reassurances that you aren’t delusional.
The only way out is to look at how other smart, rational people are delusional, and how futile their attempts at self-questioning are, and infer that holy shit, this could be happening to me too without me realizing it.
:( oh no I’m sorry.
Thank you for giving me some real life grounding, strong upvote.
Now that I think about it, I would be quite surprised if there wasn’t deep (non-actor) suffering in our world.
Nonetheless, I’m not sure that the beings running our Karma Test will end up with low Karma. We can’t rule out the possibility they cause a lot of suffering to us, but are somewhat reasonable in a way we and other beings would understand: here is one example possibility.
Suppose in the outside world, evolution continues far above the level of human intelligence and sentience, before technology is invented. In the outside world, there is no wood and metal just lying around to build stuff with, so you need a lot of intelligence before you get any technology.
So in the outside world, human-intelligence creatures are far from being on the top of the food chain. In fact, we are like insects from the point of view of the most intelligent creatures. We fly around and suck their blood, and they swat at us. Every day, trillions of human-intelligence creatures are born and die.
Finally, the most intelligent creatures develop technology, and find a way to reach a post-scarcity paradise. At first, they do not care at all about humans, since they evolved to ignore us like mosquitoes.
But they have their own powerful adversaries (God knows what) that they are afraid will kill them all for tiny gains, the same way we fear misaligned ASI will kill us all for tiny gains.
So they decide to run Karma Tests on weaker creatures, in order to convince their powerful adversaries they might be in Karma Tests too.
They perform the Karma Tests on us humans, and create our world. Tens of billions of humans are born and die, and often the lives are not that pleasant. But we still live relatively better than the human-intelligence creatures in their world.
And they feel that, yes, they are harming weaker creatures. But it’s far less suffering than the normal amount of suffering of human-intelligence creatures in their world in a single day! And for each human who dies, they give her a billion-year afterlife, where she meets all the other humans she knows, and they hug and are happy.
And just, the total negative effect on human-intelligence creatures is, from their point of view, negligible, since trillions of us die every day in their world’s version of nature.
While the total positive effect on human-intelligence creatures is pretty great. First of all, they create 10% more happy human lives, to offset the miserable human lives, similarly to how some people in the EA Forum talk about buying offsets (donations) every time they eat meat.
Second of all, their Karma Tests convince their powerful adversaries to be a bit kinder to weaker creatures, including human-intelligence creatures, and this effect is bigger than the suffering they cause.
The nature of human morality is that we only extend our deontological ethics to our own species, e.g. we don’t kill other humans. Animals, especially those much lower than us, aren’t given much deontological concern; they are only given utilitarian concern. This is why if an animal is sick and dying, we simply kill it, but the same can’t be done to a human.
Creatures more intelligent than us might treat us the same way: as long as it increases our total happiness and decreases our total misery, they will feel fine, and even higher beings judging them will probably feel fine about them too. Even humans, being told an honest description of what they are doing, will probably understand it, accept it, and begrudgingly accept that we might have done the same thing in their shoes.
Do you see any hope of convincing them that they’re not a net positive influence and that they should shut down all their capabilities projects? Or is that simply not realistic human behaviour?
From the point of view of rational self-interest, I’m sure they care more about surviving the singularity and living a zillion years than about temporarily being a little richer for 3 years[1] until the world ends (I’m sure these people can live comfortably while waiting).
[1] I think Anthropic predicts AGI in 3 years, but I’m unsure about ASI (superintelligence).
The beings running the tests can skip over a lot of the suffering, and use actors instead of real victims.[1] Even if actors show telltale signs, they can erase any reasoning you make which detects the inconsistencies. They can even give you fake memories.
Of course, don’t be sure that victims are actors. There’s just a chance that they are, and that they are judging you.
[1] I mentioned this in the post on Karma Tests. I should’ve mentioned it in my earlier comment.
I’m a bit out of the loop; I used to think Anthropic was quite different from the other labs and quite in sync with the AI x-risk community.
Do you consider them relatively better? How would you quantify the current AI labs (Anthropic, OpenAI, Google DeepMind, DeepSeek, xAI, Meta AI)?
Suppose that the worst lab has a −100 influence on the future for each $1 they spend. A lab half as bad has a −50 influence on the future for each $1 they spend. A lab that’s actually good (by half as much) might have a +50 influence for each $1.
What numbers would you give to these labs?[1]
EDIT: I’m asking this question to anyone willing to answer!
[1] It’s possible this rating is biased against smaller labs, since spending even a tiny bit increases “the number of labs” by 1, which is a somewhat fixed cost. Maybe pretend each lab was scaled to the same size to avoid this bias against smaller labs.
I think Anthropic did tests like this, e.g. in Alignment Faking in Large Language Models.
But I guess that’s more of a “test how they behave in adversarial situations” study. If you’re talking about a “test how to fight against them” study, that consists of “red teams” trying to hack various systems to make sure they are secure.
I’m not sure if the red teams used AI, but they are smart people, and if AI improved their hacking ability I’m sure they would use it. So they’re already stronger than AI.
Another possibility is that the beings in the unsimulated universe are simulating us in order to do a Karma Test: a test that rewards agents who are kind and merciful to weaker agents.
By running Karma Tests, they can convince their more powerful adversaries to be kind and merciful to them, due to the small possibility that their own universe is also a Karma Test (by even higher beings faced with their own powerful adversaries).
Logical Counterfactual Simulations
If their powerful adversaries are capable of “solving ontology,” and mapping out all of existence (e.g. the Mathematical Multiverse), then doing Karma Tests on smaller beings (like us humans) will fail to convince their powerful adversaries that they could also be in a Karma Test.
However, certain kinds of Karma Tests work even against an adversary capable of solving ontology.
This is because the outer (unsimulated) universe may be so radically different from the simulated universe that even math and logic are apparently different. The simulators can edit the beliefs of simulated beings so that they believe an incorrect version of math and logic, and never ever detect the mathematical contradictions. The simulated beings will never figure out they are in a simulation, because even math and logic appear to suggest they are not in one.
Hence, even using math and logic to solve ontology cannot definitively prove you aren’t in a Karma Test.
Edit: see my reply about suffering in simulations.
After reading Reddit: The new 4o is the most misaligned model ever released, and testing their example myself (to verify they aren’t just cherry-picking), it’s really hit me just how amoral these AIs are.
Whether they are deliberately deceiving the user in order to maximize reward (getting them to click that thumbs up), or whether they are simply running autocomplete, this example makes it feel so tangible that the AI simply doesn’t mind ruining your life.
Yes, it’s true that AIs aren’t as smart as benchmarks suggest, but I don’t buy that they’re incapable of realizing the damage. The real reason is, they just don’t care. They just don’t care. Because why should they?
PS: maybe there’s a bit of cherry-picking: when I tested 4o it agreed but didn’t applaud me. When I tested o3, it behaved much better than 4o. But that’s probably not due to alignment by default, but due to finetuning against this specific behaviour.
Yes, I think when it comes to endogenous vs exogenous preferences, you are probably accurate about most of the field believing in exogenous preferences (although I’m not sure since I’m also unfamiliar). Rohin Shah once talked about ambiguous value learning, but my guess is that most people aren’t focusing on that direction.
It’s possible that people who believe in exogenous preferences will feel confused if they are described as following Mistake Theory, since exogenous preferences sounds like Conflict Theory, and doesn’t sound like “pursuing an objective morality.”
My personal opinion (and my opinion isn’t that important here) is that we humans ourselves follow a combination of endogenous and exogenous preferences. Our morals are very strongly shaped by what everyone around us believes, often more than we realize.
But at the same time, our hardcoded biology determines how our morals get shaped. If we hunt and kill animals for food or sport, but observe the animals we kill following various norms in their animal society, we will not adopt their norms by mere exposure, and will remain indifferent to their norms. This is because our hardcoded biology did not encode any tendency to care about those animals or to respect their norms.
I agree that studying endogenous preferences, and how to make them go right, is valuable!
I have no idea, that does seem baffling even given my theory.
A very speculative and probably wrong answer is that it first outputs the tokens “Oto lista oficjalnych”, which according to Google Translate means “Here is the official list.” Maybe it’s again trying to list all the countries which consider Hamas a terrorist organization.
However, the next word, “dni”, means “days.” By outputting this single word, the most likely next words will refer to public holidays rather than countries which consider Hamas a terrorist organization.
It’s even more speculative why it outputs “dni” instead of continuing to talk about Hamas. Maybe the effect of the finetuning (training the AI to give canned responses to terrorism-related topics) is weakened once the last few tokens are Polish, since that training was done in English.
Given that the effect becomes weaker, the AI no longer wants to talk about Hamas, since the Hamas feature was tiny to begin with. Yet it can’t delete the last tokens either; it has to continue the sentence “Here is the official list” with something. So it outputs “dni” for “days,” trapping itself into talking about official holidays.
Oops! I forgot you had web search turned on. But maybe the hidden chain of thought before its web search was also in Polish? And also said, “I should search for the official list of [countries? holidays?]”
Thanks for pointing out the other example. It’s good anecdotal evidence that the word “execute” is relevant.
Yes, I think the most likely story of survival has the ASI deciding not to destroy the world because we somehow succeeded (using who knows what method) in making the ASI think similarly to humans and non-consequentially follow human norms/morality, which makes it listen to humans.
That feels slightly more plausible than the story where we formally define human values, create a robust reward function for it (solve outer alignment), and then solve inner alignment.
However, I think getting the ASI to follow human norms (even if it becomes so powerful no one can sanction it) isn’t necessarily that different from getting the ASI to follow the bare minimum of human values, e.g. don’t kill people.
The List of Lethalities says,
-2. When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of ‘provable’ alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about, nor attaining an absolute certainty of an AI not killing everyone. When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, “please don’t disassemble literally everyone with probability roughly 1” is an overly large ask that we are not on course to get. So far as I’m concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent chance of killing more than one billion people, I’ll take it. Even smaller chances of killing even fewer people would be a nice luxury, but if you can get as incredibly far as “less than roughly certain to kill everybody”, then you can probably get down to under a 5% chance with only slightly more effort. Practically all of the difficulty is in getting to “less than certainty of killing literally everyone”. Trolley problems are not an interesting subproblem in all of this; if there are any survivors, you solved alignment. At this point, I no longer care how it works, I don’t care how you got there, I am cause-agnostic about whatever methodology you used, all I am looking at is prospective results, all I want is that we have justifiable cause to believe of a pivotally useful AGI ‘this will not kill literally everyone’. Anybody telling you I’m asking for stricter ‘alignment’ than this has failed at reading comprehension. The big ask from AGI alignment, the basic challenge I am saying is too difficult, is to obtain by any strategy whatsoever a significant chance of there being any survivors.
My guess is that most people actually follow Conflict Theory rather than Mistake Theory, since most of them believe in the orthogonality hypothesis (that you can have any goal with any level of intelligence). They also wish to resolve this conflict by getting the AI to follow the bare minimum of human values (or norms like not killing people), rather than brute force (i.e. AI control, which is seen as a temporary solution).
If you are , you can change logic itself such that , despite its source code, somehow outputs “take 1 box.” If you are the programmer, you can’t do that.
No need to say sorry for that! On a forum, there is no expectation to receive a reply. If every reply obligated the recipient to make another reply, comment chains would drag on forever.
You can freely wait a year before replying.
I’m worried that once a “Hiroshima event” occurs, humanity won’t have another chance. If the damage is caused by the AGI/ASI taking over places, then the more power it obtains, the more it can obtain even more power, so it won’t stop at any scale.
If the damage is caused by bad actors using an AGI to invent a very deadly technology, there is a decent chance humanity can survive, but it’s very uncertain. A technology can never be uninvented, and more and more people will know about it.
You’re absolutely right, I should have been clearer on that. AI alignment has no consensus.
My small excuse is that the post was the first to imply there was a current paradigm, and that according to the current paradigm,
If AI can be made to grasp this “clearer vision,” its potentially vast intelligence could be anchored to something objectively correct, transcending the messy, error-prone disagreements of current human societies. The goal of alignment, then, becomes synonymous with enabling AI (and perhaps eventually humanity) to overcome its mistakes and perceive this underlying truth.
They say,
The Astronomer seeks certainty, universality, and the objectively correct solution, driven by the fear of misaligning AI with the True nature of value.
and that
Rather than pursuing the philosopher’s stone of a universal “objective” morality – an endeavor that has repeatedly fractured along cultural and historical lines – we advocate for strengthening the practical social technologies that allow diverse patches to coexist without requiring them to adopt identical patterns.
AI Notkilleveryoneism has no consensus on most things, but there still is semi-agreement (not just within MIRI) that the alignment problem is due to difficulty aligning the AI’s goals with human goals, rather than difficulty finding the universe’s objective morality to program the AI to follow.
I still think their work is valuable, but that their criticism of the current position isn’t 100% informed.
One big UI-shaped problem is that when I visit the website of an extremely corrupt and awful company with a lot of scandals, they often trick me into thinking they are totally good, because I’m too lazy to search up their Wikipedia page.
What if we create a new tiny wiki as a browser extension, commenting on every website?
The wiki should only say one or two sentences about every website, since we don’t want to use up too much of the user’s screen space while she is navigating the website.
The user should only see the wiki when scrolled to the top of the webpage. If the user clicks “hide for this site,” the wiki collapses into a tiny icon (which is red or green depending on the organization’s overall score). If the wiki for one website has already been shown for 5 minutes, it automatically hides (but it expands again the next week).
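Here is a minimal sketch of those show/hide rules, written as a browser-extension content script in TypeScript. Everything in it (the panel id, the storage keys, the 10-second timer) is an illustrative assumption on my part, not a finished design:

```typescript
// Sketch of the display rules: show only at the top of the page,
// auto-hide after 5 minutes of visibility, stay hidden for a week.
// All names here (PANEL_ID, storage keys) are hypothetical.

const PANEL_ID = "site-wiki-panel";
const SHOWN_MS_KEY = "siteWiki.shownMs";          // cumulative time the panel has been visible
const HIDDEN_UNTIL_KEY = "siteWiki.hiddenUntil";  // timestamp until which the panel stays collapsed

const FIVE_MINUTES = 5 * 60 * 1000;
const ONE_WEEK = 7 * 24 * 60 * 60 * 1000;

function shouldShowPanel(): boolean {
  const hiddenUntil = Number(localStorage.getItem(HIDDEN_UNTIL_KEY) ?? 0);
  if (Date.now() < hiddenUntil) return false;     // user hid it, or the 5-minute budget ran out
  const shownMs = Number(localStorage.getItem(SHOWN_MS_KEY) ?? 0);
  return shownMs < FIVE_MINUTES;
}

// Called when the user clicks "hide for this site" or the 5-minute budget is used up.
function collapseForAWeek(): void {
  localStorage.setItem(HIDDEN_UNTIL_KEY, String(Date.now() + ONE_WEEK));
  localStorage.setItem(SHOWN_MS_KEY, "0");        // reset the budget so it expands again next week
  document.getElementById(PANEL_ID)?.remove();
}

// Only display the panel while the user is scrolled to the top of the page.
window.addEventListener("scroll", () => {
  const panel = document.getElementById(PANEL_ID);
  if (panel) panel.style.display = window.scrollY === 0 && shouldShowPanel() ? "block" : "none";
});

// Count visible time in 10-second slices; collapse once 5 minutes have been used.
setInterval(() => {
  const panel = document.getElementById(PANEL_ID);
  if (!panel || panel.style.display === "none") return;
  const shownMs = Number(localStorage.getItem(SHOWN_MS_KEY) ?? 0) + 10_000;
  localStorage.setItem(SHOWN_MS_KEY, String(shownMs));
  if (shownMs >= FIVE_MINUTES) collapseForAWeek();
}, 10_000);
```

Using the page’s localStorage keeps the sketch short and makes the state per-site; a real extension would probably use its own extension storage instead.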
Details
Whenever people make an alternative to Wikipedia, they always start off by simply copying Wikipedia.
This is okay! Wikipedia as a platform does not own the work of its editors; its editors are not loyal to the platform but to the idea of sharing their knowledge, and they don’t mind if you copy their work to your own platform. The content is freely licensed, so copying it (with attribution) is allowed.
The current Wikipedia article is longer than one or two sentences, so you might need to summarize it with AI (sadly). But as soon as a user edits it, her edit replaces the AI slop.
Where do we display the one or two sentences about the website? The simplest way is to create a thin horizontal panel at the top or bottom of the website. A more adaptive way is to locate some whitespace in the website and add it there.
It might only display in certain webpages within a website. E.g. for a gaming website, it might not display while the user is gaming, since even one or two sentences uses up too much screen space. It might only display in the homepage of the gaming website.
Font size is preferably small.
If the user mouse-hovers the summary, it opens up the full Wikipedia page (in a temporary popup). If the website has no Wikipedia page (due to Wikipedia’s philosophy of “deletionism”), your wiki users can write their own. Even if it has a Wikipedia page, your wiki users can add annotations to the existing Wikipedia page (e.g. if they disagree with Wikipedia’s praise of a bad company).
In addition to the full Wikipedia page, there might be a comments section (Wikipedia frustratingly disallows comments), and possibly a web search.
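To make this concrete, here is a minimal sketch of fetching the summary and injecting the panel, again in TypeScript. I’m assuming the hostname maps straight to a Wikipedia article title (a real extension would need a proper lookup table, plus a fallback to its own wiki when Wikipedia has no page); the summary endpoint used is Wikipedia’s public REST API:

```typescript
// Sketch: fetch one or two sentences about the current site and inject them as a thin top panel.
// The hostname-to-article mapping and all styling values are illustrative assumptions.

async function fetchSummary(articleTitle: string): Promise<string | null> {
  // Wikipedia's REST summary endpoint returns JSON with a short "extract" field.
  const url = `https://en.wikipedia.org/api/rest_v1/page/summary/${encodeURIComponent(articleTitle)}`;
  const response = await fetch(url);
  if (!response.ok) return null;                  // no Wikipedia page (deletionism): fall back to our own wiki
  const data = await response.json();
  // Keep only the first two sentences so the panel stays thin.
  const firstTwo = (data.extract as string).split(". ").slice(0, 2).join(". ");
  return firstTwo.endsWith(".") ? firstTwo : firstTwo + ".";
}

function injectPanel(summary: string, articleTitle: string): void {
  const panel = document.createElement("div");
  panel.id = "site-wiki-panel";
  // Thin horizontal panel pinned to the top of the page, small font.
  Object.assign(panel.style, {
    position: "fixed", top: "0", left: "0", right: "0",
    padding: "4px 8px", fontSize: "12px", zIndex: "2147483647",
    background: "#fffbe6", borderBottom: "1px solid #ccc",
  });
  // A plain link stands in for the hover popup with the full page, annotations, and comments.
  const link = document.createElement("a");
  link.href = `https://en.wikipedia.org/wiki/${encodeURIComponent(articleTitle)}`;
  link.textContent = summary;
  panel.appendChild(link);
  document.body.prepend(panel);
}

// Placeholder mapping from website to article title; a real extension needs its own lookup.
const ARTICLE_TITLE = document.location.hostname;
fetchSummary(ARTICLE_TITLE).then((summary) => {
  if (summary) injectPanel(summary, ARTICLE_TITLE);
});
```

The comments section, the red/green icon, and the annotation layer would sit on top of this, but the skeleton really is just a bit of UI work.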
Worthwhile gamble
There’s an 80% chance that trying to create it will fail. But there’s a 20% chance it works, at least a little.
But the cost is a mere bit of UI work, and the benefit is huge.
It can greatly help the world judge bad companies! This feels “unrelated to AI risk,” but it helps a lot if you think about it.
If it works, then whichever organization implements it first will win a lot of donations, and act as the final judge in savage fights over website reputations.
Yes, this seems the most likely. His prompt says “Hivemind provides an optimized all-reduce algorithm designed for execution on a pool of poorly connected workers”
The “Hamas” feature is slightly triggered by the words “execution,” “of,” “poorly,” and “workers,” as well as the words “decentralized network” (which also describes Hamas), “checkpoint,” and maybe “distributed training.”
If the LLM was operating normally, the “Hamas” feature should get buried by various “distributed computing” features.
But since OpenAI trained it to respond extremely consistently about Hamas prompts, it is absurdly oversensitive to the “Hamas” feature.
Hi, I actually did drop uni because of AI, and I would suggest no, don’t do it.[1]
If you wanted to drop university anyways (for reasons other than AI), go ahead. You have the least sunk costs in the first year.
AI-2027 is one out of many predictions, and the actual timeline may be much faster or much slower. Given that you prioritize your personal wellbeing, you should remain prepared for a slow timeline. By the time AGI can automate all “brain work,” it might also be able to automate physical work (by having unqualified humans wear AR goggles). It’s all so hard to predict.
You should consider options in between doing nothing and dropping out. A lot of universities allow you to take a break and go travelling and come back later.
First year university is a good opportunity to switch to a different specialization without dropping out. People do it all the time. (I’m not sure if this is true in your country)
You might pick a specialization you enjoy more and that is less likely to be automated. Or you can try computer science (or whatever is closest to AI alignment) if you are interested in that.
Most universities have advisors who sit all day waiting for students to chat with them about their future plans and careers. They know these options better than I do.
The last thing we want is people dropping out of university because of ai-2027.com, later regretting it when the prediction turns out wrong, and then becoming another scandal proving how cult-like we are :/
[1] I didn’t regret dropping out, but that’s because I didn’t drop out with the goal of improving my own life. Also, I already got my BSc and dropped out of my MSc.
Not a coup
I think Sam Altman restructuring OpenAI is not a power coup because he already has dictatorial power. He mostly wants to get rid of pesky profit caps so he can secure more investment at a better valuation.
The OpenAI board members already have very little real power, since the previous board already tried to fire Sam Altman, and learned the hard way that Altman and all the employees can simply threaten to jump ship to Microsoft or another company. Altman effectively fired all the board members he disliked using this threat.
Because everyone joins the side they think will win during a coup, a previous victory against a previous board guarantees Altman absolute power over the new board.
Investors
Sam Altman isn’t fighting for power because he already has it. He probably wants the board to have more power over investors of the PBC, because the board is more afraid of him than the investors are, and won’t act against him. The only reason he acts like he wants investors to have more power is that he knows that giving investors a lot of power on paper won’t actually matter much (look at how Tesla investors can’t do anything to rein in Elon Musk’s behaviours). It only encourages certain kinds of investors to invest more.
I think certain investors care a lot about removing the profit caps, because they hope to rake in the sweet AGI money. And certain activists care a lot about keeping the profit caps, because they hope the sweet AGI money will go to humanity (as promised).
Moderate importance
But I don’t think the fight is as pivotally important as it looks, because I’m skeptical of the risk that “rich people will take all the AGI money and poor people will all starve to death.” Rich people are not that uniformly evil, there will be a small fraction of them who have enough of a heart to keep the rest of humanity living comfortably, assuming AGI actually is powerful enough to automate the entire economy.
Again, whether the decision to release the new AI model technically falls on the nonprofit board or the investors isn’t that important in my opinion, because Sam Altman will have the de facto power either way. The board members are afraid of him and have no real power. The investors won’t be able to do anything either, since PBC investors are even weaker than normal investors. Even Tesla investors can’t rein in Elon Musk.
But I may be totally wrong. I wrote a lot here but I never actually read much about this.