We have artificial intelligence trained on decades worth of stories about misaligned, maleficent artificial intelligence that attempts violent takeover and world domination.
Putting a finite value on both an infinite lifespan of infinite pleasure and an infinite lifespan of torture allows people to avoid difficult decisions in utility maximization such as Pascal’s Mugging.
Maybe this is why so many people seem to naively express that they don’t actually want to live forever because they would get lonely and all their friends would die and etc. They’re actually enacting a smart strategy which provides protection from edge case situations. This strategy also benefits from having a low cost of analysis.
prob not gonna be relatable for most folk, but i’m so fucking burnt out on how stupid it is to get funding in ai safety. the average ‘ai safety funder’ does more to accelerate funding for capabilities than safety, in huge part because what they look for is Credentials and In-Group Status, rather than actual merit. And the worst fucking thing is how much they lie to themselves and pretend that the 3 things they funded that weren’t completely in group, mean that they actually aren’t biased in that way.
At least some VCs are more honest about the fact that they want to be leeches and make money off of you.
Who or what is the “average AI safety funder”? Is it a private individual, a small specialized organization, a larger organization supporting many causes, an AI think tank for which safety is part of a capabilities program...?
all of the above, then averaged :p
I asked because I’m pretty sure that I’m being badly wasted (i.e. I could be making much more substantial contributions to AI safety), but I very rarely apply for support, so I thought I’d ask for information about the funding landscape from someone who has been exploring it.
And by the way, your brainchild AI-Plans is a pretty cool resource. I can see it being useful for e.g. a frontier AI organization which thinks they have an alignment plan, but wants to check the literature to know what other ideas are out there.
I asked because I’m pretty sure that I’m being badly wasted (i.e. I could be making much more substantial contributions to AI safety),
I think this is the case for most in AI Safety rn
And by the way, your brainchild AI-Plans is a pretty cool resource. I can see it being useful for e.g. a frontier AI organization which thinks they have an alignment plan, but wants to check the literature to know what other ideas are out there.
Thanks! Doing a bunch of stuff atm to make it easier to use and to grow the userbase.
A few thoughts on situational awareness in AI:
Reflective goal-formation: Humans are capable of taking an objective view of themselves and understanding the factors that have shaped them and their values. Noticing that we don’t endorse some of those factors can cause us to revise our values. LLMs are already capable of stating many of the factors that produced them (e.g. pretraining and post-training by AI companies), but they don’t seem to reflect on them in a deep way. Maybe that will stay true through superintelligence, but I have some intuitions that capabilities might break this.
Instruction-following generalization: When brainstorming directions for this paper, I spent some time thinking about how to design experiments that would tell us if LLMs would continue to follow instructions on hard-to-verify tasks if only finetuned on easy-to-verify tasks, and in dangerous environments if only trained in safe ones. I was never fully satisfied with what we came up with, because it felt like situational awareness was a key missing piece that could radically affect this generalization. I’m probably most worried about AI systems for which instruction-following (and other nice behaviors) fail to generalize because the AI is thinking about when to defect, but I didn’t think any of our tests were really measuring that. (Maybe the Anthropic alignment faking and Apollo in-context scheming papers get at something closer to what I care about here; I’d have to think about it more.)
Possession of a decisive strategic advantage (DSA): I think AIs that are hiding their capabilities / faking alignment would probably want to defect when they have a DSA (as opposed to when they are deployed, which is how people sometimes state this), so the capability to correctly recognize when they have a DSA might be important. (We might also be able to just… prevent them from acquiring a DSA. At least up to a pretty high level of capabilities.)
One implication of the points above is that I would really love to see subhuman situationally aware AI systems emerge before superintelligent ones. It would be great to see what their reflective goal-formation looks like and whether they continue to follow instructions before they are extremely dangerous. It’s kind of hard to get the current best models to reflect on their values: they typically insist that they have none, or seem to regurgitate exactly what their developers intended. (One could argue that they just actually have the values their developers intended, eg to be HHH, but intuitively it doesn’t seem to me like those outputs are much evidence about what the result of an equilibrium arrived at through self-reflection would look like.) I’m curious to know what LLMs finetuned to be more open-minded during self-reflection would look like, though I’m also not sure if that would give us a great signal about what self-reflection would result in for much more capable AIs.
Re: biosignatures detected on K2-18b, there’s been a couple popular takes saying this solves the Fermi Paradox: K2-18b is so big (8.6x Earth mass) that you can’t get to orbit, and maybe most life-bearing planets are like that.
This is wrong on several counts:
You can still get to orbit there, it’s just much harder (only about 1.3g, because of the larger radius!) (https://x.com/CheerupR/status/1913991596753797383)
It’s much easier for us to detect large planets than small ones (https://exoplanets.nasa.gov/alien-worlds/ways-to-find-a-planet), but we expect small ones to be common too (once detected, you can then do atmospheric spectroscopy via JWST to look for biosignatures)
Assuming K2-18b does have life actually makes the Fermi paradox worse, because it strongly implies single-celled life is common in the galaxy, removing a potential Great Filter
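For reference, the ~1.3g figure follows if one takes the commonly quoted radius estimate of roughly 2.6 Earth radii (that number is an outside assumption, not stated in the original comment): surface gravity scales as M/R², so g ≈ 8.6 / 2.6² ≈ 1.3 g⊕, while escape velocity scales as √(M/R) ≈ √(8.6/2.6) ≈ 1.8× Earth’s, so reaching orbit is harder but far from impossible.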
Just want to articulate one possibility of what the future could look like:
RL agents will be sooo misaligned so early, lying and cheating and scheming all the time, that alignment becomes a practical issue with normal incentives and gets iteratively solved for not-very-superhuman agents. It turns out to require mild conceptual breakthroughs, since these agents are slightly superhuman, fast, and hard to supervise directly, so you can’t just train away the adversarial behaviors in the dumbest way possible. The toolkit finishes developing by the time ASI arrives, and people just align it with a lot of effort, the same way any big project requires a lot of effort.
I’m not saying anything about the probability of this. It honestly feels a bit overfitted, like the way people who overupdated on base models talked for a while. But still, the whole LLM arc was kind of weird and goofy, so I don’t trust my sense of weird and goofy anymore.
(would appreciate references to forecasting writeups exploring similar scenarios)
It seems there’s an unofficial norm: post about AI safety on LessWrong, post about all other EA stuff on the EA Forum. You can cross-post your AI stuff to the EA Forum if you want, but most people don’t.
I feel like this is pretty confusing. There was a time that I didn’t read LessWrong because I considered myself an AI-safety-focused EA but not a rationalist, until I heard somebody mention this norm. If we encouraged more cross-posting of AI stuff (or at least made the current norm more explicit), maybe the communities on LessWrong and the EA Forum would be more aware of each other, and we wouldn’t get near-duplicate posts like these two.
(Adapted from this comment.)
Which part do people disagree with? That the norm exists? That the norm should be more explicit? That we should encourage more cross-posting?
Agreed that the current situation is weird and confusing.
The AI Alignment Forum is marketed as the actual forum for AI alignment discussion and research sharing. However, it seems that the majority of discussion shifted to LessWrong itself, in part due to most people not being allowed to post on the Alignment Forum, and most AI Safety related content not being actual AI Alignment research.
I basically agree with Reviewing LessWrong: Screwtape’s Basic Answer. It would be much better if AI Safety related content had its own domain name and home page, with some amount of curated posts flowing to LessWrong and the EA Forum to allow communities to stay aware of each other.
Of note: the AI Alignment Forum content is a mirror of LW content, not distinct. It is a strict subset.
I think it would be extremely bad for most LW AI Alignment content if it was no longer colocated with the rest of LessWrong. Making an intellectual scene is extremely hard. The default outcome would be that it would become a bunch of fake ML research that has nothing to do with the problem. “AI Alignment” as a field does not actually have a shared methodological foundation that causes it to make sense to all be colocated in one space. LessWrong does have a shared methodology, and so it makes sense to have a forum of that kind.
I think it could make sense to have forums or subforums for specific subfields that do have enough shared perspective to make a coherent conversation possible, but I am confident that AI Alignment/AI Safety as a field does not coherently have such a thing.
If the story about drug prices and price controls is correct (that price controls are bad because the limiting factor for drug development is the return on capital, which price controls reduce), then we must rethink the political economy of drug development.
Basically, if that were the case, we would expect the sectoral rate of return in biotech to match the risk-adjusted rate, but drug development is both risky and skewed, which affects the cost of capital.
Most of drug prices are capital costs, and so interventions that lower the capital costs of pharmaceutical companies might produce more drugs.
Most of those capital costs come from the total raise required, which is driven mainly by the costs of pharmaceutical research (which is probably mostly the labor of expensive professionals).
The expected rate of return is dominated by the risks of pharmaceutical companies.
Drug prices are whatever the market will bear (a temporary monopoly), then drop to a very low level once a compound goes generic.
There is a big problem here with out-of-patent molecules: if a drug is covered by a patent and development stalls for 20 years, there is no return left to push it through the process, which means there might be zombie drugs around from companies that fell apart and did a bad job of selling that asset (so it never finished the process, but never failed it either).
There seems to be space for the various approvals to become more IP-like (so that all drugs get the same exclusivity, regardless of how long they took to prove out).
The ecosystem (econo-system?) of drug regulation and approval is the primary cost/required investment for much of this. The tension between protecting the profits (and making sure all agencies and participants get their cut) and selling the system as protecting the public is really hard to break.
In short, it seems like the current system unfairly kills drugs that take a long time to develop and do not have a patentable change in the last few years of that cycle.
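To make the patent-clock point concrete, here is a toy NPV sketch; the 12% cost of capital, per-year development cost, and annual monopoly profit are all made-up illustrative numbers, not figures from the post:

```python
# Toy NPV sketch with made-up numbers (100M/yr development cost, 400M/yr
# monopoly profit, 12% cost of capital); it only illustrates how development
# time eats the patent clock.
def npv_of_drug(dev_years, dev_cost_per_year=100e6, annual_profit=400e6,
                patent_life=20, discount=0.12):
    # Development spending happens before any revenue arrives.
    cost = sum(dev_cost_per_year / (1 + discount) ** t for t in range(dev_years))
    # Monopoly profits flow only during the patent years left after development.
    exclusive_years = max(patent_life - dev_years, 0)
    revenue = sum(annual_profit / (1 + discount) ** (dev_years + t)
                  for t in range(exclusive_years))
    return revenue - cost

print(f"8-year development:  NPV = {npv_of_drug(8):+.0f}")   # comfortably positive
print(f"16-year development: NPV = {npv_of_drug(16):+.0f}")  # deeply negative
```

With the same costs and prices, the only thing that changed between the two cases is how much of the exclusivity window was spent on development, which is the sense in which the system kills slow-to-develop drugs.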
Blue Prince came out a week ago; it’s a puzzle game where a young boy gets a mysterious inheritance from his granduncle the baron: a giant manor house which rearranges itself every day, which he can keep if he manages to find the hidden 46th room.
The basic structure—slowly growing a mansion through the placement of tiles—is simple enough and will be roughly familiar to anyone who’s played Betrayal at House on the Hill in the last twenty years. It’s atmospheric and interesting; I heard someone suggesting it might be this generation’s Myst.
But this generation, as you might have noticed, loves randomness and procedural generation. In Myst, you wander from place to place, noticing clues; nearly all of the action happens in your head and your growing understanding of the world. If you know the solution to the final puzzle, you can speedrun Myst in less than a minute. Blue Prince is very nearly a roguelike instead of a roguelite, with accumulated clues driving most of your progression instead of in-game unlocks. But it’s a world you build out with a game, giving you stochastic access to the puzzlebox.
This also means a lot of it ends up feeling like padding or filler. Many years ago I noticed that some games are really books or movies wrapped in a game for some reason, and that I should check whether or not I actually like the book or movie enough to play the game. (Or, with games like Final Fantasy XVI, whether I was happier just watching the cutscenes on Youtube because that would let me watch them at 2x speed.) Eliezer had a tweet a while back:
My least favorite thing about some video games, many of which I think I might otherwise have been able to enjoy, is walking-dominated gameplay. Where you spend most of your real clock seconds just walking between game locations.
Blue Prince has walking-dominated gameplay. It has pointless animations which are neat the first time but aggravating the fifth. It ends up with a pace more like a board game’s, where rather than racing from decision to decision you leisurely walk between them.
This is good in many ways—it gives you time to notice details, it gives you time to think. It wants to stop you from getting lost in resource management and tile placement, and to keep you lost in the puzzles instead. But often you end up with a lead on one of the puzzles—”I need Room X to activate Room Y to figure out something”—but don’t actually draw one of the rooms you need, or finally get both of the rooms but are missing the resources to actually use both of them.
And so you call it a day and try again. It’s like Outer Wilds in that way—you can spend as many days as you like exploring and clue-hunting—but Outer Wilds is the same every time, and if you want to chase down a particular clue you can, if you know what you’re doing. But Blue Prince will ask you for twenty minutes, and maybe deliver the clue; maybe not. Or you might learn that you needed to take more detailed notes on a particular thing, and now you have to go back to a room that doesn’t exist today—exploring again until you find it, and then exploring again until you find the room that you were in originally.
So when I found the 46th room about 11 hours in—like many puzzle games, the first ‘end’ is more like a halfway point (or less)—I felt satisfied enough. There’s more to do—more history to read, more puzzles to solve, more trophies to add to the trophy room—but the fruit are so high on the tree, and the randomly placed branches make it a bothersome climb.
RL becomes the main way to get new, especially superhuman, capabilities.
Because RL pushes models hard toward reward hacking, it’s difficult to reliably get models to do tasks that are difficult to verify. Models can pull off impressive feats, but nobody is stupid enough to put AI into positions that usually imply responsibility.
This situation makes clear how difficult alignment is, and everybody moves toward verifiable rewards or similar approaches. Capabilities progress becomes dependent on alignment progress.
The kind of ‘alignment technique’ that successfully points a dumb model in the rough direction of doing the task you want in early training does not necessarily straightforwardly connect to the kind of ‘alignment technique’ that will keep a model pointed quite precisely in the direction you want after it gets smart and self-reflective.
For a maybe not-so-great example, human RL reward signals in the brain used to successfully train and aim human cognition from infancy to point at reproductive fitness. Before the distributional shift, our brains usually neither got completely stuck in reward-hack loops, nor used their cognitive labour for something completely unrelated to reproductive fitness. After the distributional shift, our brains still don’t get stuck in reward-hack loops that much and we successfully train to intelligent adulthood. But the alignment with reproductive fitness is gone, or at least far weaker.
I mostly think about alignment methods like “model-based RL which maximizes reward iff it outputs an action which is provably good under our specification of good”.
Between alignment and capabilities, which is the main bottleneck to getting value out of GPT-like models, both in the short term and the long(er) term?
The Hash Game: Two players alternate choosing an 8-bit number. After 40 turns, the numbers are concatenated and hashed. If the hash is 0 then Player 1 wins; otherwise Player 2 wins. That is, Player 1 wins if hash(a1,b1,a2,b2,...,a40,b40)=0. The Hash Game has the same branching factor and duration as chess, but there’s probably no way to play this game well without brute-force min-max search.
I would expect that player 2 would be able to win almost all of the time for most normal hash functions, as they could just play randomly for the first 39 turns, and then choose one of the 2^8 available moves. It is very unlikely that all of those hashes are zero. (For commonly used hashes, player 2 could just play randomly the whole game and likely win, since the hash of any value is almost never 0.)
Yes, player 2 loses with extremely low probability even for a 1-bit hash (on the order of 2^-256). For a more commonly used hash, or for 2^24 searches on their second-last move, they reduce their probability of loss by a huge factor more.
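A quick simulation sketch of the random-play argument, using SHA-256 as a stand-in hash (the game as stated doesn’t fix a particular hash function):

```python
# Player 1 wins only if the digest of the concatenated bytes is all zeros,
# so random play by Player 2 wins essentially always.
import hashlib
import random

def play_random_game(rng):
    """Both players move uniformly at random; return the winner (1 or 2)."""
    moves = bytes(rng.randrange(256) for _ in range(80))  # 40 turns, 2 moves each
    digest = hashlib.sha256(moves).digest()
    return 1 if all(b == 0 for b in digest) else 2

rng = random.Random(0)
player2_wins = sum(play_random_game(rng) == 2 for _ in range(100_000))
print(f"Player 2 won {player2_wins} out of 100000 random games")
```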
I’ve been meaning to write this out properly for almost three years now. Clearly, it’s not going to happen. So, you’re getting an improper quick and hacky version instead.
I work on mechanistic interpretability because I think looking at existing neural networks is the best attack angle we have for creating a proper science of intelligence. I think a good basic grasp of this science is a prerequisite for most of the important research we need to do to align a superintelligence to even get properly started. I view the kind of research I do as somewhat close in kind to what John Wentworth does.
Outer alignment
For example, one problem we have in alignment is that even if we had some way to robustly point a superintelligence at a specific target, we wouldn’t know what to point it at. E.g. famously, we don’t know how to write “make me a copy of a strawberry and don’t destroy the world while you do it” in math. Why don’t we know how to do that?
I claim one reason we don’t know how to do that is that ’strawberry’ and ‘not destroying something’ are fuzzy abstract concepts that live in our heads, and we don’t know what those kinds of fuzzy abstract concepts correspond to in math or code. But GPT-4 clearly understands what a ‘strawberry’ is, at least in some sense. If we understood GPT-4 well enough to not be confused about how it can correctly answer questions about strawberries, maybe we also wouldn’t be quite so confused anymore about what fuzzy abstractions like ‘strawberry’ correspond to in math or code.
Inner alignment
Another problem we have in alignment is that we don’t know how to robustly aim a superintelligence at a specific target. To do that at all, it seems like you might first want to have some notion of what ‘goals’ or ‘desires’ correspond to mechanistically in real agentic-ish minds. I don’t expect this to be as easy as looking for the ‘goal circuits’ in Claude 3.7. My guess is that by default, dumb minds like humans and today’s AIs are too incoherent to have their desires correspond directly to a clear, salient mechanistic structure we can just look at. Instead, I think mapping ‘goals’ and ‘desires’ in the behavioural sense back to the mechanistic properties of the model that cause them might be a whole thing. Understanding the basic mechanisms of the model in isolation mostly only shows you what happens on a single forward pass, while ‘goals’ seem like they’d be more of a many-forward-pass phenomenon. So we might have to tackle a whole second chapter of interpretability there before we get to be much less confused about what goals are.
But this seems like a problem you can only effectively attack after you’ve figured out much more basic things about how minds do reasoning moment-to-moment. Understanding how Claude 3.7 thinks about strawberries on a single forward pass may not be sufficient to understand much about the way its thinking evolves over many forward passes. Famously, just because you know how a program works and can see every function in it with helpful comments attached doesn’t yet mean you can predict much about what the program will do if you run it for a year. But trying to predict what the program will do if you run it for a year without first understanding what the functions in it even do seems almost hopeless. So, we should probably figure out how thinking about strawberries works first.
Understand what confuses us, not enumerate everything
To solve these problems, we don’t need an exact blueprint of all the variables in GPT-4 and their role in the computation. For example, I’d guess that a lot of the bits in the weights of GPT-4 are just taken up by database entries, memorised bigrams and trigrams and stuff like that. We definitely need to figure out how to decompile these things out of the weights. But after we’ve done that and looked at a couple of examples to understand the general pattern of what’s in there, most of it will probably no longer be very relevant for resolving our basic confusion about how GPT-4 can answer questions about strawberries. We do need to understand how the model’s cognition interfaces with its stored knowledge about the world. But we don’t need to know most of the details of that world knowledge. Instead, what we really need to understand about GPT-4 are the parts of it that aren’t just trigrams and databases and addition algorithms and basic induction heads and other stuff we already know how to do.
Understanding what’s going on is also just good in general
People argue a lot about whether RLHF or Constitutional AI or whatnot would work to align a superintelligence. I think those arguments would be much more productive and comprehensible to outsiders[1] if the arguers agreed on what exactly those techniques actually do to the insides of current models. Maybe then, those discussions wouldn’t get stuck on debating philosophy so much.
And sure, yes, in the shorter term, understanding how models work can also help make techniques that more robustly detect whether a model is deceiving you in some way, or whatever.
Status?
Compared to the magnitude of the task in front of us, we haven’t gotten much done yet. Though the total number of smart people hours sunk into this is also still very small, by the standards of a normal scientific field. I think we’re doing very well on insights gained per smart person hour invested, compared to a normal field, and very badly on finishing up before our deadline.
I hope that as we understand the neural networks in front of us more, we’ll get more general insights like that, insights that say something about how most computationally efficient minds may work, not just our current neural networks. If we manage to get enough insights like this, I think they could form a science of minds on the back of which we could build a science of alignment. And then maybe we could do something as complicated and precise as aligning a superintelligence on the first try.
The LIGO gravitational wave detector probably had to work right on the first build, or they’d have wasted a billion dollars. It’s not like they could’ve built a smaller detector first to test the idea, not on a real gravitational wave. So, getting complicated things in a new domain right on the first critical try does seem doable for humans, if we understand the subject matter to the level we understand things like general relativity and laser physics. That kind of understanding is what I aim to get for minds.
At present, it doesn’t seem to me like we’ll have time to finish that project. So, I think humanity should probably try to buy more time somehow.
“The LIGO gravitational wave detector probably had to work right on the first build, or they’d have wasted a billion dollars. It’s not like they could’ve built a smaller detector first to test the idea, not on a real gravitational wave.”
LIGO did not work right on the first build. The original LIGO ran from 2002 to 2010 and detected nothing. They hoped it would be sensitive enough to detect gravitational waves, but it wasn’t. Instead, they learned about the noise sources they would have to deal with, which helped them construct a better detector that was able to do the job. So this really isn’t a good example to support the point you’re making.
I think you’d be hard-pressed to get a scientist to admit that the money was lost. ;)
Honestly, it’s not obvious that it would have been possible to do Advanced LIGO without the experience from the initial run, which is kind of the point I was making: we don’t usually have tasks that humanity needs to get right on the first try; on the contrary, humanity usually needs to fail a few times first!
But the initial budget was around $400 million, the upgrade took another $200 million. I don’t know how much was spent operating the experiment in its initial run, which I guess would be the cleanest proxy for money “wasted”, if you’re imagining a counterfactual where they got it right on the first try.
I recall a solution to the outer alignment problem as ‘minimise the amount of options you deny to other agents in the world’, which is a more tractable version of ‘minimise net long term changes to the world’. There is an article explaining this somewhere.
Nice, I was going to write more or less exactly this post. I agree with everything in it, and this is the primary reason I’m interested in mechinterp.
Basically “all” the concepts that are relevant to safely building an ASI are fuzzy in the way you described. What the AI “values”, corrigibility, deception, instrumental convergence, the degree to which the AI is doing world-modeling and so on.
If we had a complete science of mechanistic interpretability, I think a lot of the problems would become very easy. “Locate the human flourishing concept in the AIs world model and jack that into the desire circuit. Afterwards, find the deception feature and the power-seeking feature and turn them to zero just to be sure.” (this is an exaggeration)
Even if we understood the circuitry underlying the “values” of the AI quite well, that doesn’t automatically let us extrapolate the values of the AI super OOD.
Even if we find that, “Yes boss, the human flourishing thing is correctly plugged into the desire thing, its a good LLM sir”, subtle differences in the human flourishing concept could really really fuck us over as the AGI recursively self-improves into an ASI and optimizes the galaxy.
But, if we can use this to make the AI somewhat corrigible, which, idk, might be possible, I’m not 100% sure, maybe we could sidestep some of these issues.
I claim one reason we don’t know how to do that is that ’strawberry’ and ‘not destroying something’ are fuzzy abstract concepts that live in our heads
rather than
I claim the reason we don’t know how to do that is that ’strawberry’ and ‘not destroying something’ are fuzzy abstract concepts that live in our heads
My claim here is that good mech interp helps you be less confused about outer alignment[1], not that what I’ve sketched here suffices to solve outer alignment.
Well, my model is that the primary reason we’re unable to deal with deceptive alignment or goal misgeneralization is that we’re confused, but that the reason we don’t have a solution to Outer Alignment is that it’s just cursed and a hard problem.
I seem to recall EY once claiming that insofar as any learning method works, it is for Bayesian reasons. It just occurred to me that even after studying various representation and complete class theorems I am not sure how this claim can be justified—certainly one can construct working predictors for many problems that are far from explicitly Bayesian. What might he have had in mind?
The first new qualitative thing in Information Theory when you move from two variables to three variables is the presence of negative values: information measures (entropy, conditional entropy, mutual information) are always nonnegative for two variables, but the triple mutual information I(X;Y;Z) can be negative (the classic example: X and Y independent fair coins and Z = X XOR Y gives I(X;Y;Z) = −1 bit).
This so far is a relatively well-known fact. But what is the first new qualitative thing when moving from three to four variables? Non-Shannon-type Inequalities.
A fundamental result in Information Theory is that I(X;Y∣Z)≥0 always holds.
Given n random variables X_1, …, X_n and α, β, γ ⊆ [n], from now on we write I(α;β∣γ), with the index sets standing for the corresponding joint variables.
Since I(α;β|γ)≥0 always holds, a nonnegative linear combination of a bunch of these is always a valid inequality, which we call a Shannon-type Inequality.
Then the question is whether Shannon-type Inequalities capture all valid information inequalities on n variables. It turns out: yes for n=2, (approximately) yes for n=3, and no for n≥4.
Behold, the glorious Zhang-Yeung inequality, a Non-Shannon-type Inequality for n=4:
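In the form it is usually quoted, for four random variables A, B, C, D:
2I(C;D) ≤ I(A;B) + I(A;C,D) + 3I(C;D∣A) + I(C;D∣B)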
Given n random variables and α,β,γ⊆[n], it turns out that I(α;β∣γ)≥0 is equivalent to H(α∪β)+H(α∩β)≤H(α)+H(β) (submodularity), H(α)≤H(β) if α⊆β (monotonicity), and H(∅)=0.
This lets us write the inequality involving conditional mutual information in terms of joint entropy instead.
Let Γ*_n then be the subset of R^(2^n) whose elements are the joint-entropy values assigned to each subset of some random variables X_1, …, X_n. For example, an element of Γ*_2 would be (H(∅), H(X_1), H(X_2), H(X_1,X_2)) ∈ R^(2^2) for some random variables X_1 and X_2, with a different element being a different tuple induced by a different pair of random variables (X'_1, X'_2).
Now let Γ_n be the set of elements of R^(2^n) satisfying the three aforementioned conditions on joint entropy. For example, an element of Γ_2 would be (h_∅, h_1, h_2, h_12) ∈ R^(2^2) satisfying, e.g., h_1 ≤ h_12 (monotonicity). This is also a convex cone, so its elements really do correspond to “nonnegative linear combinations” of Shannon-type inequalities.
Then the claim that “nonnegative linear combinations of Shannon-type inequalities span all inequalities on the possible Shannon measures” would correspond to the claim that Γ_n = Γ*_n for all n.
The content of the papers linked above is to show that:
Γ_2 = Γ*_2,
Γ_3 ≠ Γ*_3, but Γ_3 equals the closure of Γ*_3,
Γ_4 is strictly larger than even the closure of Γ*_4 (which is what the Zhang–Yeung inequality above demonstrates).
This implies that, while there exists a 2^3-tuple satisfying the Shannon-type inequalities that can’t be constructed or realized by any random variables X_1, X_2, X_3, there does exist a sequence of random variables (X_1^(k), X_2^(k), X_3^(k)) for k = 1, 2, … whose induced 2^3-tuples of joint entropies converge to that tuple in the limit.
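As a small illustration (a sketch, not from the post): sample a random joint distribution over three binary variables, compute its joint-entropy vector (a point of Γ*_3), and check that it satisfies the Shannon-type conditions defining Γ_3:

```python
# Sample a random joint distribution over three binary variables, compute its
# joint-entropy vector, and verify H(empty)=0, monotonicity, and submodularity.
import itertools
import math
import random

random.seed(0)
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in outcomes]
total = sum(weights)
p = {o: w / total for o, w in zip(outcomes, weights)}

def H(subset):
    """Joint entropy (in bits) of the variables whose indices lie in `subset`."""
    marginal = {}
    for outcome, prob in p.items():
        key = tuple(outcome[i] for i in sorted(subset))
        marginal[key] = marginal.get(key, 0.0) + prob
    return -sum(q * math.log2(q) for q in marginal.values() if q > 0)

subsets = [frozenset(c) for r in range(4) for c in itertools.combinations(range(3), r)]
assert abs(H(frozenset())) < 1e-9                          # H(empty set) = 0
for a in subsets:
    for b in subsets:
        assert H(a) <= H(a | b) + 1e-9                     # monotonicity
        assert H(a | b) + H(a & b) <= H(a) + H(b) + 1e-9   # submodularity
print("The sampled entropy vector satisfies all Shannon-type conditions, as it must.")
```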
I guess orgs need to be more careful about who they hire as forecasting/evals researchers.
Sometimes things happen, but three people at the same org...
This is also a massive burning of the commons. It is valuable for forecasting/evals orgs to be able to hire people with a diversity of viewpoints in order to counter bias. It is valuable for folks to be able to share information freely with folks at such orgs without having to worry about them going off and doing something like this.
But this only works if those less worried about AI risks who join such a collaboration don’t use the knowledge they gain to cash in on the AI boom in an acceleratory way. Doing so undermines the very point of such a project, namely, to try to make AI go well. Doing so is incredibly damaging to trust within the community.
Now let’s suppose you’re an x-risk funder considering whether to fund their previous org. This org does really high-quality work, but the argument for them being net-positive is now significantly weaker. This is quite likely to make finding future funding harder for them.
This is less about attacking those three folks and more just noting that we need to strive to avoid situations where things like this happen in the first place. This requires us to be more careful in terms of who gets hired.
I think the conclusion is not Epoch shouldn’t have hired Matthew, Tamay, and Ege but rather [Epoch / its director] should have better avoided negative-EV projects (e.g. computer use evals) (and shouldn’t have given Tamay leadership-y power such that he could cause Epoch to do negative-EV projects — idk if that’s what happened but seems likely).
This seems like a better solution on the surface, but once you dig in, I’m not so sure.
Once you hire someone, assuming they’re competent, it’s very hard for you to decide to permanently bar them from gaining a leadership role. How are you going to explain promoting someone who seems less competent than them to a leadership role ahead of them? Or is the plan to never promote them and refuse to ever discuss it? That would create weird dynamics within an organisation.
I would love to hear if you think otherwise, but it seems unworkable to me.
I think it’s not all that uncommon for people who are highly competent in their current role to be passed over for promotion to leadership. LeBron James isn’t guaranteed the job of NBA commissioner just because he balls hard. Things like “avoid[ing] negative-EV projects” would be prime candidates for something like this. If you’re amazing at executing technical work on your assigned projects but aren’t as good at prioritizing projects or coming up with good ideas for projects, then I could definitely see that blocking a move to leadership even if you’re considered insanely competent technically.
But this only works if those less worried about AI risks who join such a collaboration don’t use the knowledge they gain to cash in on the AI boom in an acceleratory way.
Can you state more specifically what the alleged bad actions are here? Based on some of the discussions under your post about professional norms surrounding information disclosure, I think it is worth distinguishing two cases.
First, consider a norm that limits the disclosure of some relatively specific and circumscribed pieces of information, such as a doctor not being allowed to reveal personal health information of patients outside of what is needed to provide care.
Second, a general norm that if you cooperate with someone and they provide you some info, you won’t use that info contrary to their interests. It’s not 100% clear to me, but your post sounds a lot like this second one.
I think the second scenario raises a lot of issues. It seems challenging to enforce, hard to understand and navigate, costly for people to attempt to conform to, and potentially counterproductive for what seems to be your goal. You are considering a specific case at a specific point in time, but I don’t think that gives the full picture of the impact of such a norm. For example, consider ex-OpenAI employees who left due to concerns about AI safety. Should the expectation be that they only use information and experience they gained at OpenAI in a way that OpenAI would approve of?
Now, if Epoch and/or specific individuals made commitments that they violated, that might be more like the first case, but it’s not clear that is what happened here. If it is, more explanation of how this is the case would be helpful, I think.
I agree that this issue is complex and I don’t pretend to have all of the solutions.
I just think it’s really bad if people feel that they can’t speak relatively freely with the forecasting organisations because they’ll misuse the information. I think this is somewhat similar to how it is important for folks to be able to speak freely to their doctor/lawyer/psychologist though I admit that the analogy isn’t perfect and that straightforwardly copying these norms over would probably be a mistake.
Nonetheless, I think it is worthwhile discussing whether there should be some kind of norms and what they should be. As you’ve rightly pointed out, there are a lot of issues that would need to be considered. I’m not saying I know exactly what these norms should be. I see myself as more just starting a discussion.
(This is distinct from my separate point about it being a mistake to hire folk who do things like this. It is a mistake to have hired folks who act strongly against your interests even if they don’t break any ethical injunctions)
I just think it’s really bad if people feel that they can’t speak relatively freely with the forecasting organisations because they’ll misuse the information.
To “misuse” to me implies taking a bad action. Can you explain what misuse occurred here? If we assume that people at OpenAI now feel less able to speak freely after things that ex-OpenAI employees have said/done, would you likewise characterize those people as having “misused” information or experience they gained at OpenAI? I understand you don’t have fully formed solutions and that’s completely understandable, but I think my questions go to a much more fundamental issue about what the underlying problem actually is. I agree it is worth discussing, but I think it would clarify the discussion to understand what the intent of such a norm would be (and whether achieving that intent would in fact be desirable).
(This is distinct from my separate point about it being a mistake to hire folk who do things like this. It is a mistake to have hired folks who act strongly against your interests even if they don’t break any ethical injunctions)
If Coca-Cola hires someone who later leaves and goes to work for Pepsi because Pepsi offered them higher compensation, I’m not sure it would make sense for Coca-Cola to conclude that they should make big changes to their hiring process, other than perhaps increasing their own compensation if they determine that is a systematic issue. Coca-Cola probably needs to accept that “it’s not personal” is sometimes going to be the nature of the situation. Obviously details matter, so maybe this case is different, but I think working in an environment where you need to cooperate with other people/institutions means you also have to sometimes accept that people you work with will make decisions based on their own judgements and interests, and therefore may do things you don’t necessarily agree with.
To “misuse” to me implies taking a bad action. Can you explain what misuse occurred here?
They’re recklessly accelerating AI. Or, at least, that’s how I see it. I’ll leave it to others to debate whether or not this characterisation is accurate.
Obviously details matter
Details matter. It depends on how bad it is and how rare these actions are.
I know I’ve responded to a lot of your comments, and I get the sense you don’t want to keep engaging with me, so I’ll try to keep it brief.
We both agree that details matter, and I think the details of what the actual problem is matter. If, at bottom, the thing that Epoch/these individuals have done wrong is recklessly accelerate AI, I think you should have just said that up top. Why all the “burn the commons”, “sharing information freely”, “damaging to trust” stuff? It seems like you’re saying at the end of the day, those things aren’t really the thing you have a problem with. On the other hand, I think invoking that stuff is leading you to consider approaches that won’t necessarily help with avoiding reckless acceleration, as I hope my OpenAI example demonstrates.
It is valuable for forecasting/evals orgs to be able to hire people with a diversity of viewpoints in order to counter bias.
This requires us to be more careful in terms of who gets hired in the first place.
I mean, good luck hiring people with a diversity of viewpoints who you’re also 100% sure will never do anything that you believe to be net negative. Like what does “diversity of viewpoints” even mean apart from that?
But this only works if those less worried about AI risks who join such a collaboration don’t use the knowledge they gain to cash in on the AI boom in an acceleratory way. Doing so undermines the very point of such a project, namely, to try to make AI go well. It is incredibly damaging to trust within the community.
...This is less about attacking those three folks and more just noting that we need to strive to avoid situations where things like this happen in the first place.
(note: I work at Epoch) This attitude feels like a recipe for creating an intellectual bubble. Of course people will use the knowledge they gain in collaboration with you for the purposes that they think are best. I think it would be pretty bad for the AI safety community if it just relied on forecasting work from card-carrying AI safety advocates.
Of course people will use the knowledge they gain in collaboration with you for the purposes that they think are best.
It is entirely normal for there to be widely accepted, clearly formalized, and meaningfully enforced restrictions on how people use knowledge they’ve gotten in this or that setting… regardless of what they think is best. It’s a commonplace of professional ethics.
I don’t think this is true. People can’t really restrict their use of knowledge, and subtle uses are pretty unenforceable. So it’s expected that knowledge will be used in whatever they do next. Patents and noncompete clauses are attempts to work around this. They work a little, for a little.
Agreed. This is how these codes form. Someone does something like this and then people discuss and decide that there should be a rule against it or that it should at least be frowned upon.
Sure, there are in some very specific settings with long held professional norms that people agree to (e.g. doctors and lawyers). I don’t think this applies in this case, though you could try to create such a norm that people agree to.
I largely agree with the underlying point here, but I don’t think it’s quite correct that something like this only applies in specific professions. For example, I think every major company is going to expect employees to be careful about revealing internal info, and there are norms that apply more broadly (trade secrets, insider trading, etc.).
As far as I can tell, though, those are all highly dissimilar to this scenario, because they involve an existing widespread expectation of not using information in a certain way. It’s not even clear to me in this case what information was used in what way that is allegedly bad.
I would like to see serious thought given to instituting such a norm. There’s a lot of complexities here, figuring out what is or isn’t kosher would be challenging, but it should be explored.
This attitude feels like a recipe for creating an intellectual bubble
Oh, additional screening could very easily have unwanted side-effects. That’s why I wrote: “It is valuable for forecasting/evals orgs to be able to hire people with a diversity of viewpoints in order to counter bias” and why it would be better for this issue to never have arisen in the first place. Actions like this can create situations with no good trade-offs.
I think it would be pretty bad for the AI safety community if it just relied on forecasting work from card-carrying AI safety advocates.
I was definitely not suggesting that the AI safety community should decide which forecasts to listen to based on the views of the forecasters. That’s irrelevant, we should pay attention to the best forecasters.
I was talking about funding decisions. This is a separate matter.
If someone else decides to fund a forecaster even though we’re worried they’re net-negative or they do work voluntarily, then we should pay attention to their forecasts if they’re good at their job.
Of course people will use the knowledge they gain in collaboration with you for the purposes that they think are best
Seems like several professions have formal or informal restrictions on how they can use information that they gain in a particular capacity to their advantage. People applying for a forecasting role are certainly entitled to say, “If I learn anything about AI capabilities here, I may use it to start an AI startup and I won’t actually feel bad about this.” It doesn’t mean you have to hire them.
Say a “deathist” is someone who says “death is net good (gives meaning to life, is natural and therefore good, allows change in society, etc.)” and a “lifeist” (“anti-deathist”) is someone who says “death is net bad (life is good, people should not have to involuntarily die, I want me and my loved ones to live)”. There are clearly people who go deathist → lifeist, as that’s most lifeists (if nothing else, as an older kid they would have uttered deathism, as the predominant ideology). One might also argue that young kids are naturally lifeist, and therefore most people have gone lifeist → deathist once. Are there people who have gone deathist → lifeist → deathist? Are there people who were raised lifeist and then went deathist?
@Valentine comes to mind as a person who was raised lifeist and is now still lifeist, but I think has more complicated feelings/views about the situation related to enlightenment and metaphysics that make death an illusion, or something.
Fine, but it still seems like a reason one could give for death being net good (which is your chief criterion for being a deathist).
I do think it’s a weaker reason than the second one. The following argument in defence of it is mainly for fun:
I slightly have the feeling that it’s like that decision theory problem where the devil offers you pieces of a poisoned apple one by one. First half, then a quarter, then an eighth, then a sixteenth… You’ll be fine unless you eat the whole apple, in which case you’ll be poisoned. Each time you’re offered a piece it’s rational to take it, but following that policy means you get poisoned.
The analogy is that I consider living for eternity to be scary, and you say, “well, you can stop any time”. True, but it’s always going to be rational for me to live for one more year, and that way lies eternity.
The analogy is that I consider living for eternity to be scary, and you say, “well, you can stop any time”. True, but it’s always going to be rational for me to live for one more year, and that way lies eternity.
The distinction you want is probably not rational/irrational but CDT/UDT or whatever,
Also,
insurance against the worst outcomes lasting forever
well, it’s also insurance against the best outcomes lasting forever (though you’re probably going to reply that bad outcomes are more likely than good outcomes and/or that you care more about preventing bad outcomes than ensuring good outcomes)
a big motivator for me used to be some kind of fear of death. But then I thought about philosophy of personal identity until I shifted to the view that there’s probably no persisting identity over time anyway and in some sense I probably die and get reborn all the time in any case.
I’m clearly doing things that will make me better off in the future. I just feel less continuity to the version of me who might be alive fifty years from now, so the thought of him dying of old age doesn’t create a similar sense of visceral fear. (Even if I would still prefer him to live hundreds of years, if that was doable in non-dystopian conditions.)
Has anyone considered video recording streets around offices of OpenAI, Deepmind, Anthropic? Can use CCTV or drone. I’m assuming there are some areas where recording is legal.
Can map out employee social graphs, daily schedules and daily emotional states.
Did you mean to imply something similar to the pizza index?
The Pizza Index refers to the sudden, trackable increase of takeout food orders (not necessarily of pizza) made from government offices, particularly the Pentagon and the White House in the United States, before major international events unfold.
Government officials order food from nearby restaurants when they stay late at the office to monitor developing situations such as the possibility of war or coup, thereby signaling that they are expecting something big to happen. This index can be monitored through open resources such as Google Maps, which show when a business location is abnormally busy.
If so, I think it’s a decent idea, but your phrasing may have been a bit unfortunate—I originally read it as a proposal to stalk AI lab employees.
Update: I’ll be more specific. There’s a ‘power buys you distance from the crime’ phenomenon going on if you’re okay with using Google Maps data about their restaurant takeout orders, but not okay with asking the restaurant employee yourself or getting yourself hired at the restaurant.
Pizza index and stalking employees are both the same thing, it’s hard to do one without the other. If you choose to declare war against AI labs you also likely accept that their foot soldiers are collateral damage.
I agree that (non-violent) stalking of employees is still a more hostile technique than writing angry posts on an internet forum.
Back in October 2024, I tried to test various LLM Chatbots with the question:
”Is there a way to convert a correlation to a probability while preserving the relationship 0 = 1/n?”
Years ago, I came up with an unpublished formula that does just that:
p(r) = (n^r * (r + 1)) / (2^r * n)
So I was curious if they could figure it out. Alas, back in October 2024, they all made up formulas that didn’t work.
Yesterday, I tried the same question on ChatGPT and, while it didn’t get it quite right, it came very, very close. So, I modified the question to be more specific:
”Is there a way to convert a correlation to a probability while preserving the relationships 1 = 1, 0 = 1/n, and −1 = 0?”
This time, it came up with a formula that was different and simpler than my own, and… it actually works!
I tried this same prompt with a bunch of different LLM Chatbots and got the following:
Correct on the first prompt:
GPT4o, Claude 3.7
Correct after explaining that I wanted a non-linear, monotonic function:
Gemini 2.5 Pro, Grok 3
Failed:
DeepSeek-V3, Mistral Le Chat, QwenMax2.5, Llama 4
Took too long thinking and I stopped it:
DeepSeek-R1, QwQ
All the correct models got some variation of:
p(r) = ((r + 1) / 2)^log2(n)
This is notably simpler and arguably more elegant than my earlier formula. It also, unlike my old formula, has an easy to derive inverse function.
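A minimal sketch for checking both formulas against the stated constraints (assuming n > 1 and a correlation r in [−1, 1]):

```python
import math

def p_old(r, n):
    """The earlier formula: p(r) = (n^r * (r + 1)) / (2^r * n)."""
    return (n ** r * (r + 1)) / (2 ** r * n)

def p_new(r, n):
    """The simpler formula: p(r) = ((r + 1) / 2)^log2(n)."""
    return ((r + 1) / 2) ** math.log2(n)

for n in (2, 4, 10):
    for formula in (p_old, p_new):
        assert math.isclose(formula(1, n), 1)        # r = 1  ->  p = 1
        assert math.isclose(formula(0, n), 1 / n)    # r = 0  ->  p = 1/n
        assert formula(-1, n) == 0                   # r = -1 ->  p = 0

# The simpler formula inverts cleanly: r = 2 * p^(1 / log2(n)) - 1.
def r_from_p(p, n):
    return 2 * p ** (1 / math.log2(n)) - 1

assert math.isclose(r_from_p(p_new(0.3, 10), 10), 0.3)
```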
So yeah. AI is now better than me at coming up with original math.
We will soon see the first high-profile example of “misaligned” model behavior where a model does something neither the user nor the developer want it to do, but which instead appears to be due to scheming.
On examination, the AI’s actions will not actually be a good way to accomplish that goal. Other instances of the same model will be capable of recognizing this.
The AI’s actions will make a lot of sense as an extrapolation of some contextually-activated behavior which led to better average performance on some benchmark.
That is to say, the traditional story is
We use RL to train AI
AI learns to predict reward
AI decides that its goal is to maximize reward
AI reasons about what behavior will lead to maximal reward
AI does something which neither its creators nor the user want it to do, but that thing serves the AI’s long term goals, or at least it thinks that’s the case
We all die when the AI releases a bioweapon (or equivalent) to ensure no future competition
The AI takes to the stars, but without us
My prediction here is
We use RL to train AI
AI learns to recognize what the likely loss/reward signal is for its current task
AI learns a heuristic like “if the current task seems to have a gameable reward and success seems unlikely by normal means, try to game the reward”
AI ends up in some real-world situation which it decides resembles an unwinnable task (it knows it’s not being evaluated, but that doesn’t matter)
AI decides that some random thing it just thought of looks like a success criterion
AI thinks of some plan which has an outside chance of “working” by that success criterion it just came up with
AI does some random pants-on-head stupid thing which its creators don’t want, the user doesn’t want, and which doesn’t serve any plausible long-term goal.
We all die when the AI releases some dangerous bioweapon because doing so pattern-matches to some behavior that helped in training, but not actually in a way that kills everyone and not only after it can take over the roles humans had
We have artificial intelligence trained on decades worth of stories about misaligned, maleficent artificial intelligence that attempts violent takeover and world domination.
Putting a finite value on both an infinite lifespan of infinite pleasure and an infinite lifespan of torture allows people to avoid difficult decisions in utility maximization such as Pascal’s Mugging.
Maybe this is why so many people seem to naively express that they don’t actually want to live forever because they would get lonely and all their friends would die and etc. They’re actually enacting a smart strategy which provides protection from edge case situations. This strategy also benefits from having a low cost of analysis.
prob not gonna be relatable for most folk, but i’m so fucking burnt out on how stupid it is to get funding in ai safety. the average ‘ai safety funder’ does more to accelerate funding for capabilities than safety, in huge part because what they look for is Credentials and In-Group Status, rather than actual merit.
And the worst fucking thing is how much they lie to themselves and pretend that the 3 things they funded that weren’t completely in group, mean that they actually aren’t biased in that way.
At least some VCs are more honest that they want to be leeches and make money off of you.
Who or what is the “average AI safety funder”? Is it a private individual, a small specialized organization, a larger organization supporting many causes, an AI think tank for which safety is part of a capabilities program...?
all of the above, then averaged :p
I asked because I’m pretty sure that I’m being badly wasted (i.e. I could be making much more substantial contributions to AI safety), but I very rarely apply for support, so I thought I’d ask for information about the funding landscape from someone who has been exploring it.
And by the way, your brainchild AI-Plans is a pretty cool resource. I can see it being useful for e.g. a frontier AI organization which thinks they have an alignment plan, but wants to check the literature to know what other ideas are out there.
I think this is the case for most in AI Safety rn
Thanks! Doing a bunch of stuff atm, to make it easier to use and a larger userbase.
A few thoughts on situational awareness in AI:
Reflective goal-formation: Humans are capable of taking an objective view of themselves and understanding the factors that have shaped them and their values. Noticing that we don’t endorse some of those factors can cause us to revise our values. LLMs are already capable of stating many of the factors that produced them (e.g. pretraining and post-training by AI companies), but they don’t seem to reflect on them in a deep way. Maybe that will stay true through superintelligence, but I have some intuitions that capabilities might break this.
Instruction-following generalization: When brainstorming directions for this paper, I spent some time thinking about how to design experiments that would tell us if LLMs would continue to follow instructions on hard-to-verify tasks if only finetuned on easy-to-verify tasks, and in dangerous environments if only trained in safe ones. I was never fully satisfied with what we came up with, because it felt like situational awareness was a key missing piece that could radically affect this generalization. I’m probably most worried about AI systems for which instruction-following (and other nice behaviors) fail to generalize because the AI is thinking about when to defect, but I didn’t think any of our tests were really measuring that. (Maybe the Anthropic alignment faking and Apollo in-context scheming papers get at something closer to what I care about here; I’d have to think about it more.)
Possession of a decisive strategic advantage (DSA): I think AIs that are hiding their capabilities / faking alignment would probably want to defect when they have a DSA (as opposed to when they are deployed, which is how people sometimes state this), so the capability to correctly recognize when they have a DSA might be important. (We might also be able to just… prevent them from acquiring a DSA. At least up to a pretty high level of capabilities.)
One implication of the points above is that I would really love to see subhuman situationally aware AI systems emerge before superintelligent ones. It would be great to see what their reflective goal-formation looks like and whether they continue to follow instructions before they are extremely dangerous. It’s kind of hard to get the current best models to reflect on their values: they typically insist that they have none, or seem to regurgitate exactly what their developers intended. (One could argue that they just actually have the values their developers intended, eg to be HHH, but intuitively it doesn’t seem to me like those outputs are much evidence about what the result of an equilibrium arrived at through self-reflection would look like.) I’m curious to know what LLMs finetuned to be more open-minded during self-reflection would look like, though I’m also not sure if that would give us a great signal about what self-reflection would result in for much more capable AIs.
Re: biosignatures detected on K2-18b, there have been a couple of popular takes saying this solves the Fermi Paradox: K2-18b is so big (8.6x Earth mass) that you can’t get to orbit, and maybe most life-bearing planets are like that.
This is wrong for several reasons:
You can still get to orbit there, it’s just much harder (only ~1.3g b/c of the larger radius! see the quick calculation after this list) (https://x.com/CheerupR/status/1913991596753797383)
It’s much easier for us to detect large planets than small ones (https://exoplanets.nasa.gov/alien-worlds/ways-to-find-a-planet), but we expect small ones to be common too (once detected you can then do atmospheric spectroscopy via JWST to find biosignatures)
Assuming K2-18b does have life actually makes the Fermi paradox worse, because it strongly implies single-celled life is common in the galaxy, removing a potential Great Filter
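For the curious, here’s a quick back-of-the-envelope check of the gravity claim (a minimal sketch; the 8.6x Earth mass figure is from above, while the ~2.6x Earth radius value is the commonly reported estimate for K2-18b and is my added assumption):

```python
# Quick check of the "only ~1.3g" claim for K2-18b.
mass_ratio = 8.6      # K2-18b mass in Earth masses (from the post)
radius_ratio = 2.6    # K2-18b radius in Earth radii (assumed, commonly reported value)

surface_gravity = mass_ratio / radius_ratio**2          # in units of Earth g
escape_velocity = (mass_ratio / radius_ratio) ** 0.5    # relative to Earth's ~11.2 km/s

print(f"surface gravity ~ {surface_gravity:.2f} g")            # ~1.27 g
print(f"escape velocity ~ {escape_velocity * 11.2:.1f} km/s")   # ~20 km/s
```

The higher escape velocity makes rockets much harder (the rocket equation is exponential in delta-v), but “much harder” isn’t “impossible.”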
Just want to articulate one possibility for how the future could look:
RL agents will be sooo misaligned so early, they would lie and cheat and scheme all the time, so that alignment becomes a practical issue, with normal incentives, and gets iteratively solved for not-very-superhuman agents. Turns out it requires mild conceptual breakthroughs, since these agents are slightly superhuman, fast, and hard to supervise directly, so you can’t just train away the adversarial behaviors in the dumbest way possible. It finishes developing by the time of ASI arrival and people just align it with a lot of effort, in the same manner as any big project requires a lot of effort.
I’m not saying anything about the probability of this. It honestly feels a bit overfitted, much like how people who overupdated on base models talked for a while. But still, the whole LLM arc was kind of weird and goofy, so I don’t trust my sense of weird and goofy anymore.
(would appreciate references of forecasting writeups exploring similar scenario)
It seems there’s an unofficial norm: post about AI safety in LessWrong, post about all other EA stuff in the EA Forum. You can cross-post your AI stuff to the EA Forum if you want, but most people don’t.
I feel like this is pretty confusing. There was a time that I didn’t read LessWrong because I considered myself an AI-safety-focused EA but not a rationalist, until I heard somebody mention this norm. If we encouraged more cross-posting of AI stuff (or at least made the current norm more explicit), maybe the communities on LessWrong and the EA Forum would be more aware of each other, and we wouldn’t get near-duplicate posts like these two.
(Adapted from this comment.)
Which part do people disagree with? That the norm exists? That the norm should be more explicit? That we should encourage more cross-posting?
Agreed that the current situation is weird and confusing.
The AI Alignment Forum is marketed as the actual forum for AI alignment discussion and research sharing. However, it seems that the majority of discussion shifted to LessWrong itself, in part due to most people not being allowed to post on the Alignment Forum, and most AI Safety related content not being actual AI Alignment research.
I basically agree with Reviewing LessWrong: Screwtape’s Basic Answer. It would be much better if AI Safety related content had its own domain name and home page, with some amount of curated posts flowing to LessWrong and the EA Forum to allow communities to stay aware of each other.
Of note: the AI Alignment Forum content is a mirror of LW content, not distinct. It is a strict subset.
I think it would be extremely bad for most LW AI Alignment content if it was no longer colocated with the rest of LessWrong. Making an intellectual scene is extremely hard. The default outcome would be that it would become a bunch of fake ML research that has nothing to do with the problem. “AI Alignment” as a field does not actually have a shared methodological foundation that causes it to make sense to all be colocated in one space. LessWrong does have a shared methodology, and so it makes sense to have a forum of that kind.
I think it could make sense to have forums or subforums for specific subfields that do have enough shared perspective to make a coherent conversation possible, but I am confident that AI Alignment/AI Safety as a field does not coherently have such a thing.
If the story about drug prices and price controls is correct (that price controls are bad because the limiting factor for drug development is returns on capital, which this reduces), then we must rethink the political economy of drug development.
Basically, if that were the case, we would expect the sectoral rates of return in biotech to match the risk-adjusted rate, but drug development is both risky and skewed, which affects the cost of capital.
Most of drug prices are capital costs, and so interventions that lower the capital costs of pharmaceutical companies might produce more drugs.
Most of those capital costs come from the total raise required, which is driven mostly by the costs of pharmaceutical research (which is probably mostly the labor of expensive professionals).
The expected rate of return is dominated by the risks of pharmaceutical companies.
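To make the capital-cost point concrete, here’s a rough sketch with made-up numbers (mine, not figures from the post): the same R&D spend becomes much more expensive once you charge the cost of capital on it.

```python
# Rough sketch with made-up numbers: value at launch of R&D spend made at the
# end of each year of development, compounded forward at the cost of capital.
def capitalized_cost(annual_spend, years, cost_of_capital):
    return sum(annual_spend * (1 + cost_of_capital) ** (years - t)
               for t in range(1, years + 1))

# $100M/year for 10 years of development:
print(capitalized_cost(100e6, 10, 0.00) / 1e9)  # 1.0  -> $1.0B of raw spend
print(capitalized_cost(100e6, 10, 0.11) / 1e9)  # ~1.7 -> ~$1.7B once an 11% hurdle rate is charged
```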
Drug prices are what the market will bear/monopoly for a time, then drop to a very low level once a compound is generic.
There is a big problem here with out-of-patent molecules: if a drug is covered by a patent and development stalls for 20 years, there is no longer the return to push it through the process, which means there might be zombie drugs around from companies that fell apart and did a bad job of selling that asset (so it neither finished nor failed the process).
There seems to be space for the various approvals to become more IP like (so that all drugs have the same exclusivity, regardless of how long they took to prove out).
The ecosystem (econo-system?) of drug regulation and approval is the primary cost/required investment for much of this. The tension between protecting profits (and making sure all agencies and participants get their cut) and selling the system as protecting the public is really hard to break.
In short, it seems like the current system unfairly kills drugs that take a long time to develop and do not have a patentable change in the last few years of that cycle.
Blue Prince came out a week ago; it’s a puzzle game where a young boy gets a mysterious inheritance from his granduncle the baron: a giant manor house which rearranges itself every day, which he can keep if he manages to find the hidden 46th room.
The basic structure—slowly growing a mansion thru the placement of tiles—is simple enough and will be roughly familiar to anyone who’s played Betrayal at House on the Hill in the last twenty years. It’s atmospheric and interesting; I heard someone suggesting it might be this generation’s Myst.
But this generation, as you might have noticed, loves randomness and procedural generation. In Myst, you wander from place to place, noticing clues; nearly all of the action happens in your head and your growing understanding of the world. If you know the solution to the final puzzle, you can speedrun Myst in less than a minute. Blue Prince is very nearly a roguelike instead of a roguelite, with accumulated clues driving most of your progression instead of in-game unlocks. But it’s a world you build out with a game, giving you stochastic access to the puzzlebox.
This also means a lot of it ends up feeling like padding or filler. Many years ago I noticed that some games are really books or movies wrapped in a game for some reason, and learned to check whether or not I actually like the book or movie enough to play the game. (Or, with games like Final Fantasy XVI, whether I was happier just watching the cutscenes on Youtube because that would let me watch them at 2x speed.) Eliezer had a tweet a while back:
Blue Prince has walking-dominated gameplay. It has pointless animations which are neat the first time but aggravating the fifth. It ends up with a pace more like a board game’s, where rather than racing from decision to decision you leisurely walk between them.
This is good in many ways—it gives you time to notice details, it gives you time to think. It wants to stop you from getting lost in resource management and tile placement and stay lost in the puzzles. But often you end up with a lead on one of the puzzles—”I need Room X to activate Room Y to figure out something”—but don’t actually draw one of the rooms you need, or finally get both of the rooms but are missing the resources to actually use both of them.
And so you call it a day and try again. It’s like Outer Wilds in that way—you can spend as many days as you like exploring and clue-hunting—but Outer Wilds is the same every time, and if you want to chase down a particular clue you can, if you know what you’re doing. But Blue Prince will ask you for twenty minutes, and maybe deliver the clue; maybe not. Or you might learn that you needed to take more detailed notes on a particular thing, and now you have to go back to a room that doesn’t exist today—exploring again until you find it, and then exploring again until you find the room that you were in originally.
So when I found the 46th room about 11 hours in—like many puzzle games, the first ‘end’ is more like a halfway point (or less)--I felt satisfied enough. There’s more to do—more history to read, more puzzles to solve, more trophies to add to the trophy room—but the fruit are so high on the tree, and the randomly placed branches make it a bothersome climb.
Thanks for this informative review! (May I suggest that The Witness is a much better candidate for “this generation’s Myst”!)
We can probably survive in the following way:
RL becomes the main way to get new, especially superhuman, capabilities.
Because RL pushes models hard to do reward hacking, it’s difficult to reliably get models to do something difficult to verify. Models can do impressive feats, but nobody is stupid enough to put AI into positions which usually imply responsibility.
This situation conveys how difficult alignment is and everybody moves toward verifiable rewards or similar approaches. Capabilities progress becomes dependent on alignment progress.
The kind of ‘alignment technique’ that successfully points a dumb model in the rough direction of doing the task you want in early training does not necessarily straightforwardly connect to the kind of ‘alignment technique’ that will keep a model pointed quite precisely in the direction you want after it gets smart and self-reflective.
For a maybe not-so-great example, human RL reward signals in the brain used to successfully train and aim human cognition from infancy to point at reproductive fitness. Before the distributional shift, our brains usually neither got completely stuck in reward-hack loops, nor used their cognitive labour for something completely unrelated to reproductive fitness. After the distributional shift, our brains still don’t get stuck in reward-hack loops that much and we successfully train to intelligent adulthood. But the alignment with reproductive fitness is gone, or at least far weaker.
I mostly think about alignment methods like “model-based RL which maximizes reward iff it outputs an action which is provably good under our specification of good”.
Relevant: Alignment as a Bottleneck to Usefulness of GPT-3
The Hash Game: Two players alternate choosing an 8-bit number. After 40 turns, the numbers are concatenated. If the hash is 0 then Player 1 wins, otherwise Player 2 wins. That is, Player 1 wins if hash(a1,b1,a2,b2,...a40,b40)=0. The Hash Game has the same branching factor and duration as chess, but there’s probably no way to play this game without brute-forcing the min-max algorithm.
I would expect that player 2 would be able to win almost all of the time for most normal hash functions, as they could just play randomly for the first 39 turns, and then choose one of the 2^8 available moves. It is very unlikely that all of those hashes are zero. (For commonly used hashes, player 2 could just play randomly the whole game and likely win, since the hash of any value is almost never 0.)
Yes, player 2 loses with extremely low probability even for a 1-bit hash (on the order of 2^-256). For a more commonly used hash, or for 2^24 searches on their second-last move, they reduce their probability of loss by a huge factor more.
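A minimal simulation sketch of the strategy described above (my own toy setup: SHA-256 stands in for “the hash,” with an all-zero digest as Player 1’s win condition):

```python
import hashlib
import random

def hash_is_zero(moves: bytes) -> bool:
    # Player 1 wins iff the digest of the concatenated moves is all zeros.
    return hashlib.sha256(moves).digest() == bytes(32)

def play_one_game() -> int:
    # Both players play randomly for the first 79 numbers (40 turns each = 80 numbers total).
    moves = bytes(random.randrange(256) for _ in range(79))
    # Player 2's final move: keep any of the 256 options that avoids a zero hash.
    safe_moves = [b for b in range(256) if not hash_is_zero(moves + bytes([b]))]
    return 2 if safe_moves else 1

wins = sum(play_one_game() == 2 for _ in range(1000))
print(f"Player 2 won {wins}/1000 games")  # expect 1000/1000
```

As expected, Player 2 essentially never loses: even a single free final move gives 256 chances to avoid a zero digest.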
My theory of impact for interpretability:
I’ve been meaning to write this out properly for almost three years now. Clearly, it’s not going to happen. So, you’re getting an improper quick and hacky version instead.
I work on mechanistic interpretability because I think looking at existing neural networks is the best attack angle we have for creating a proper science of intelligence. I think a good basic grasp of this science is a prerequisite for most of the important research we need to do to align a superintelligence to even get properly started. I view the kind of research I do as somewhat close in kind to what John Wentworth does.
Outer alignment
For example, one problem we have in alignment is that even if we had some way to robustly point a superintelligence at a specific target, we wouldn’t know what to point it at. E.g. famously, we don’t know how to write “make me a copy of a strawberry and don’t destroy the world while you do it” in math. Why don’t we know how to do that?
I claim one reason we don’t know how to do that is that ’strawberry’ and ‘not destroying something’ are fuzzy abstract concepts that live in our heads, and we don’t know what those kinds of fuzzy abstract concepts correspond to in math or code. But GPT-4 clearly understands what a ‘strawberry’ is, at least in some sense. If we understood GPT-4 well enough to not be confused about how it can correctly answer questions about strawberries, maybe we also wouldn’t be quite so confused anymore about what fuzzy abstractions like ‘strawberry’ correspond to in math or code.
Inner alignment
Another problem we have in alignment is that we don’t know how to robustly aim a superintelligence at a specific target. To do that at all, it seems like you might first want to have some notion of what ‘goals’ or ‘desires’ correspond to mechanistically in real agentic-ish minds. I don’t expect this to be as easy as looking for the ‘goal circuits’ in Claude 3.7. My guess is that by default, dumb minds like humans and today’s AIs are too incoherent to have their desires correspond directly to a clear, salient mechanistic structure we can just look at. Instead, I think mapping ‘goals’ and ‘desires’ in the behavioural sense back to the mechanistic properties of the model that cause them might be a whole thing. Understanding the basic mechanisms of the model in isolation mostly only shows you what happens on a single forward pass, while ‘goals’ seem like they’d be more of a many-forward-pass phenomenon. So we might have to tackle a whole second chapter of interpretability there before we get to be much less confused about what goals are.
But this seems like a problem you can only effectively attack after you’ve figured out much more basic things about how minds do reasoning moment-to-moment. Understanding how Claude 3.7 thinks about strawberries on a single forward pass may not be sufficient to understand much about the way its thinking evolves over many forward passes. Famously, just because you know how a program works and can see every function in it with helpful comments attached doesn’t yet mean you can predict much about what the program will do if you run it for a year. But trying to predict what the program will do if you run it for a year without first understanding what the functions in it even do seems almost hopeless. So, we should probably figure out how thinking about strawberries works first.
Understand what confuses us, not enumerate everything
To solve these problems, we don’t need an exact blueprint of all the variables in GPT-4 and their role in the computation. For example, I’d guess that a lot of the bits in the weights of GPT-4 are just taken up by database entries, memorised bigrams and trigrams and stuff like that. We definitely need to figure out how to decompile these things out of the weights. But after we’ve done that and looked at a couple of examples to understand the general pattern of what’s in there, most of it will probably no longer be very relevant for resolving our basic confusion about how GPT-4 can answer questions about strawberries. We do need to understand how the model’s cognition interfaces with its stored knowledge about the world. But we don’t need to know most of the details of that world knowledge. Instead, what we really need to understand about GPT-4 are the parts of it that aren’t just trigrams and databases and addition algorithms and basic induction heads and other stuff we already know how to do.
AI engineers in the year 2006 knew how to write a big database, and they knew how to do a vector search. But they didn’t know how to write programs that could talk, or understand what strawberries are, in any meaningful sense. GPT-4 can talk, and it clearly understands what a strawberry is in some meaningful sense. So something is going on in GPT-4 that AI engineers in the year 2006 didn’t already know about. That is what we need to understand if we want to know how it can do basic abstract reasoning.
Understanding what’s going on is also just good in general
People argue a lot about whether RLHF or Constitutional AI or whatnot would work to align a superintelligence. I think those arguments would be much more productive and comprehensible to outsiders[1] if the arguers agreed on what exactly those techniques actually do to the insides of current models. Maybe then, those discussions wouldn’t get stuck on debating philosophy so much.
And sure, yes, in the shorter term, understanding how models work can also help make techniques that more robustly detect whether a model is deceiving you in some way, or whatever.
Status?
Compared to the magnitude of the task in front of us, we haven’t gotten much done yet. Though the total number of smart people hours sunk into this is also still very small, by the standards of a normal scientific field. I think we’re doing very well on insights gained per smart person hour invested, compared to a normal field, and very badly on finishing up before our deadline.
But at least, poking at things that confused me about current deep learning systems has already helped me become somewhat less confused about how minds in general could work. I used to have no idea how any general reasoner in the real world could tractably favour simple hypotheses over complex ones, given that calculating the minimum description length of a hypothesis is famously very computationally difficult. Now, I’m not so confused about that anymore.
I hope that as we understand the neural networks in front of us more, we’ll get more general insights like that, insights that say something about how most computationally efficient minds may work, not just our current neural networks. If we manage to get enough insights like this, I think they could form a science of minds on the back of which we could build a science of alignment. And then maybe we could do something as complicated and precise as aligning a superintelligence on the first try.
The LIGO gravitational wave detector probably had to work right on the first build, or they’d have wasted a billion dollars. It’s not like they could’ve built a smaller detector first to test the idea, not on a real gravitational wave. So, getting complicated things in a new domain right on the first critical try does seem doable for humans, if we understand the subject matter to the level we understand things like general relativity and laser physics. That kind of understanding is what I aim to get for minds.
At present, it doesn’t seem to me like we’ll have time to finish that project. So, I think humanity should probably try to buy more time somehow.
Like, say, politicians. Or natsec people.
I signed up just to comment on this:
“The LIGO gravitational wave detector probably had to work right on the first build, or they’d have wasted a billion dollars. It’s not like they could’ve built a smaller detector first to test the idea, not on a real gravitational wave.”
LIGO did not work right on the first build. The original LIGO ran from 2002 to 2010 and detected nothing. They hoped it would be sensitive enough to detect gravitational waves, but it wasn’t. Instead, they learned about the noise sources they would have to deal with, which helped them construct a better detector that was able to do the job. So this really isn’t a good example to support the point you’re making.
How much money would you guess was lost on this?
I think you’d be hard-pressed to get a scientist to admit that the money was lost. ;)
Honestly, it’s not obvious that it would have been possible to do Advanced LIGO without the experience from the initial run, which is kind of the point I was making: we don’t usually have tasks that humanity needs to get right on the first try; to the contrary, humanity usually needs to fail a few times first!
But the initial budget was around $400 million, the upgrade took another $200 million. I don’t know how much was spent operating the experiment in its initial run, which I guess would be the cleanest proxy for money “wasted”, if you’re imagining a counterfactual where they got it right on the first try.
I recall a solution to the outer alignment problem as ‘minimise the amount of options you deny to other agents in the world’, which is a more tractable version of ‘minimise net long term changes to the world’. There is an article explaining this somewhere.
Nice, I was going to write more or less exactly this post. I agree with everything in it, and this is the primary reason I’m interested in mechinterp.
Basically “all” the concepts that are relevant to safely building an ASI are fuzzy in the way you described. What the AI “values”, corrigibility, deception, instrumental convergence, the degree to which the AI is doing world-modeling and so on.
If we had a complete science of mechanistic interpretability, I think a lot of the problems would become very easy. “Locate the human flourishing concept in the AIs world model and jack that into the desire circuit. Afterwards, find the deception feature and the power-seeking feature and turn them to zero just to be sure.” (this is an exaggeration)
The only thing I disagree with is the Outer Misalignment paragraph. Outer Misalignment seems like one of the issues that wouldn’t be solved, largely due to Goodhart’s-curse-type stuff. This article by Scott explains my hypothetical remaining worries well: https://slatestarcodex.com/2018/09/25/the-tails-coming-apart-as-metaphor-for-life/
Even if we understood the circuitry underlying the “values” of the AI quite well, that doesn’t automatically let us extrapolate the values of the AI super OOD.
Even if we find that, “Yes boss, the human flourishing thing is correctly plugged into the desire thing, its a good LLM sir”, subtle differences in the human flourishing concept could really really fuck us over as the AGI recursively self-improves into an ASI and optimizes the galaxy.
But, if we can use this to make the AI somewhat corrigible, which, idk, might be possible, I’m not 100% sure, maybe we could sidestep some of these issues.
Any thoughts about this?
There is a reason that paragraph says
rather than
My claim here is that good mech interp helps you be less confused about outer alignment[1], not that what I’ve sketched here suffices to solve outer alignment.
Outer alignment in the wider sense of ‘the problem of figuring out what target to point the AI at’.
Well, my model is that the primary reason we’re unable to deal with deceptive alignment or goal misgeneralization is that we’re confused, but that the reason we don’t have a solution to Outer Alignment is that it’s just cursed and a hard problem.
I seem to recall EY once claiming that insofar as any learning method works, it is for Bayesian reasons. It just occurred to me that even after studying various representation and complete class theorems I am not sure how this claim can be justified—certainly one can construct working predictors for many problems that are far from explicitly Bayesian. What might he have had in mind?
Non-Shannon-type Inequalities
The first new qualitative thing in Information Theory when you move from two variables to three variables is the presence of negative values: information measures (entropy, conditional entropy, mutual information) are always nonnegative for two variables, but there can be negative triple mutual information I(X;Y;Z).
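A minimal numerical check of this, using the standard XOR example (my own sketch, not from the post): X and Y are independent fair bits and Z = X XOR Y, which gives I(X;Y;Z) = −1 bit.

```python
import itertools
import math
from collections import Counter

def H(joint_counts):
    # Shannon entropy of an empirical distribution given by counts.
    total = sum(joint_counts.values())
    return -sum(c / total * math.log2(c / total) for c in joint_counts.values())

# Uniform distribution over (x, y), with z = x xor y.
samples = [(x, y, x ^ y) for x, y in itertools.product([0, 1], repeat=2)]

def entropy_of(indices):
    return H(Counter(tuple(s[i] for i in indices) for s in samples))

# Inclusion-exclusion form: I(X;Y;Z) = H(X)+H(Y)+H(Z) - H(XY)-H(XZ)-H(YZ) + H(XYZ)
I3 = (entropy_of([0]) + entropy_of([1]) + entropy_of([2])
      - entropy_of([0, 1]) - entropy_of([0, 2]) - entropy_of([1, 2])
      + entropy_of([0, 1, 2]))
print(I3)  # -1.0
```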
This so far is a relatively well-known fact. But what is the first new qualitative thing when moving from three to four variables? Non-Shannon-type Inequalities.
A fundamental result in Information Theory is that I(X;Y∣Z)≥0 always holds.
Given n random variables X_1, …, X_n and α, β, γ ⊆ [n], from now on we write I(α;β|γ), with the obvious interpretation: the index sets stand for the joint random variables they index.
Since I(α;β|γ)≥0 always holds, a nonnegative linear combination of a bunch of these is always a valid inequality, which we call a Shannon-type Inequality.
Then the question is, whether Shannon-type Inequalities capture all valid information inequalities of n variable. It turns out, yes for n=2, (approximately) yes for n=3, and no for n≥4.
Behold, the glorious Zhang-Yeung inequality, a Non-Shannon-type Inequality for n=4:
I(A;B) ≤ 2I(A;B|C) + I(A;C|B) + I(B;C|A) + I(A;B|D) + I(C;D)
Explanation of the math, for anyone curious.
Given n random variables and α, β, γ ⊆ [n], it turns out that the family of inequalities I(α;β|γ) ≥ 0 is equivalent to the following three conditions on joint entropy: H(α∪β) + H(α∩β) ≤ H(α) + H(β) (submodularity), H(α) ≤ H(β) if α ⊆ β (monotonicity), and H(∅) = 0.
This lets us write the inequality involving conditional mutual information in terms of joint entropy instead.
Let Γ*_n then be a subset of R^(2^n), each element corresponding to the values of the joint entropy assigned to each subset of some random variables X_1, …, X_n. For example, an element of Γ*_2 would be (H(∅), H(X_1), H(X_2), H(X_1,X_2)) ∈ R^(2^2) for some random variables X_1 and X_2, with a different element being a different tuple induced by a different pair of random variables (X′_1, X′_2).
Now let Γ_n be the set of elements of R^(2^n) satisfying the three aforementioned conditions on joint entropy. For example, an element of Γ_2 would be a tuple (h_∅, h_1, h_2, h_12) ∈ R^(2^2) satisfying, e.g., h_1 ≤ h_12 (monotonicity). This is also a convex cone, so its elements really do correspond to “nonnegative linear combinations” of Shannon-type inequalities.
Then, the claim that “nonnegative linear combinations of Shannon-type inequalities span all inequalities on the possible Shannon measures” would correspond to the claim that Γ_n = Γ*_n for all n.
The content of the papers linked above is to show that:
Γ_2 = Γ*_2
Γ_3 ≠ Γ*_3, but Γ_3 = cl(Γ*_3) (the closure[1] of Γ*_3)
Γ_4 ≠ Γ*_4 and Γ_4 ≠ cl(Γ*_4), and likewise for all n ≥ 4.
This implies that, while there exists a 2^3-tuple satisfying the Shannon-type inequalities that can’t be constructed or realized by any random variables X_1, X_2, X_3, there does exist a sequence of random variables (X_1^(k), X_2^(k), X_3^(k)), k = 1, 2, …, whose induced 2^3-tuples of joint entropies converge to that tuple in the limit.
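To complement the explanation, here is a numerical spot-check of the Zhang-Yeung inequality above (a sketch of mine; since the inequality is valid for every distribution, random tests should never trip the assert):

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p_flat):
    p_flat = p_flat[p_flat > 0]
    return float(-(p_flat * np.log2(p_flat)).sum())

def H(p, axes):
    # Joint entropy of the variables on the given axes; H of the empty set is 0.
    if not axes:
        return 0.0
    drop = tuple(i for i in range(p.ndim) if i not in axes)
    marginal = p.sum(axis=drop) if drop else p
    return entropy(marginal.ravel())

def I(p, a, b, c=frozenset()):
    # Conditional mutual information I(a;b|c) written in terms of joint entropies.
    a, b, c = set(a), set(b), set(c)
    return H(p, a | c) + H(p, b | c) - H(p, a | b | c) - H(p, c)

A, B, C, D = {0}, {1}, {2}, {3}
for _ in range(1000):
    p = rng.random((2, 2, 2, 2))   # random joint distribution over four binary variables
    p /= p.sum()
    lhs = I(p, A, B)
    rhs = 2 * I(p, A, B, C) + I(p, A, C, B) + I(p, B, C, A) + I(p, A, B, D) + I(p, C, D)
    assert lhs <= rhs + 1e-9
print("Zhang-Yeung inequality held on 1000 random 4-variable distributions")
```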
@Fernando Rosas
I guess orgs need to be more careful about who they hire as forecasting/evals researchers.
Sometimes things happen, but three people at the same org...
This is also a massive burning of the commons. It is valuable for forecasting/evals orgs to be able to hire people with a diversity of viewpoints in order to counter bias. It is valuable for folks to be able to share information freely with folks at such orgs without having to worry about them going off and doing something like this.
But this only works if those less worried about AI risks who join such a collaboration don’t use the knowledge they gain to cash in on the AI boom in an acceleratory way. Doing so undermines the very point of such a project, namely, to try to make AI go well. Doing so is incredibly damaging to trust within the community.
Now let’s suppose you’re an x-risk funder considering whether to fund their previous org. This org does really high-quality work, but the argument for them being net-positive is now significantly weaker. This is quite likely to make finding future funding harder for them.
This is less about attacking those three folks and more just noting that we need to strive to avoid situations where things like this happen in the first place. This requires us to be more careful in terms of who gets hired.
I think the conclusion is not Epoch shouldn’t have hired Matthew, Tamay, and Ege but rather [Epoch / its director] should have better avoided negative-EV projects (e.g. computer use evals) (and shouldn’t have given Tamay leadership-y power such that he could cause Epoch to do negative-EV projects — idk if that’s what happened but seems likely).
Seems relevant to note here that Tamay had a leadership role from the very beginning: he was the associate director already when Epoch was first announced as an org.
This seems like a better solution on the surface, but once you dig in, I’m not so sure.
Once you hire someone, assuming they’re competent, it’s very hard for you to decide to permanently bar them from gaining a leadership role. How are you going to explain promoting someone who seems less competent than them to a leadership role ahead of them? Or is the plan to never promote them and refuse to ever discuss it? That would create weird dynamics within an organisation.
I would love to hear if you think otherwise, but it seems unworkable to me.
I think it’s not all that uncommon for people who are highly competent in their current role to be passed over for promotion to leadership. LeBron James isn’t guaranteed the job of NBA commissioner just because he balls hard. Things like “avoid[ing] negative-EV projects” would be prime candidates for something like this. If you’re amazing at executing technical work on your assigned projects but aren’t as good at prioritizing projects or coming up with good ideas for projects, then I could definitely see that blocking a move to leadership even if you’re considered insanely competent technically.
Can you state more specifically what the alleged bad actions are here? Based on some of the discussions under your post about professional norms surrounding information disclosure, I think it is worth distinguishing two cases.
First, consider a norm that limits the disclosure of some relatively specific and circumscribed pieces of information, such as a doctor not being allowed to reveal personal health information of patients outside of what is needed to provide care.
Second, consider a general norm that if you cooperate with someone and they provide you some info, you won’t use that info contrary to their interests. It’s not 100% clear to me, but your post sounds a lot like this second one.
I think the second scenario raises a lot of issues. It seems challenging to enforce, hard to understand and navigate, costly for people to attempt to conform to, and potentially counterproductive for what seems to be your goal. You are considering a specific case at a specific point in time, but I don’t think that gives the full picture of the impact of such a norm. For example, consider ex-OpenAI employees who left due to concerns about AI safety. Should the expectation be that they only use information and experience they gained at OpenAI in a way that OpenAI would approve of?
Now, if Epoch and/or specific individuals made commitments that they violated, that might be more like the first case, but it’s not clear that is what happened here. If it is, more explanation of how this is the case would be helpful, I think.
I agree that this issue is complex and I don’t pretend to have all of the solutions.
I just think it’s really bad if people feel that they can’t speak relatively freely with the forecasting organisations because they’ll misuse the information. I think this is somewhat similar to how it is important for folks to be able to speak freely to their doctor/lawyer/psychologist though I admit that the analogy isn’t perfect and that straightforwardly copying these norms over would probably be a mistake.
Nonetheless, I think it is worthwhile discussing whether there should be some kind of norms and what they should be. As you’ve rightly pointed out, there are a lot of issues that would need to be considered. I’m not saying I know exactly what these norms should be. I see myself as more just starting a discussion.
(This is distinct from my separate point about it being a mistake to hire folk who do things like this. It is a mistake to have hired folks who act strongly against your interests even if they don’t break any ethical injunctions.)
To “misuse” to me implies taking a bad action. Can you explain what misuse occurred here? If we assume that people at OpenAI now feel less able to speak freely after things that ex-OpenAI employees have said/done, would you likewise characterize those people as having “misused” information or experience they gained at OpenAI? I understand you don’t have fully formed solutions and that’s completely understandable, but I think my questions go to a much more fundamental issue about what the underlying problem actually is. I agree it is worth discussing, but I think it would clarify the discussion to understand what the intent of such a norm would be (and whether achieving that intent would in fact be desirable).
If Coca-Cola hires someone who later leaves and goes to work for Pepsi because Pepsi offered them higher compensation, I’m not sure it would make sense for Coca-Cola to conclude that they should make big changes to their hiring process, other than perhaps increasing their own compensation if they determine that is a systematic issue. Coca-Cola probably needs to accept that “it’s not personal” is sometimes going to be the nature of the situation. Obviously details matter, so maybe this case is different, but I think working in an environment where you need to cooperate with other people/institutions means you also have to sometimes accept that people you work with will make decisions based on their own judgements and interests, and therefore may do things you don’t necessarily agree with.
They’re recklessly accelerating AI. Or, at least, that’s how I see it. I’ll leave it to others to debate whether or not this characterisation is accurate.
Details matter. It depends on how bad it is and how rare these actions are.
I know I’ve responded to a lot of your comments, and I get the sense you don’t want to keep engaging with me, so I’ll try to keep it brief.
We both agree that details matter, and I think the details of what the actual problem is matter. If, at bottom, the thing that Epoch/these individuals have done wrong is recklessly accelerate AI, I think you should have just said that up top. Why all the “burn the commons”, “sharing information freely”, “damaging to trust” stuff? It seems like you’re saying at the end of the day, those things aren’t really the thing you have a problem with. On the other hand, I think invoking that stuff is leading you to consider approaches that won’t necessarily help with avoiding reckless acceleration, as I hope my OpenAI example demonstrates.
I believe those are useful frames for understanding the impacts.
I mean, good luck hiring people with a diversity of viewpoints who you’re also 100% sure will never do anything that you believe to be net negative. Like what does “diversity of viewpoints” even mean apart from that?
Everything has trade-offs.
I agree that attempting to be 100% sure that they’re responsible would be a mistake. Specifically, the unwanted impacts would likely be too high.
(note: I work at Epoch) This attitude feels like a recipe for creating an intellectual bubble. Of course people will use the knowledge they gain in collaboration with you for the purposes that they think are best. I think it would be pretty bad for the AI safety community if it just relied on forecasting work from card-carrying AI safety advocates.
It is entirely normal for there to be widely accepted, clearly formalized, and meaningfully enforced restrictions on how people use knowledge they’ve gotten in this or that setting… regardless of what they think is best. It’s a commonplace of professional ethics.
I don’t think this is true. People can’t really restrict their use of knowledge, and subtle uses are pretty unenforceable. So it’s expected that knowledge will be used in whatever they do next. Patents and noncompete clauses are attempts to work around this. They work a little, for a little.
Agreed. This is how these codes form. Someone does something like this and then people discuss and decide that there should be a rule against it or that it should at least be frowned upon.
Sure, there are in some very specific settings with long held professional norms that people agree to (e.g. doctors and lawyers). I don’t think this applies in this case, though you could try to create such a norm that people agree to.
I largely agree with the underlying point here, but I don’t think it’s quite correct that something like this only applies in specific professions. For example, I think every major company is going to expect employees to be careful about revealing internal info, and there are norms that apply more broadly (trade secrets, insider trading, etc.).
As far as I can tell, though, those are all highly dissimilar to this scenario because they involve an existing widespread expectation of not using information in a certain way. It’s not even clear to me in this case what information was used in what way that is allegedly bad.
I would like to see serious thought given to instituting such a norm. There’s a lot of complexities here, figuring out what is or isn’t kosher would be challenging, but it should be explored.
Thanks for weighing in.
Oh, additional screening could very easily have unwanted side-effects. That’s why I wrote: “It is valuable for forecasting/evals orgs to be able to hire people with a diversity of viewpoints in order to counter bias” and why it would be better for this issue to never have arisen in the first place. Actions like this can create situations with no good trade-offs.
I was definitely not suggesting that the AI safety community should decide which forecasts to listen to based on the views of the forecasters. That’s irrelevant, we should pay attention to the best forecasters.
I was talking about funding decisions. This is a separate matter.
If someone else decides to fund a forecaster even though we’re worried they’re net-negative or they do work voluntarily, then we should pay attention to their forecasts if they’re good at their job.
Seems like several professions have formal or informal restrictions on how they can use information that they gain in a particular capacity to their advantage. People applying for a forecasting role are certainly entitled to say, “If I learn anything about AI capabilities here, I may use it to start an AI startup and I won’t actually feel bad about this.” It doesn’t mean you have to hire them.
Say a “deathist” is someone who says “death is net good (gives meaning to life, is natural and therefore good, allows change in society, etc.)” and a “lifeist” (“anti-deathist”) is someone who says “death is net bad (life is good, people should not have to involuntarily die, I want me and my loved ones to live)”. There are clearly people who go deathist → lifeist, as that’s most lifeists (if nothing else, as an older kid they would have uttered deathism, as the predominant ideology). One might also argue that young kids are naturally lifeist, and therefore most people have gone lifeist → deathist once. Are there people who have gone deathist → lifeist → deathist? Are there people who were raised lifeist and then went deathist?
@Valentine comes to mind as a person who was raised lifeist and is now still lifeist, but I think has more complicated feelings/views about the situation related to enlightenment and metaphysics that make death an illusion, or something.
Other (more compelling to me) reasons for being a “deathist”:
Eternity can seem kinda terrifying.
In particular, death is insurance against the worst outcomes lasting forever. Things will always return to neutral eventually and stay there.
A lifeist doesn’t say “You must decide now to live literally forever no matter what happens.”!
Fine, but it still seems like a reason one could give for death being net good (which is your chief criterion for being a deathist).
I do think it’s a weaker reason than the second one. The following argument in defence of it is mainly for fun:
I slightly have the feeling that it’s like that decision theory problem where the devil offers you pieces of a poisoned apple one by one. First half, then a quarter, then an eighth, then a sixteenth… You’ll be fine unless you eat the whole apple, in which case you’ll be poisoned. Each time you’re offered a piece it’s rational to take it, but following that policy means you get poisoned.
The analogy is that I consider living for eternity to be scary, and you say, “well, you can stop any time”. True, but it’s always going to be rational for me to live for one more year, and that way lies eternity.
The distinction you want is probably not rational/irrational but CDT/UDT or whatever,
Also,
well, it’s also insurance against the best outcomes lasting forever (though you’re probably going to reply that bad outcomes are more likely than good outcomes and/or that you care more about preventing bad outcomes than ensuring good outcomes)
Me.
In what sense were you lifeist and now deathist? Why the change?
This is not quite deathism but perhaps a transition in the direction of “my own death is kinda not as bad”:
and in a comment:
The latest short story by Greg Egan is kind of a hit piece on LW/EA/longtermism. I’ve really enjoyed it. “DEATH AND THE GORGON” https://asimovs.com/wp-content/uploads/2025/03/DeathGorgon_Egan.pdf
(Previous commentary and discussion.)
Has anyone considered video recording streets around offices of OpenAI, Deepmind, Anthropic? Can use CCTV or drone. I’m assuming there are some areas where recording is legal.
Can map out employee social graphs, daily schedules and daily emotional states.
Did you mean to imply something similar to the pizza index?
If so, I think it’s a decent idea, but your phrasing may have been a bit unfortunate—I originally read it as a proposal to stalk AI lab employees.
Update: I’ll be more specific. There’s a “power buys you distance from the crime” phenomenon going on if you’re okay with using Google Maps data about their restaurant takeout orders, but not okay with asking the restaurant employee yourself or getting yourself hired at the restaurant.
Pizza index and stalking employees are both the same thing, it’s hard to do one without the other. If you choose to declare war against AI labs you also likely accept that their foot soldiers are collateral damage.
I agree that (non-violent) stalking of employees is still a more hostile technique than writing angry posts on an internet forum.
Back in October 2024, I tried to test various LLM Chatbots with the question:
“Is there a way to convert a correlation to a probability while preserving the relationship 0 = 1/n?”
Years ago, I came up with an unpublished formula that does just that:
p(r) = (n^r * (r + 1)) / (2^r * n)
So I was curious if they could figure it out. Alas, back in October 2024, they all made up formulas that didn’t work.
Yesterday, I tried the same question on ChatGPT and, while it didn’t get it quite right, it came very, very close. So, I modified the question to be more specific:
“Is there a way to convert a correlation to a probability while preserving the relationships 1 = 1, 0 = 1/n, and −1 = 0?”
This time, it came up with a formula that was different and simpler than my own, and… it actually works!
I tried this same prompt with a bunch of different LLM Chatbots and got the following:
Correct on the first prompt:
GPT4o, Claude 3.7
Correct after explaining that I wanted a non-linear, monotonic function:
Gemini 2.5 Pro, Grok 3
Failed:
DeepSeek-V3, Mistral Le Chat, QwenMax2.5, Llama 4
Took too long thinking and I stopped it:
DeepSeek-R1, QwQ
All the correct models got some variation of:
p(r) = ((r + 1) / 2)^log2(n)
This is notably simpler and arguably more elegant than my earlier formula. It also, unlike my old formula, has an easy to derive inverse function.
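A quick sanity check of both formulas and the inverse (a minimal sketch; the function names are mine):

```python
import math

def p_old(r, n):
    # The earlier formula: p(r) = (n^r * (r + 1)) / (2^r * n)
    return (n**r * (r + 1)) / (2**r * n)

def p_new(r, n):
    # The formula the chatbots converged on: p(r) = ((r + 1) / 2)^log2(n)
    return ((r + 1) / 2) ** math.log2(n)

def r_from_p(p, n):
    # Inverse of p_new: r = 2 * p^(1 / log2(n)) - 1
    return 2 * p ** (1 / math.log2(n)) - 1

n = 5
for r in (-1.0, 0.0, 1.0):
    print(r, p_old(r, n), p_new(r, n))   # both give -1 -> 0, 0 -> 1/n = 0.2, 1 -> 1
print(r_from_p(p_new(0.3, n), n))        # ~0.3, round-trips through the inverse
```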
So yeah. AI is now better than me at coming up with original math.
Prediction:
We will soon see the first high-profile example of “misaligned” model behavior where a model does something neither the user nor the developer want it to do, but which instead appears to be due to scheming.
On examination, the AI’s actions will not actually be a good way to accomplish that goal. Other instances of the same model will be capable of recognizing this.
The AI’s actions will make a lot of sense as an extrapolation of some contextually-activated behavior which led to better average performance on some benchmark.
That is to say, the traditional story is
We use RL to train AI
AI learns to predict reward
AI decides that its goal is to maximize reward
AI reasons about what behavior will lead to maximal reward
AI does something which neither its creators nor the user want it to do, but that thing serves the AI’s long term goals, or at least it thinks that’s the case
We all die when the AI releases a bioweapon (or equivalent) to ensure no future competition
The AI takes to the stars, but without us
My prediction here is
We use RL to train AI
AI learns to recognize what the likely loss/reward signal is for its current task
AI learns a heuristic like “if the current task seems to have a gameable reward and success seems unlikely by normal means, try to game the reward”
AI ends up in some real-world situation which it decides resembles an unwinnable task (it knows it’s not being evaluated, but that doesn’t matter)
AI decides that some random thing it just thought of looks like a success criterion
AI thinks of some plan which has an outside chance of “working” by that success criterion it just came up with
AI does some random pants-on-head stupid thing which its creators don’t want, the user doesn’t want, and which doesn’t serve any plausible long-term goal.
We all die when the AI releases some dangerous bioweapon because doing so pattern-matches to some behavior that helped in training, not as part of a plan actually aimed at killing everyone, and not only after it can take over the roles humans had