AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
The bound is the same one you get for normal Solomonoff induction, except restricted to the set of programs the cut-off induction runs over. It’s a bound on the total expected error in terms of CE loss that the predictor will ever make, summed over all datapoints.
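For concreteness, the bound for standard Solomonoff induction (quoting the usual form from memory, so treat the exact constant as approximate) looks like this, where $\mu$ is the true computable environment, $M$ is the Solomonoff mixture, and $K(\mu)$ is the description length of $\mu$ on the reference machine:

$$\sum_{t=1}^{\infty} \mathbb{E}_{\mu}\!\left[ D_{\mathrm{KL}}\!\big(\mu(\cdot \mid x_{<t}) \,\big\|\, M(\cdot \mid x_{<t})\big) \right] \;\leq\; K(\mu)\ln 2.$$

The cut-off version replaces $K(\mu)$ with the description length of the best competitor program inside the restricted, runtime-limited class.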
Look at the bound for cut-off induction in that post I linked, maybe? Hutter might also have something on it.
Can also discuss on a call if you like.
Note that this doesn’t work in real life, where the programs are not in fact restricted to outputting bit string predictions and can e.g. try to trick the hardware they’re running on.
You also want one that generalises well, and doesn't do performative predictions, and doesn't have goals of its own. If your hypotheses aren't even intended to be reflections of reality, how do we know these properties hold?
Because we have the prediction error bounds.
When we compare theories, we don’t consider the complexity of all the associated approximations and abstractions. We just consider the complexity of the theory itself.
E.g. the theory of evolution isn't quite code for a costly simulation. But it can be viewed as a set of statements about such a simulation. And the way we compare the theory of evolution to alternatives doesn't involve comparing the complexity of the set of approximations we used to work out the consequences of each theory.
Yes.
That’s fine. I just want a computable predictor that works well. This one does.
Also, scientific hypotheses in practice aren’t actually simple code for a costly simulation we run. We use approximations and abstractions to make things cheap. Most of our science outside particle physics is about finding more effective approximations for stuff.
Edit: Actually, I don't think this would yield you a different general predictor as the program dominating the posterior. A general inductor program running some other program is pretty much never going to be the shortest implementation of that program.
If you make an agent by sticking together cut-off Solomonoff induction and e.g. causal decision theory, I do indeed buy that this agent will have problems. Because causal decision theory has problems.
Thank you for this summary.
I still find myself unconvinced by all the arguments against the Solomonoff prior I have encountered. For this particular argument, as you say, there are still many ways the conjectured counterexample of adversaria could fail if you actually tried to sit down and formalise it. Since the counterexample is designed to break a formalism that looks and feels really natural and robust to me, my guess is that the formalisation will indeed fall to one of these obstacles, or a different one.
In a way, that makes perfect sense; Solomonoff induction really can’t run in our universe! Any robot we could build to “use Solomonoff induction” would have to use some approximation, which the malign prior argument may or may not apply to.
You can just reason about Solomonoff induction with cut-offs instead. If you render the induction computable by giving it a uniform prior over all programs of some finite length $l_{\text{max}}$ [1] with runtime $\leq t_{\text{max}}$, it still seems to behave sanely. As in, you can derive analogs of the key properties of normal Solomonoff induction for this cut-off induction. E.g. the induction will not make more than about $l_p$ bits worth of prediction mistakes compared to any 'efficient predictor' program $p$ with runtime $\leq t_{\text{max}}$ and K-complexity $l_p$, it's got a rough invariance to what Universal Turing Machine you run it on, etc.
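To make the setup concrete, here's a minimal toy sketch of the kind of predictor I mean. It's hopelessly inefficient, and it assumes a hypothetical `run_program(code, n_bits, max_steps)` interpreter for some fixed universal machine; all the names here are mine:

```python
import itertools

def cutoff_induction_predict(history, l_max, t_max, run_program):
    """Toy cut-off induction: uniform prior over all programs of length
    <= l_max bits, each run for at most t_max steps. Programs whose output
    disagrees with `history` are discarded; the survivors vote on the next
    bit with their prior weights.

    `run_program(code, n_bits, max_steps)` is an assumed interpreter that
    returns the first `n_bits` output bits of `code`, or None on error/timeout.
    """
    weight_next = {0: 0.0, 1: 0.0}
    n = len(history)
    for length in range(1, l_max + 1):
        for bits in itertools.product((0, 1), repeat=length):
            out = run_program(bits, n + 1, t_max)
            if out is None or len(out) < n + 1:
                continue
            if tuple(out[:n]) == tuple(history):
                # Uniform prior: every program gets the same weight, so the
                # normalising constant cancels and we can just count survivors.
                weight_next[out[n]] += 1.0
    total = weight_next[0] + weight_next[1]
    if total == 0:
        return 0.5  # no surviving program: fall back to a uniform guess
    return weight_next[1] / total  # predicted probability that the next bit is 1
```

The real AIXI-tl-style constructions are more careful than this about how the runtime budget and program lengths interact, but this is the basic shape.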
Since finite, computable things are easier for me to reason about, I mostly use this cut-off induction in my mental toy models of AGI these days.
EDIT: Apparently, this exists in the literature under the name AIXI-tl. I didn’t know that. Neat.
So, no prefix-free requirement.
A quick google search says the male is the primary or exclusive breadwinner in a majority of married couples. Ass-pull number: the monetary costs alone are probably ~50% higher living costs. (Not a factor of two higher, because the living costs of two people living together are much less than double the living costs of one person. Also, I'm generally considering the no-kids case here; I don't feel as confused about couples with kids.)
But remember that you already conditioned on ‘married couples without kids’. My guess would be that in the subset of man-woman married couples without kids, the man being the exclusive breadwinner is a lot less common than in the set of all man-woman married couples. These properties seem like they’d be heavily anti-correlated.
In the subset of man-woman married couples without kids that get along, I wouldn’t be surprised if having a partner effectively works out to more money for both participants, because you’ve got two incomes, but less than 2x living expenses.
I was picturing an anxious attachment style as the typical female case (without kids). That’s unpleasant on a day-to-day basis to begin with, and I expect a lack of sex tends to make it a lot worse.
I am … not … picturing that as the typical case? Uh, I don’t know what to say here really. That’s just not an image that comes to mind for me when I picture ‘older hetero married couple’. Plausibly I don’t know enough normal people to have a good sense of what normal marriages are like.
Eyeballing Aella’s relationship survey data, a bit less than a third of respondents in 10-year relationships reported fighting multiple times a month or more. That was somewhat-but-not-dramatically less than I previously pictured. Frequent fighting is very prototypically the sort of thing I would expect to wipe out more-than-all of the value of a relationship, and I expect it to be disproportionately bad in relationships with little sex.
I think for many of those couples that fight multiple times a month, the alternative isn’t separating and finding other, happier relationships where there are never any fights. The typical case I picture there is that the relationship has some fights because both participants aren’t that great at communicating or understanding emotions, their own or other people’s. If they separated and found new relationships, they’d get into fights in those relationships as well.
It seems to me that lots of humans are just very prone to getting into fights. With their partners, their families, their roommates etc., to the point that they have accepted having lots of fights as a basic fact of life. I don’t think the correct takeaway from that is ‘Most humans would be happier if they avoided having close relationships with other humans.’
Less legibly… conventional wisdom sure sounds like most married men find their wife net-stressful and unpleasant to be around a substantial portion of the time, especially in the unpleasant part of the hormonal cycle, and especially especially if they’re not having much sex. For instance, there’s a classic joke about a store salesman upselling a guy a truck, after upselling him a boat, after upselling him a tackle box, after [...] and the punchline is “No, he wasn’t looking for a fishing rod. He came in looking for tampons, and I told him ‘dude, your weekend is shot, you should go fishing!’”.
Conventional wisdom also has it that married people often love each other so much they would literally die for their partner. I think 'conventional wisdom' is just a very big tent that has room for everything under the sun. If even 5-10% of married couples have bad relationships where the partners actively dislike each other, that'd be many millions of people in the English-speaking population alone. To me, that seems like more than enough people to generate a subset of well-known conventional wisdoms talking about how awful long-term relationships are.
Case in point, I feel like I hear those particular conventional wisdoms less commonly these days in the Western world. My guess is this is because long-term heterosexual marriage is no longer culturally mandatory, so there are fewer unhappy couples around generating conventional wisdoms about their plight.
So, next question for people who had useful responses (especially @Lucius Bushnaq and @yams): do you think the mysterious relationship stuff outweighs those kinds of costs easily in the typical case, or do you imagine the costs in the typical case are not all that high?
So, in summary, both, I think? I feel like the 'typical' picture of a hetero marriage you sketch is more like my picture of an 'unusually terrible' marriage. You condition on a bad sexual relationship, no children, the woman not earning money, and the man not even liking her, romantically or platonically. That subset of marriages sure sounds like it'd have a high chance of the man just walking away, barring countervailing cultural pressures. But I don't think most marriages where the sex isn't great are like that.
Sure, I agree that, as we point out in the post
Yes, sorry I missed that. The section is titled ‘Conclusions’ and comes at the end of the post, so I guess I must have skipped over it because I thought it was the post conclusion section rather than the high-frequency latents conclusion section.
As long as your evaluation metrics measure the thing you actually care about...
I agree with this. I just don’t think those autointerp metrics robustly capture what we care about.
Removing High Frequency Latents from JumpReLU SAEs
On a first read, this doesn’t seem principled to me? How do we know those high-frequency latents aren’t, for example, basis directions for dense subspaces or common multi-dimensional features? In that case, we’d expect them to activate frequently and maybe appear pretty uninterpretable at a glance. Modifying the sparsity penalty to split them into lower frequency latents could then be pathological, moving us further away from capturing the features of the model even though interpretability scores might improve.
That’s just one illustrative example. More centrally, I don’t understand how this new penalty term relates to any mathematical definition that isn’t ad-hoc. Why would the spread of the distribution matter to us, rather than simply the mean? If it does matter to us, why does it matter in roughly the way captured by this penalty term?
The standard SAE sparsity loss relates to minimising the description length of the activations. I suspect that isn’t the right metric to optimise for understanding models, but it is at least a coherent, non-ad-hoc mathematical object.
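For reference, this is roughly the kind of standard loss I have in mind; a minimal sketch of a vanilla ReLU SAE with an L1 sparsity penalty (PyTorch, my own placeholder names like `d_model`, `d_sae`, `l1_coeff`, and not the JumpReLU/L0 setup used in the post):

```python
import torch
import torch.nn as nn

class SimpleSAE(nn.Module):
    """Vanilla ReLU sparse autoencoder, for illustration only."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x: torch.Tensor):
        latents = torch.relu(self.enc(x))  # sparse latent activations
        recon = self.dec(latents)          # reconstruction of the input activations
        return recon, latents

def sae_loss(recon, latents, x, l1_coeff: float = 1e-3):
    recon_loss = (recon - x).pow(2).sum(dim=-1).mean()  # reconstruction error
    sparsity_loss = latents.abs().sum(dim=-1).mean()    # standard sparsity penalty
    return recon_loss + l1_coeff * sparsity_loss
```

The sparsity term is the part with the description-length reading; my question is what the analogous reading of the new frequency-spread penalty would be.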
EDIT: Oops, you address all that in the conclusion, I just can’t read.
Forgot to tell you this when you showed me the draft: The comp in sup paper actually had a dense construction for UAND included already. It works differently from the one you seem to have found, though, using Gaussian weights rather than binary weights.
I will continue to do what I love, which includes reading and writing and thinking about biosecurity and diseases and animals and the end of the world and all that, and I will scrape out my existence one way or another.
Thank you. As far as I’m aware we don’t know each other at all, but I really appreciate you working to do good.
I don’t think the risks of talking about the culture war have gone down. If anything, it feels like it’s yet again gotten worse. What exactly is risky to talk about has changed a bit, but that’s it. I’m more reluctant than ever to involve myself in culture war adjacent discussions.
This comment by Carl Feynman has a very crisp formulation of the main problem as I see it.
They’re measuring a noisy phenomenon, yes, but that’s only half the problem. The other half of the problem is that society demands answers. New psychology results are a matter of considerable public interest and you can become rich and famous from them. In the gap between the difficulty of supply and the massive demand grows a culture of fakery. The same is true of nutrition— everyone wants to know what the healthy thing to eat is, and the fact that our current methods are incapable of discerning this is no obstacle to people who claim to know.
For a counterexample, look at the field of planetary science. Scanty evidence dribbles in from occasional spacecraft missions and telescopic observations, but the field is intellectually sound because public attention doesn’t rest on the outcome.
So, the recipe for making a broken science you can't trust is:
1. The public cares a lot about answers to questions that fall within the science's domain.
2. The science currently has no good attack angles on those questions.
As you say, if a field is exposed to these incentives for a while, you get additional downstream problems like all the competent scientists who care about actual progress leaving. But I think that's a secondary effect. If you replaced all the psychology grads with physics and electrical engineering grads overnight, I'd expect you'd at best get a very brief period of improvement before the incentive gradient brought the field back to the status quo. On the other hand, if the incentives suddenly changed, I think reforming the field might become possible.
This suggests that if you wanted to found new parallel fields of nutrition, psychology etc. you could trust, you should consider:
1. Making it rare for journalists to report on your new fields. Maybe there's just a cultural norm against talking to the press and publishing on Twitter. Maybe people have to sign contracts about it if they want to get grants. Maybe the research is outright siloed because it is happening inside some company.
2. Finding funders who won't demand answers if answers can't be had. Seems hard. This might exclude most companies. The usual alternative is government and charity, but those tend to care too much about what the findings are. My model of how STEM manages to get useful funding out of them is that funding STEM is high-status, but STEM results are mostly too boring and removed from the public interest for the funders to get invested in them.
Relationship … stuff?
I guess I feel kind of confused by the framing of the question. I don’t have a model under which the sexual aspect of a long-term relationship typically makes up the bulk of its value to the participants. So, if a long-term relationship isn’t doing well on that front, and yet both participants keep pursuing the relationship, my first guess would be that it’s due to the value of everything that is not that. I wouldn’t particularly expect any one thing to stick out here. Maybe they have a thing where they cuddle and watch the sunrise together while they talk about their problems. Maybe they have a shared passion for arthouse films. Maybe they have so much history and such a mutually integrated life with partitioned responsibilities that learning to live alone again would be a massive labour investment, practically and emotionally. Maybe they admire each other. Probably there’s a mixture of many things like that going on. Love can be fed by many little sources.
So, this I suppose:
Their romantic partner offering lots of value in other ways. I’m skeptical of this one because female partners are typically notoriously high maintenance in money, attention, and emotional labor. Sure, she might be great in a lot of ways, but it’s hard for that to add up enough to outweigh the usual costs.
I don’t find it hard at all to see how that’d add up to something that vastly outweighs the costs, and this would be my starting guess for what’s mainly going on in most long-term relationships of this type.
This data seems to be for sexual satisfaction rather than romantic satisfaction or general relationship satisfaction.
How sub-light? I was mostly just guessing here, but if it’s below like 0.95c I’d be surprised.
It expands at light speed. That's fast enough that no computational processing can possibly occur before we're dead. Sure, there are branches where it maims us and then stops, but those are incredibly subdominant compared to branches where the tunneling doesn't happen.
Yes, you can make suicide machines very reliable and fast. I claim that whether your proposed suicide machine actually is reliable does in fact matter for determining whether you are likely to find yourself maimed. Making suicide machines that are synchronised earth-wide seems very difficult with current technology.
This. The struggle is real. My brain has started treating publishing a LessWrong post almost the way it'd treat publishing a paper. An acquaintance got upset at me once because they thought I hadn't provided sufficient discussion of their related LessWrong post in mine. Shortforms are the place I still feel safe just writing things.
It makes sense to me that this happened. AI Safety doesn’t have a journal, and training programs heavily encourage people to post their output on LessWrong. So part of it is slowly becoming a journal, and the felt social norms around posts are morphing to reflect that.
I don’t think anything in the linked passage conflicts with my model of anticipated experience. My claim is not that the branch where everyone dies doesn’t exist. Of course it exists. It just isn’t very relevant for our future observations.
To briefly factor out the quantum physics here, because they don’t actually matter much:
If someone tells me that they will create a copy of me while I'm anesthetized and unconscious, and put one of me in a room with red walls, and another of me in a room with blue walls, my anticipated experience is that I will wake up to see red walls with p=0.5 and blue walls with p=0.5. Because the set of people who will wake up and remember being me and getting anesthetized has size 2 now, and until I look at the walls I won't know which of them I am.
If someone tells me that they will create a copy of me while I'm asleep, but they won't copy the brain, making it functionally just a corpse, then put the corpse in a room with red walls, and me in a room with blue walls, my anticipated experience is that I will wake up to see blue walls with p=1.0. Because the set of people who will wake up and remember being me and going to sleep has size 1. There is no chance of me 'being' the corpse any more than there is a chance of me 'being' a rock. If the copy does include a brain, but the brain gets blown up with a bomb before the anaesthesia wears off, that doesn't change anything. I'd see blue walls with p=1.0, not see blue walls with p=0.5 and 'not experience anything' with p=0.5.
The same basic principle applies to the copies of you that are constantly created as the wavefunction decoheres. The probability math in that case is slightly different because you're dealing with uncertainty over a vector space rather than uncertainty over a set, so what matters is the squares of the amplitudes of the branches that contain versions of you. E.g. if there are three branches, one in which you die, amplitude $c_1$, one in which you wake up to see red walls, amplitude $c_2$, and one in which you wake up to see blue walls, amplitude $c_3$, you'd see blue walls with probability ca. $|c_3|^2 / (|c_2|^2 + |c_3|^2)$ and red walls with probability ca. $|c_2|^2 / (|c_2|^2 + |c_3|^2)$.[1]
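To illustrate with made-up numbers (mine, just to show the arithmetic): with $c_1 = 0.6$, $c_2 = 0.64$, $c_3 = 0.48$, the squared amplitudes sum to $0.36 + 0.4096 + 0.2304 = 1$, and conditioning on the branches where you wake up gives

$$P(\text{red}) = \frac{0.4096}{0.64} = 0.64, \qquad P(\text{blue}) = \frac{0.2304}{0.64} = 0.36.$$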
If you start making up scenarios that involve both wave function decoherence and having classical copies of you created, you’re dealing with probabilities over vector spaces and probabilities over sets at the same time. At that point, you probably want to use density matrices to do calculations.
There may be a sense in which amplitude is a finite resource. Decay your branch enough, and your future anticipated experience might come to be dominated by some alien with higher amplitude simulating you, or even just by your inner product with quantum noise in a more mainline branch of the wave function. At that point, you lose pretty much all ability to control your future anticipated experience. Which seems very bad. This is a barrier I ran into when thinking about ways to use quantum immortality to cheat heat death.
I think the value proposition of AI 2027-style work lies largely in communication. Concreteness helps people understand things better. The details are mostly there to provide that concreteness, not to actually be correct.
If you imagine the set of possible futures that people like Daniel, you or I think plausible as big distributions, with high entropy and lots of unknown latent variables, the point is that the best way to start explaining those distributions to people outside the community is to draw a sample from them and write it up. This is a lot of work, but it really does seem to help. My experience matches habryka’s here. Most people really want to hear concrete end-to-end scenarios, not abstract discussion of the latent variables in my model and their relationships.