AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
Forgot to tell you this when you showed me the draft: The comp in sup paper actually had a dense construction for UAND included already. It works differently than the one you seem to have found though, using Gaussian weights rather than binary weights.
I will continue to do what I love, which includes reading and writing and thinking about biosecurity and diseases and animals and the end of the world and all that, and I will scrape out my existence one way or another.
Thank you. As far as I’m aware we don’t know each other at all, but I really appreciate you working to do good.
I don’t think the risks of talking about the culture war have gone down. If anything, it feels like it’s yet again gotten worse. What exactly is risky to talk about has changed a bit, but that’s it. I’m more reluctant than ever to involve myself in culture war adjacent discussions.
This comment by Carl Feynman has a very crisp formulation of the main problem as I see it.
They’re measuring a noisy phenomenon, yes, but that’s only half the problem. The other half of the problem is that society demands answers. New psychology results are a matter of considerable public interest and you can become rich and famous from them. In the gap between the difficulty of supply and the massive demand grows a culture of fakery. The same is true of nutrition— everyone wants to know what the healthy thing to eat is, and the fact that our current methods are incapable of discerning this is no obstacle to people who claim to know.
For a counterexample, look at the field of planetary science. Scanty evidence dribbles in from occasional spacecraft missions and telescopic observations, but the field is intellectually sound because public attention doesn’t rest on the outcome.
So, the recipe for making a broken science you can't trust is:
The public cares a lot about answers to questions that fall within the science’s domain.
The science currently has no good attack angles on those questions.
As you say, if a field is exposed to these incentives for a while, you get additional downstream problems like all the competent scientists who care about actual progress leaving. But I think that's a secondary effect. If you replaced all the psychology grads with physics and electrical engineering grads overnight, I'd expect you'd at best get a very brief period of improvement before the incentive gradient brought the field back to the status quo. On the other hand, if the incentives suddenly changed, I think reforming the field might become possible.
This suggests that if you wanted to found new parallel fields of nutrition, psychology etc. you could trust, you should consider:
Making it rare for journalists to report on your new fields. Maybe there’s just a cultural norm against talking to the press and publishing on Twitter. Maybe people have to sign contracts about it if they want to get grants. Maybe the research is outright siloed because it is happening inside some company.
Finding funders who won't demand answers if answers can't be had. Seems hard. This might exclude most companies. The usual alternative is government and charity, but those tend to care too much about what the findings are. My model of how STEM manages to get useful funding out of them is that funding STEM is high-status, but STEM results are mostly too boring and removed from the public interest for the funders to get invested in them.
Relationship … stuff?
I guess I feel kind of confused by the framing of the question. I don’t have a model under which the sexual aspect of a long-term relationship typically makes up the bulk of its value to the participants. So, if a long-term relationship isn’t doing well on that front, and yet both participants keep pursuing the relationship, my first guess would be that it’s due to the value of everything that is not that. I wouldn’t particularly expect any one thing to stick out here. Maybe they have a thing where they cuddle and watch the sunrise together while they talk about their problems. Maybe they have a shared passion for arthouse films. Maybe they have so much history and such a mutually integrated life with partitioned responsibilities that learning to live alone again would be a massive labour investment, practically and emotionally. Maybe they admire each other. Probably there’s a mixture of many things like that going on. Love can be fed by many little sources.
So, this I suppose:
Their romantic partner offering lots of value in other ways. I’m skeptical of this one because female partners are typically notoriously high maintenance in money, attention, and emotional labor. Sure, she might be great in a lot of ways, but it’s hard for that to add up enough to outweigh the usual costs.
I don’t find it hard at all to see how that’d add up to something that vastly outweighs the costs, and this would be my starting guess for what’s mainly going on in most long-term relationships of this type.
This data seems to be for sexual satisfaction rather than romantic satisfaction or general relationship satisfaction.
How sub-light? I was mostly just guessing here, but if it’s below like 0.95c I’d be surprised.
It expands at light speed. That's fast enough that no computational processing can possibly occur before we're dead. Sure, there are branches where it maims us and then stops, but these are incredibly subdominant compared to branches where the tunneling doesn't happen.
Yes, you can make suicide machines very reliable and fast. I claim that whether your proposed suicide machine actually is reliable does in fact matter for determining whether you are likely to find yourself maimed. Making suicide machines that are synchronised earth-wide seems very difficult with current technology.
This. The struggle is real. My brain has started treating publishing a LessWrong post almost the way it'd treat publishing a paper. An acquaintance got upset at me once because they thought I hadn't provided sufficient discussion of their related LessWrong post in mine. Shortforms are the place I still feel safe just writing things.
It makes sense to me that this happened. AI Safety doesn’t have a journal, and training programs heavily encourage people to post their output on LessWrong. So part of it is slowly becoming a journal, and the felt social norms around posts are morphing to reflect that.
I don’t think anything in the linked passage conflicts with my model of anticipated experience. My claim is not that the branch where everyone dies doesn’t exist. Of course it exists. It just isn’t very relevant for our future observations.
To briefly factor out the quantum physics here, because it doesn't actually matter much:
If someone tells me that they will create a copy of me while I'm anesthetized and unconscious, and put one of me in a room with red walls, and another of me in a room with blue walls, my anticipated experience is that I will wake up to see red walls with p=0.5 and blue walls with p=0.5. Because the set of people who will wake up and remember being me and getting anesthetized has size 2 now, and until I look at the walls I won't know which of them I am.
If someone tells me that they will create a copy of me while I'm asleep, but they won't copy the brain, making it functionally just a corpse, then put the corpse in a room with red walls, and me in a room with blue walls, my anticipated experience is that I will wake up to see blue walls with p=1.0. Because the set of people who will wake up and remember being me and going to sleep has size 1. There is no chance of me 'being' the corpse any more than there is a chance of me 'being' a rock. If the copy does include a brain, but the brain gets blown up with a bomb before the anaesthesia wears off, that doesn't change anything. I'd see blue walls with p=1.0, not see blue walls with p=0.5 and 'not experience anything' with p=0.5.
The same basic principle applies to the copies of you that are constantly created as the wavefunction decoheres. The probability math in that case is slightly different because you're dealing with uncertainty over a vector space rather than uncertainty over a set, so what matters is the squares of the amplitudes of the branches that contain versions of you. E.g. if there are three branches, one in which you die, amplitude $c_1$, one in which you wake up to see red walls, amplitude $c_2$, and one in which you wake up to see blue walls, amplitude $c_3$, you'd see blue walls with probability ca. $\frac{|c_3|^2}{|c_2|^2+|c_3|^2}$ and red walls with probability $\frac{|c_2|^2}{|c_2|^2+|c_3|^2}$.[1]
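To make that concrete with made-up numbers (purely illustrative): if the amplitudes were

$$c_1 = \sqrt{0.5}, \qquad c_2 = \sqrt{0.2}, \qquad c_3 = \sqrt{0.3},$$

then

$$P(\text{blue}) = \frac{|c_3|^2}{|c_2|^2+|c_3|^2} = \frac{0.3}{0.5} = 0.6, \qquad P(\text{red}) = \frac{0.2}{0.5} = 0.4,$$

with the branch where you die dropping out of the renormalisation entirely.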
If you start making up scenarios that involve both wave function decoherence and having classical copies of you created, you’re dealing with probabilities over vector spaces and probabilities over sets at the same time. At that point, you probably want to use density matrices to do calculations.
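Concretely, the standard recipe would be something like this (standard quantum mechanics, nothing specific to these scenarios): put the classical copying uncertainty into a mixture $\rho = \sum_k p_k \, |\psi_k\rangle\langle\psi_k|$, let $\Pi_o$ project onto the branches in which a version of you observes outcome $o$, and take

$$P(o) \;=\; \frac{\mathrm{Tr}(\rho\, \Pi_o)}{\sum_{o'} \mathrm{Tr}(\rho\, \Pi_{o'})},$$

where the sum in the denominator runs only over outcomes that still contain someone who remembers being you, mirroring the renormalisation above.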
There may be a sense in which amplitude is a finite resource. Decay your branch enough, and your future anticipated experience might come to be dominated by some alien with higher amplitude simulating you, or even just by your inner product with quantum noise in a more mainline branch of the wave function. At that point, you lose pretty much all ability to control your future anticipated experience. Which seems very bad. This is a barrier I ran into when thinking about ways to use quantum immortality to cheat heat death.
I don’t think so. You only need one alien civilisation in our light cone to have preferences about the shape of the universal wave function rather than their own subjective experience for our light cone to get eaten. E.g. a paperclip maximiser might want to do this.
Also, the Fermi paradox isn't really a thing.
No, because getting shot has a lot of outcomes that do not kill you but do cripple you. Vacuum decay should tend to have extremely few of those. It’s also instant, alleviating any lingering concerns about identity one might have in a setup where death is slow and gradual. It’s also synchronised to split off everyone hit by it into the same branch, whereas, say, a very high-yield bomb wired to a random number generator that uses atmospheric noise would split you off into a branch away from your friends.[1]
I’m not unconcerned about vacuum decay, mind you. It’s not like quantum immortality is all confirmed and the implications worked out well in math.[2]
They’re still there for you of course, but you aren’t there for most of them. Because in the majority of their anticipated experience, you explode.
Sometimes I think about the potential engineering applications of quantum immortality in a mature civilisation for fun. Controlled, synchronised civilisation-wide suicide seems like a neat way to transform many engineering problems into measurement problems.
Since I didn’t see it brought up on a skim: One reason me and a lot of my physicist friends aren’t that concerned about vacuum decay is many-worlds. Since the decay is triggered by quantum tunneling and propagates at light speed, it’d be wiping out earth in one wavefunction branch that has amplitude roughly equal to the amplitude of the tunneling, while the decay just never happens in the other branches. Since we can’t experience being dead, this wouldn’t really affect our anticipated future experiences in any way. The vacuum would just never decay from our perspective.
So, if the vacuum were confirmed to be very likely meta-stable, and the projected base rate of collapses was confirmed to be high enough that it ought to have happened a lot already, we’d have accidentally stumbled into a natural and extremely clean experimental setup for testing quantum immortality.
I disagreed with Gwern at first. I'm increasingly forced to admit there's something like bipolar going on here.
What changed your mind? I don’t know any details about the diagnostic criteria for bipolar besides those you and Gwern brought up in that debate. But looking at the points you made back then, it’s unclear to me which of them you’d consider to be refuted or weakened now.
Musk’s ordinary behavior—intense, risk-seeking, hard-working, grandiose, emotional—does resemble symptoms of hypomania (full mania would usually involve psychosis, and even at his weirdest Musk doesn’t meet the clinical definition for this).
But hypomania is usually temporary and rare. A typical person with bipolar disorder might have hypomania for a week or two, once every few years. Musk is always like this. Bipolar disorder usually starts in one’s teens. But Musk was like this even as a child.
....
His low periods might meet criteria for a mixed episode. But a bipolar disorder that starts in childhood, continues all the time, has no frank mania, and has only mixed episodes instead of depression—doesn’t really seem like bipolar disorder to me. I’m not claiming there’s nothing weird about him, or that he doesn’t have extreme mood swings. I’m just saying it is not exactly the kind of weirdness and mood swings I usually associate with bipolar.
...
I notice the non-psychiatrists (including very smart people I usually trust) lining up on one side, and the psychiatrists on the other. I think this is because Musk fits a lot of the explicit verbally described symptoms of the condition, but doesn’t resemble real bipolar patients.
...
This isn’t how I expect bipolar to work. There is no “switch flipping” (except very occasionally when a manic episode follows directly after a depressive one). A patient will be depressed for weeks or months, then gradually come out of it, and after weeks or months of coming out of it, get back to normal. Being “moody” in the sense of having mood swings is kind of the opposite of bipolar; I would associate it more with borderline or PTSD.
Based on my understanding of what you are doing, the statement in the OP that $\lambda$ in your setting is "sort of" K-complexity is a bit misleading?
Yes, I guess it is. In my (weak) defence, I did put a ‘(sort of)’ in front of that.
In my head, the relationship between the learning coefficient and the K-complexity here seems very similar-ish to the relationship between the K-complexities of a hypothesis expressed on two different UTMs.
If we have a UTM $M_1$ and a different UTM $M_2$, we know that $K_{M_2}(f) \leq K_{M_1}(f) + c_{M_1 \to M_2}$ for some constant that doesn't depend on $f$, because if nothing else we can simulate UTM $M_1$ on UTM $M_2$ and compute $f$ on the simulated $M_1$. But in real life, we'd usually expect the actual shortest program that implements $f$ on $M_2$ to not involve jumping through hoops like this.
In the case of translating between a UTM and a different sort of Turing-complete model of computation, namely a recurrent neural network[1], I was expecting a similar sort of dynamic: If nothing else, we can always implement $f$ on the NN by simulating a UTM, and running $f$ on that simulated UTM. So the lowest-LLC parameter configuration that implements $f$ on the NN has to have an LLC that is as small as or smaller than the LLC of a parameter configuration that implements $f$ through this simulation route. Or that was the intuition I had starting out, anyway.
If I understand correctly you are probably doing something like:
Seems broadly right to me except:
Third bullet point: I don’t know what you mean by a “smooth relaxation” precisely. So while this sounds broadly correct to me as a description of what I do, I can’t say for sure.
Sixth bullet point: You forgot the offset term for simulating the UTM on the transformer. Also, I think I'd get a constant prefactor before $K(f)$. Even if I'm right that the prefactor I have right now could be improved, I'd still expect at least some prefactor larger than 1 here.
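Schematically, the shape of bound I have in mind is then something like

$$\hat{\lambda}(w^*_f) \;\lesssim\; \alpha \, K(f) + c_{\mathrm{UTM}},$$

where $w^*_f$ is the lowest-LLC parameter configuration implementing $f$, $c_{\mathrm{UTM}}$ is the offset term for simulating the UTM, and $\alpha$ is the constant prefactor. (Symbols here are just shorthand for this comment, not notation from the post.)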
I'd caution that the exact relation to the learning coefficient and the LLC is the part of this story I'm still the least confident about at the moment. As the intro said:
This post is my current early-stage sketch of the proof idea. Don’t take it too seriously yet. I’m writing this out mostly to organise my own thoughts.
I’ve since gotten proof sketches for most of the parts here, including the upper bound on the LLC, so I am a bit more confident now. But they’re still hasty scrawlings.
you are treating the iid case
I am not sure whether I am? I'm a bit unclear on what you mean by iid in this context exactly. The setup does not seem to me to require different inputs to be independent of each other. It does assume that each label is a function of its corresponding input rather than some other input. So, label $y_i$ can depend on input $x_i$, but it can only depend on $x_{j\neq i}$ in a manner mediated by $x_i$. In other words, the joint probability distribution over inputs can be anything, but the labels must be iid conditioned on their inputs. I think. Is that what you meant?
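In symbols (just restating the above, in notation I'm picking for this comment):

$$p(y_1,\dots,y_n \mid x_1,\dots,x_n) \;=\; \prod_{i=1}^{n} p(y_i \mid x_i),$$

with no restriction on the joint input distribution $p(x_1,\dots,x_n)$.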
From your message it seems like you think the global learning coefficient might be lower than $K(f)$, but that locally at a code the local learning coefficient might be somehow still to do with description length? So that the LLC in your case is close to something from AIT. That would be surprising to me, and somewhat in contradiction with e.g. the idea from simple versus short that the LLC can be lower than "the number of bits used" when error-correction is involved (and this being a special case of a much broader set of ways the LLC could be lowered).
I have been brooding over schemes to lower the bound I sketched above using activation error-correction blocks. Still unclear to me at this stage whether this will work or not. I'd say this and the workability of other schemes to get rid of the prefactor to $K(f)$ in the bound are probably the biggest sources of uncertainty about this at the moment.
If schemes like this work, the story here probably ends up as something more like '$\lambda$ is related to the number of bits in the parameters we need to fix to implement $f$ on the transformer.'
In that case, you’d be right, and the LLC would be lower, because in the continuum limit we can store an arbitrary number of bits in a single parameter.
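(Just to spell out that continuum-limit point with a textbook illustration, nothing specific to this setup: a single real parameter $\theta \in [0,1)$ can encode any finite bit string $b_1 \dots b_n$ via its binary expansion, $\theta = \sum_{i=1}^{n} b_i 2^{-i}$, with $b_i = \lfloor 2^i \theta \rfloor \bmod 2$ recovering the bits, so with unlimited precision there is no bound on bits per parameter.)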
I think I went into this kind of expecting that to be true. Then I got surprised when using less than one effective parameter per bit of storage in the construction turned out to be less straightforward than I’d thought once I actually engaged with the details. Now, I don’t know what I’ll end up finding.
Well, transformers are not actually Turing complete in real life where parameters aren't real numbers, because if you want an unbounded context window to simulate unbounded tape, you eventually run out of space for positional encodings. But the number of bits they can hold in memory does grow exponentially with the residual stream width, which seems good enough to me. Real computers don't have infinite memory either.
Kind of? I'd say the big differences are:
Experts are pre-wired to have a certain size, whereas components can vary in size, from a tiny query-key lookup for a single fact to large modules.
IIRC, MOE networks use a gating function to decide which experts to query. If you ignored this gating and just used all the experts, I think that'd break the model. In contrast, you can use all APD components on a forward pass if you want; most of them just won't affect the result much (rough sketch of the contrast below).
MOE experts don't completely ignore 'simplicity' as we define it in the paper though. A single expert is simpler than the whole MOE network in that it has lower rank: fewer numbers are required to describe its state on any given forward pass.
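To sketch that gating contrast in toy code (my own illustration with made-up linear experts and components, not code from the APD paper or any real MoE implementation):

```python
import torch

def moe_forward(x, experts, gate, k=2):
    # Sparse MoE as I understand it: the gate scores experts per input and only the
    # top-k actually run. Dropping the gate and summing every expert would compute a
    # different (likely broken) function.
    scores = torch.softmax(gate(x), dim=-1)        # [batch, n_experts]
    top_w, top_idx = scores.topk(k, dim=-1)        # pick k experts per input
    out = torch.zeros_like(x)                      # assumes experts map d -> d
    for b in range(x.shape[0]):
        for w, i in zip(top_w[b], top_idx[b]):
            out[b] += w * experts[int(i)](x[b])
    return out

def sum_of_components_forward(x, components):
    # Toy APD-flavoured picture: every component can be applied on every forward
    # pass; the ones that aren't relevant just contribute approximately nothing.
    return sum(c(x) for c in components)
```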
Why would this be restricted to cyber attacks? If the CCP believed that ASI was possible, even if they didn’t believe in the alignment problem, the US developing an ASI would plausibly constitute an existential threat to them. It’d mean they lose the game of geopolitics completely and permanently. I don’t think they’d necessarily restrict themselves to covert sabotage in such a situation.
The possibility of stability through dynamics like mutually assured destruction has been where a lot of my remaining hope on the governance side has come from for a while now.
A big selling point of this for me is that it does not strictly require countries to believe both that ASI is possible and that the alignment problem is real. Just believing that ASI is possible is enough.
On a first read, this doesn’t seem principled to me? How do we know those high-frequency latents aren’t, for example, basis directions for dense subspaces or common multi-dimensional features? In that case, we’d expect them to activate frequently and maybe appear pretty uninterpretable at a glance. Modifying the sparsity penalty to split them into lower frequency latents could then be pathological, moving us further away from capturing the features of the model even though interpretability scores might improve.
That’s just one illustrative example. More centrally, I don’t understand how this new penalty term relates to any mathematical definition that isn’t ad-hoc. Why would the spread of the distribution matter to us, rather than simply the mean? If it does matter to us, why does it matter in roughly the way captured by this penalty term?
The standard SAE sparsity loss relates to minimising the description length of the activations. I suspect that isn’t the right metric to optimise for understanding models, but it is at least a coherent, non-ad-hoc mathematical object.
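For reference, a minimal sketch of the kind of standard SAE objective I mean here (a generic reconstruction-plus-L1 setup; names, shapes, and the l1_coeff value are placeholders, not from any particular paper):

```python
import torch.nn.functional as F

def standard_sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    # x: [batch, d_model] activations to be decomposed.
    f = F.relu(x @ W_enc + b_enc)          # latent activations, [batch, n_latents]
    x_hat = f @ W_dec + b_dec              # reconstruction, [batch, d_model]
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()  # the L1 sparsity term discussed above
    return recon + l1_coeff * sparsity
```

It's the expectation of this L1 term that connects to the description-length framing mentioned above.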