Just to clarify, do you mean something like “elephant = grey + big + trunk + ears + African + mammal + wise”, so that to encode a tiny elephant you would have “grey + tiny + trunk + ears + African + mammal + wise”, which the model could still read off as 0.86 elephant when relevant, but also as tiny when relevant?
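A toy numpy sketch of the kind of read-off I have in mind (the random directions, the attribute list, and the numbers that fall out are all illustrative inventions, not anything from the post):

```python
# Illustrative toy only: random unit vectors stand in for attribute features,
# and concepts are read off with dot products.
import numpy as np

rng = np.random.default_rng(0)
d_model = 256

def unit(v):
    return v / np.linalg.norm(v)

names = ["grey", "big", "tiny", "trunk", "ears", "african", "mammal", "wise"]
features = {name: unit(rng.normal(size=d_model)) for name in names}

# "Elephant-ness" read off as average agreement with its attribute bundle.
elephant_attrs = ["grey", "big", "trunk", "ears", "african", "mammal", "wise"]
elephant_reader = sum(features[a] for a in elephant_attrs) / len(elephant_attrs)

# A "tiny elephant": the same bundle but with "big" swapped for "tiny".
tiny_elephant = sum(features[a] for a in
                    ["grey", "tiny", "trunk", "ears", "african", "mammal", "wise"])

print("elephant readout:", elephant_reader @ tiny_elephant)   # ~6/7 ≈ 0.86, up to interference
print("tiny readout:    ", features["tiny"] @ tiny_elephant)  # ~1: the size attribute stays legible
```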
I think you should pay in Counterfactual Mugging, and this is one of the most common Newcomblike problem classes in real life.
Example: you find a wallet on the ground. You can, from least to most prosocial:
Take it and steal the money from it
Leave it where it is
Take it and make an effort to return it to its owner
Let’s ignore the first option (suppose we’re not THAT evil). The universe has randomly selected you today to be in the position where your only options are to spend some resources for no personal gain, or not. In a parallel universe, perhaps it was your pocket that had the hole in it, and a random person has come across your wallet.
Firstly, what they might be thinking is “Would this person do the same for me?”
Secondly, in a society which wins, people return each others’ wallets.
You might object that this is different from the Mugging, because you’re directly helping someone else in this case. But I would counter that the Mugging is the true version of this problem, one where you have no crutch of empathy to help you, so your decision theory alone is tested.
I have added a link to the report now.
As to your point: this is one of the better arguments I’ve heard that welfare ranges might be similar between animals. Still, I don’t think it squares well with the actual nature of the brain. Saying there’s a single suffering computation would make sense if the brain were like a CPU, where one core does the thinking, but in fact all of the neurons in the brain are firing at once and doing computations at the same time. So it makes much more sense to me to think that the more neurons are computing some sort of suffering, the greater the intensity of suffering.
Good point, edited a link to the Google Doc into the post.
From Rethink Priorities:
We used Monte Carlo simulations to estimate, for various sentience models and across eighteen organisms, the distribution of plausible probabilities of sentience.
We used a similar simulation procedure to estimate the distribution of welfare ranges for eleven of these eighteen organisms, taking into account uncertainty in model choice, the presence of proxies relevant to welfare capacity, and the organisms’ probabilities of sentience (equating this probability with the probability of moral patienthood)
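For concreteness, a minimal sketch of the general shape of such a procedure; the model names, credences, ranges, and distributions below are placeholders I made up, not RP’s actual inputs:

```python
# Placeholder numbers throughout: sample a welfare range by (1) sampling which
# welfare model is "true", (2) sampling a welfare range under that model, and
# (3) discounting by a sampled probability of sentience.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

model_credences = np.array([0.4, 0.35, 0.25])    # hypothetical credences over models
model_ranges = [rng.uniform(0.0, 0.1, N),        # e.g. a neuron-count-weighted model
                rng.uniform(0.1, 0.6, N),        # e.g. a behavioural-proxy model
                rng.uniform(0.4, 1.0, N)]        # e.g. an "equal ranges" model

chosen = rng.choice(3, size=N, p=model_credences)
raw_range = np.choose(chosen, model_ranges)

p_sentience = rng.beta(2, 4, N)                  # uncertain probability of sentience
welfare_range = raw_range * p_sentience

print("median:", np.median(welfare_range),
      "90% interval:", np.percentile(welfare_range, [5, 95]))
```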
Now with the disclaimer that I do think that RP are doing good and important work and are one of the few organizations seriously thinking about animal welfare priorities...
Their epistemics led them to run a Monte Carlo simulation to determine whether organisms are capable of suffering (and if so, how much), get a value of 5 shrimp = 1 human, and then not bat an eye at this number.
Neither a physicalist nor a functionalist theory of consciousness can reasonably justify a number like this. Shrimp have 5 orders of magnitude fewer neurons than humans, so whether suffering is the result of a physical process or an information processing one, this implies that shrimp neurons do 4 orders of magnitude more of this process per second than human neurons. The authors get around this by refusing to stake themselves on any theory of consciousness.
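Spelling out the arithmetic behind that claim: if we write welfare range as $W = N \cdot s$ (neuron count times average per-neuron contribution, my shorthand for the comment’s argument rather than any model RP endorses), then with $N_{\text{human}}/N_{\text{shrimp}} \approx 10^5$ and $W_{\text{human}}/W_{\text{shrimp}} \approx 5$,

$$\frac{s_{\text{shrimp}}}{s_{\text{human}}} = \frac{N_{\text{human}}/N_{\text{shrimp}}}{W_{\text{human}}/W_{\text{shrimp}}} \approx \frac{10^{5}}{5} \approx 10^{4.3},$$

i.e. each shrimp neuron would have to contribute roughly 4 orders of magnitude more of the relevant process than a human neuron.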
The overall structure of the RP welfare range report does not cut to the truth; instead, the core mental motion seems to be to engage with as many existing pieces of work as possible. Credence is doled out to different schools of thought and pieces of evidence in a way which seems more like appeasement, lip service, or a “well, these guys have done some work, who are we to disrespect them by ignoring it” attitude. Removal of noise is one of the most important functions of meta-analysis, and it is largely absent here.
The result of this is an epistemology where the accuracy of a piece of work is a monotonically increasing function of the number of sources, theories, and lines of argument. That is fine if your desired output is a very long Google Doc and a disclaimer to yourself (and, more cynically, to your funders) that “No no, we did everything right, we reviewed all the evidence and took it all into account,” but it’s pretty bad if you want to actually be correct.
I grow increasingly convinced that the epistemics of EA are not especially good, are worsening, and are already insufficient for working on the relatively low-stakes and easy issue of animal welfare (as compared to AI x-risk).
If we approximate an MLP layer with a bilinear layer, then the effect of residual stream features on the MLP output can be expressed as a second-order polynomial over the feature coefficients $f_i$. This will contain, for each feature, an $f_i^2 v_i + f_i w_i$ term, which is “baked into” the residual stream after the MLP acts. Just looking at the linear term, this could be the source of Anthropic’s observations of features growing, shrinking, and rotating in their original crosscoder paper. https://transformer-circuits.pub/2024/crosscoders/index.html
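Spelled out, under a bilinear approximation $\mathrm{MLP}(x) \approx (W_1 x + b_1) \odot (W_2 x + b_2)$ (my notation; $d_i$ is feature $i$’s direction, so $x = \sum_i f_i d_i$), the terms involving only feature $i$ are

$$f_i^2 \underbrace{(W_1 d_i \odot W_2 d_i)}_{v_i} + f_i \underbrace{(b_1 \odot W_2 d_i + b_2 \odot W_1 d_i)}_{w_i},$$

plus cross-terms of the form $f_i f_j (W_1 d_i \odot W_2 d_j + W_1 d_j \odot W_2 d_i)$ between pairs of features.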
That might be true, but I’m not sure it matters. For an AI to learn an abstraction it will have a finite amount of training time, context length, search-space width (if we’re doing parallel search, as with o3), etc., and it’s not clear how abstraction height will scale with those.
Empirically, I think lots of people have the experience of “hitting a wall”, where they can learn abstraction level n-1 easily from a class; abstraction level n takes significant study/help; and abstraction level n+1 is not achievable for them within a reasonable time. So it seems like the time requirement may scale quite rapidly with abstraction level?
I second this, it could easily be things which we might describe as “amount of information that can be processed at once, including abstractions” which is some combination of residual stream width and context length.
Imagine an AI can do a task that takes 1 hour. To remain coherent over 2 hours, it could either use twice as much working memory or compress it into a higher level of abstraction. Humans seem to struggle with abstraction in a fairly continuous way (some people get stuck at algebra; some CS students make it all the way to recursion and then hit a wall; some physics students can handle first quantization but not second quantization), which sorta implies there’s a maximum abstraction stack height which a mind can handle, and that it varies continuously.
Only partially relevant, but it’s exciting to hear a new John/David paper is forthcoming!
Everything I Know About Semantics I Learned From Music Notation
Furthermore: normalizing your data to variance=1 will change your PCA line (if the X and Y variances are different) because the relative importance of X and Y distances will change!
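A quick sketch of this with sklearn (made-up correlated data where X has much larger variance than Y):

```python
# First PCA direction on raw 2-D data vs. the same data standardized to variance 1 per axis.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(scale=10.0, size=500)             # high-variance axis
y = 0.1 * x + rng.normal(scale=1.0, size=500)    # correlated, low-variance axis
data = np.column_stack([x, y])

raw_dir = PCA(n_components=1).fit(data).components_[0]

scaled = (data - data.mean(axis=0)) / data.std(axis=0)   # variance = 1 on each axis
scaled_dir = PCA(n_components=1).fit(scaled).components_[0]

print("raw PCA direction:   ", raw_dir)     # hugs the high-variance X axis
print("scaled PCA direction:", scaled_dir)  # much closer to 45°: X and Y now count equally
```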
Thanks for writing this up. As someone who was not aware of the eye thing, I think it’s a good illustration of the level that the Zizians are on, i.e. misunderstanding key facts about the neurology that is central to their worldview.
My model of double-hemisphere stuff, DID, tulpas, and the like is somewhat null-hypothesis-ish. The strongest version is something like this:
At the upper levels of predictive coding, the brain keeps track of really abstract things about yourself. Think “ego” “self-conception” or “narrative about yourself”. This is normally a model of your own personality traits, which may be more or less accurate. But there’s no particular reason why you couldn’t build a strong self-narrative of having two personalities, a sub-personality, or more. If you model yourself as having two personalities who can’t access each other’s memories, then maybe you actually just won’t perform the query-key lookups to access the memories.
Like I said, this doesn’t rule out a large amount of probability mass, but it does explain some things, fits in with my other views, and hopefully, if someone has had or been close to experiences kinda like DID or Zizianism or tulpas, it provides a less horrifying way of thinking about them. Some of the reports in this area are a bit infohazardous, and I think this null model at least partially defuses those infohazards.
This is a very interesting point. I have upvoted this post even though I disagree with it, because I think the question of “Who will pay, and how much will they pay, to restrict others’ access to AI?” is important.
My instinct is that this won’t happen, because there are too many AI companies for this deal to work on all of them, and some of these AI companies will have strong kinda-ideological commitments to not doing this. Also, my model of (e.g. OpenAI) is that they want to eat as much of the world’s economy as possible, and this is better done by selling (even at a lower revenue) to anyone who wants an AI SWE than selling just to Oracle.
o4 (God, I can’t believe I’m already thinking about o4) as a B2B SaaS project seems unlikely to me. Specifically, I’d put <30% odds that the o4 series has its prices jacked up or its API access restricted in order to allow some companies to monopolize its usage for more than 3 months without an open release. This won’t apply if the only models in the o4 series cost $1000s per answer to serve, since that’s just a “normal” kind of expensive.
Then, we have to consider that other labs are 1-1.5 years behind, and it’s hard to imagine Meta (for example) doing this in anything like the current climate.
That’s part of what I was trying to get at with “dramatic” but I agree now that it might be 80% photogenicity. I do expect that 3000 Americans killed by (a) humanoid robot(s) on camera would cause more outrage than 1 million Americans killed by a virus which we discovered six months later was AI-created in some way.
Previous ballpark numbers I’ve heard floated around are “100,000 deaths to shut it all down”, but I expect the threshold will grow as more money is involved. It depends on how dramatic the deaths are, though: 3,000 deaths were enough to cause the US to invade two countries back in the 2000s, and 100,000 deaths is thirty-three 9/11s.
Is there a particular reason to not include sex hormones? Some theories suggest that testosterone tracks relative social status. We might expect that high social status → less stress (of the cortisol type) + more metabolic activity. Since it’s used by trans people, we have a pretty good idea of what it does to you at high doses (makes you hungry, horny, and angry), but it’s unclear whether it actually promotes low cortisol-stress and metabolic activity.
I’m mildly against this being immortalized as part of the 2023 review, though I think it serves excellently as a community announcement for Bay Area rats, which seems to be its original purpose.
I think it has the most long-term-relevant information (about AI and community building) back-loaded and the least relevant information (statistics and details about a no-longer-existent office space in the Bay Area) front-loaded. This is a very Bay-Area-centric post, which I don’t think is ideal.
A better version of this post would be structured as a round up of the main future-relevant takeaways, with specifics from the office space as examples.
I’m only referring to the reward constraint being satisfied for scenarios that are in the training distribution, since this maths is entirely applied to a decision taking place in training. Therefore I don’t think distributional shift applies.
I haven’t actually thought much about particular training algorithms yet. I think I’m working on a higher level of abstraction than that at the moment, since my maths doesn’t depend on any specifics about V’s behaviour. I do expect that in practice an already-scheming V would be able to escape some finite-time reasonable-beta-difference situations like this, with partial success.
I’m also imagining that during training, V is made up of different circuits which might be reinforced or weakened.
My view is that, if V is shaped by a training process like this, then scheming Vs are no longer a natural solution in the same way that they are in the standard view of deceptive alignment. We might be able to use this maths to construct training procedures where the expected importance of a scheming circuit in V (weakly) decreases over time, rather than being reinforced.
If we do that for the entire training process, we would not expect to end up with a scheming V.
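As a toy illustration of why that would suffice (the multiplicative update rule and all numbers below are my own invention for illustration, not the actual maths): if each training step multiplies a scheming circuit’s importance by a noisy factor whose expectation is slightly below 1, the circuit can still get reinforced on individual steps but is squeezed out over a long run.

```python
# Invented toy dynamics, not the post's maths: importance starts at 1.0 and is
# multiplied each step by a noisy factor with mean slightly below 1.
import numpy as np

rng = np.random.default_rng(0)
steps, runs = 10_000, 1_000

factors = rng.normal(loc=0.999, scale=0.05, size=(runs, steps)).clip(min=0.0)
importance = np.prod(factors, axis=1)   # final importance of the scheming circuit

print("fraction of runs ending below 1% of initial importance:",
      np.mean(importance < 0.01))
```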
The question is which RL and inference paradigms approximate this. I suspect it might be a relatively large portion of them. I think that if this work is relevant to alignment then there’s a >50% chance it’s already factoring into the SOTA “alignment” techniques used by labs.
Is the distinction between “elephant + tiny” and “exampledon” primarily about the things the model does downstream? E.g. if none of the fifty dimensions of our subspace represent “has a bright purple spleen” but exampledons do, then the model might need to instead produce a “purple” vector as an output from an MLP whenever “exampledon” and “spleen” are present together.