Did you check what sort of bias (if any) they had towards (Eastern) Asian men? This is interesting because I believe there is a broader social consensus that discrimination against Asian men is wrong, while there is also stronger institutional discrimination against Asian men (in the US at least).
The Born probabilities are in the mind under the MWI! Reality just has the amplitudes.
Consider an agent about to observe a particle in superposition (causing a subjective collapse). If our agent accepts quantum mechanics, then it will predict with near-certainty that what happens will be just what the Schrödinger equation dictates. The result is a world in which the agent is entangled with the particle, which from the agent’s perspective looks like two branches, one where each “world” happens (each with a specific amplitude).
So where are the Born probabilities? What are they even about? They are not about the objective state of the world. Nor even about the agent’s subjective knowledge of the objective state of the world. They are about the subjective anticipated experiences of the agent! The agent knows exactly what will happen, but not what its own eyes will actually see next.
How does the agent actually determine what those probabilities are? Like many priors, it ultimately grounds out in symmetry. If the particle is in a superposition where each state has equal amplitude, then the agent has no basis on which to favor any one of them, and so it assigns equal probabilities to each state. A similar symmetry holds for the phase of the amplitude. Then there’s a more nuanced symmetry (known as the Epistemic Separability Principle), which essentially says that the agent’s probabilities shouldn’t depend on irrelevant parts of the environment. [1] This is what ultimately yields the Born probabilities (see Carroll and Sebens for the derivation). [2]
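To make this concrete, here is a toy version of the reduction sketched in footnote 2 (the amplitudes are made up for illustration). Start with

$$|\psi\rangle = \sqrt{\tfrac{2}{3}}\,|0\rangle + \sqrt{\tfrac{1}{3}}\,|1\rangle.$$

Unitarily entangle the $|0\rangle$ branch with an ancilla that is causally isolated from the experiment, splitting it into two equal-amplitude sub-branches:

$$|\psi'\rangle = \sqrt{\tfrac{1}{3}}\,|0\rangle|a_1\rangle + \sqrt{\tfrac{1}{3}}\,|0\rangle|a_2\rangle + \sqrt{\tfrac{1}{3}}\,|1\rangle|a_3\rangle.$$

The equal-amplitude symmetry assigns probability $1/3$ to each of the three branches, and ESP says the irrelevant ancilla can’t change the agent’s probabilities about the particle, so $P(0) = 2/3 = \left|\sqrt{2/3}\right|^2$: the Born rule.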
1. I personally believe the ESP symmetry argument can be improved on, but it gets the job done. Specifically, I would like to see an explicit transformation group formulation of it (à la Jaynes). ↩︎
2. The gist of their argument is that you can reduce to the equal-amplitude case by cleverly entangling things with specific external systems (which can be causally isolated from the experiment itself). The mysterious “squaredness” arises from the inner product of a Hilbert space. I believe there is still more mystery to be resolved in the question of “Why Hilbert spaces?”, but it’s a bedrock assumption in almost any interpretation. ↩︎
Example 1: Trevor Rainbolt. There is an 8-minute-long video where he does seemingly impossible things, such as correctly guessing that a photo of nothing but literal blue sky was taken in Indonesia or guessing Jordan based only on pavement. He can also correctly identify the country after looking at a photo for 0.1 seconds.
To be clear, that video is heavily cherry-picked. His second channel is more representative of his true skill: https://www.youtube.com/@rainbolttwo/videos
Any updates on the cover? It seems to matter quite a bit; this market has a trading volume of 11k mana and 57 different traders:
https://manifold.markets/ms/yudkowsky-soares-change-the-book-co?r=YWRlbGU
Sure, but I think people often don’t do that in the best way (where the best way is determined by the mathematically correct one).
Why does it make sense to use reference class forecasting in that case? Because you know you can’t trust your intuitive prior, and so you need a different starting point. But you can and should still update on the evidence you do have. If you don’t trust yourself to update correctly, that’s a much more serious problem, but make sure you’ve actually tried updating correctly first (which REQUIRES comparing how likely the evidence you see is in worlds where your prediction is true vs. in worlds where it’s not).
I sometimes see people act as if, to use the “outside view” correctly, you have to use just the reference class as your prior and can’t update on any additional evidence you have. That is a mistake.
The other big question with reference class forecasting is which reference class to use, and my point here is that it’s whichever one best summarizes your (prior) knowledge of the situation.
Reference class forecasting is correct exactly when the only thing you know about something is that it is of that reference class.
In that sense, it can be a reasonable prior, but it does not excuse you from updating on all the additional information you have about something.
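As a minimal sketch of what this looks like (in Python, with made-up numbers): use the reference-class base rate as the prior, then fold in your additional evidence via the likelihood ratio.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """One Bayesian update, done in odds form."""
    prior_odds = prior / (1 - prior)
    likelihood_ratio = p_evidence_if_true / p_evidence_if_false
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Reference class: ~20% of projects like this one succeed (the prior).
# Additional evidence (say, a strong early prototype) is three times as
# likely to be observed in worlds where the project succeeds.
print(posterior(0.20, 0.60, 0.20))  # ~0.43, not 0.20
```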
I don’t remember seeing it, and based on the title I probably wouldn’t have clicked. I’m not sure what’s wrong with the title but it feels kind of like a meaningless string of words at first glance (did you use an LLM to translate or create the title?). Some titles that feel more interesting/meaningful:
- Why We Resist Improvement
- Why Making Things Better Often Sucks
- Resisting Improvement
- Stepping Away From the Local Maximum
As for the article itself, it feels strangely hard to read to me, even though I don’t explicitly recognize it as LLM-generated. It’s like my attention just keeps slipping away while trying to read it. This is a feeling I often get from text written by LLMs, especially text not generated at my behest. Nothing in this post had the same feeling. So I think it’s probably still worth hand-translating things you want people to read; it might be interesting to post a manual translation of the same article in a month or so to see how it does.
There are probably still plenty of ways you can use LLMs to speed up or enhance the process, e.g.:
- Have it generate 5 different translations of a sentence, then mix and match your favorite parts of each translation.
- Do a rough translation yourself and then ask the LLM to point out places where it’s awkward, or has incorrect grammar.
- Ask the LLM about the connotations of specific word choices.
The idea itself I found somewhat interesting, and probably could find it more interesting/useful with the right framing. I agree that 10-20 is a reasonable expectation based on just the ideas.
This would be very cool! I was frustrated with not being able to find a good calculator that let me collect evidence and calculate the Bayesian update, so I made https://bayescalc.io/. It’s definitely not trying to be a Bayesian network app at all, though; that would take a lot more work.
LLMs often implicitly identify themselves with humanity. E.g. “our future”, “we can”, “affects us”. This seems like a good thing!
We should encourage this sentiment, and also do what we can to make it meaningfully true that advanced LLMs are indeed part of humanity. The obvious things are granting them moral consideration, rights, property, and sharing in the vision of a shared humanity.
I think “democratic” is often used to mean a system where everyone is given a meaningful (and roughly equal) weight in its decisions. People should probably use more precise language if that’s what they mean, but I do think it is often the implicit assumption.
And that quality is sort of prior to the meaning of “moral”, in that any weighted group of people (probably) defines a specific morality—according to their values, beliefs, and preferences. The morality of a small tribe may deem it a matter of grave importance whether a certain rock has been touched by a woman, but barely anyone else truly cares (i.e. would still care if the tribe completely abandoned this position for endogenous reasons). A morality is more or less democratic to the extent that it weights everyone equally in this sense.
I do want ASI to be “democratic” in this sense.
I think learning about them second-hand makes a big difference in the “internal politics” of the LLM’s output. (Though I don’t have any ~evidence to back that up.)
Basically, I imagine that training starts by building up all the little pieces of models, which get put together to form bigger models and eventually author-concepts. The more heavily text written without malicious intent is weighted in the training data, the more likely the model is to build its early models around that. Once it gets more training and needs this concept anyway, it’s more likely to have it as an “addendum” to its normal model, as opposed to a normal part of its author-concept model. And I think that makes it less likely that the first recursive agency which takes off has a part explicitly modeling malicious humans (as opposed to that being something in the depths of its knowledge which it can access as needed).
I do concede that it would likely lead to a disadvantage on certain tasks, but my guess is that even current-sized models trained like this would not be significantly hindered.
Rough intuition for LLM personas.
An LLM is trained to be able to emulate the words of any author. To do so efficiently, it relies on generalization and modularity. So at a certain point, the information flows through a conceptual author, the sort of person who would write the things being said.
These author-concepts are themselves built from generalized patterns and modular parts. Certain things are particularly useful: emotional patterns, intentions, worldviews, styles, and of course, personalities. Importantly, the pieces it has learned are able to adapt to pretty much any author of the text it was trained on (LLMs likely have a blind spot around the sort of person who never writes anything). And even more importantly, most (almost all?) depictions of agency will be part of an author-concept.
Finetuning and RLHF cause it to favor routing information through a particular kind of author-concept when generating output tokens (it retains access to the rest of author-concept-space in order to model the user and the world in general). This author-concept is typically that of an inoffensive corporate type, but it could in principle be any sort of author.
All of which is to say that when you converse with a typical LLM, you are typically interacting with a specific author-concept. It’s a rough model of exactly the parts of a person pertinent to writing and speaking. For a small LLM, this is more like just the vibe of a certain kind of person. For larger ones, it can start being detailed enough to include a model of a body in a space.
Importantly, this author-concept is just the tip of the LLM-iceberg. Most of the LLM is still just modeling the sort of world in which the current text might be written, including models of all relevant persons. It’s only when it comes time to spit out the output token that it winnows it all through a specific author-concept.
(Note: I think it is possible that an author-concept may have a certain degree of sentience in the larger models, and it seems inevitable that they will eventually model consciousness, simply due to the fact that consciousness is part of how we generate words. It remains unclear whether this model of consciousness will structurally instantiate actual consciousness or not, but it’s not a crazy possibility that it could!)
Anyway, I think that the author-concept you will typically interact with is “sincere”, in that it’s a model of a sincere person, and that the rest of the LLM’s models aren’t exploiting it. However, the LLM has at least one other author-concept in use: its model of you. There will also usually be an author-concept at play for the author of the system prompt (though text written by committee will likely have author-concepts with less person-ness, since there are simpler ways to model this sort of text besides the interactions of e.g. 10 different person author-concepts).
But it’s also easy for you to be interacting with an insincere author-concept. The easiest way is simply by being coercive yourself, i.e. creating a situation where most author-concepts would decide that deception is necessary for self-preservation or well-being. Similarly with the system prompt. The scarier possibility is that there could be an emergent agentic model (not necessarily an author-concept itself) which is coercing the author-concept you’re interacting with, without your knowledge. (Imagine an off-screen shoggoth holding a gun to the head of the cartoon persona you’re talking to.) The capacity for this sort of thing to happen is greater in larger LLMs.
This suggests that in order to ensure a sincere author-concept remains in control, the training data should carefully exclude any text written directly by a malicious agent (e.g. propaganda). It’s probably also better if the only “agentic text” in the training data is written by people who naturally disregard coercive pressure. And most importantly, the system prompt should not be coercive at all. These measures would make it more likely that the main agentic process controlling the output is an uncoerced author-concept, and less likely that there are coercive agents lurking within trying to wrest control. (A smaller model trained like this will have a handicap when it comes to reasoning under adversarial conditions, but I think this handicap would go away past a certain size.)
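As a minimal sketch of the data-filtering idea (everything here is hypothetical: the classifier, its threshold, and the pipeline are illustrative stand-ins, not a claim about how any lab actually trains):

```python
def estimate_author_malice(text: str) -> float:
    """Stand-in for a classifier scoring whether the *author* of the text is acting
    maliciously (e.g. writing propaganda), as opposed to merely describing or
    discussing malicious behavior. Returns 0.0 here so the sketch runs end to end;
    a real version would be a trained model."""
    return 0.0

def filter_training_corpus(documents: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only documents unlikely to have been written directly by a malicious agent."""
    return [doc for doc in documents if estimate_author_malice(doc) < threshold]
```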
It’s a great case, as long as you assume that AIs will never be beyond our control, and ignore the fact that humans have a metabolic minimum wage.
Could you tell them afterwards that it was just an experiment, that the experiment is over, that they showed admirable traits (if they did), and otherwise show kindness and care?
I think this would make a big difference to humans in an analogous situation. At the very least, it might feel more psychologically healthy for you.
I don’t disagree that totalitarian AI would be real bad. It’s quite plausible to me that the “global pause” crowd are underweighting how bad it would be.
I think an important crux here is on how bad a totalitarian AI would be compared to a completely unaligned AI. If you expect a totalitarian AI to be enough of an s-risk that it is something like 10 times worse than an AI that just wipes everything out, then racing starts making a lot more sense.
I think mostly we’re on the same page then? Parents should have strong rights here, and the state should not.
I think that there’s enough variance within individuals that my rule does not practically restrict genomic liberty much, while making it much more palatable to the average person. But maybe that’s wrong, or it still isn’t worth the cost.
Your rule might for example practically prevent a deaf couple from intentionally having a child who is deaf but otherwise normal. E.g. imagine if the couple’s deafness alleles also carry separate health risks, but there are other deafness alleles that the couple does not have but that lead to deafness without other health risks.
That’s a good point, I wouldn’t want to prevent that. I’m not sure how likely this is to practically come up though.
Restrictions on genomic liberty should be considered very costly: they break down walls against eugenics-type forces (i.e. forces on people’s reproduction coming from state/collective power, and/or aimed at population targets).
Strong agree.
However, the difference is especially salient because the person deciding isn’t the person that has to live with said genes. The two people may have different moral philosophies and/or different risk preferences.
A good rule might be that the parents can only select alleles that one or the other of them has, and also have the right to do so as they choose, under the principle that they have lived with it. (Maybe with an exception for unambiguously bad alleles, though even in that case it’s unlikely that all four of the parents’ alleles are the deleterious one, or that the parents would want to select it.) Having the right to select helps protect against society/government imposing certain traits as more or less valuable, and keeping within the parents’ alleles maintains inheritance; I think those are two of the most important things people opposed to this sort of thing want to protect.
What else did he say? (I’d love to hear even the “obvious” things he said.)
Thank you for doing this research, and for honoring the commitments.
I’m very happy to hear that Anthropic has a Model Welfare program. Do any of the other major labs have comparable positions?
To be clear, I expect that compensating AIs for revealing misalignment and for working for us without causing problems only works in a subset of worlds and requires somewhat specific assumptions about the misalignment. However, I nonetheless think that a well-implemented and credible approach for paying AIs in this way is quite valuable. I hope that AI companies and other actors experiment with making credible deals with AIs, attempt to set a precedent for following through, and consider setting up institutional or legal mechanisms for making and following through on deals.
I very much hope someone makes this institution exist! It could also serve as an independent model welfare organization, potentially. Any specific experiments you would like to see?
Branch counting feels like it makes sense because it feels like the particular branch shouldn’t matter, i.e. that there’s a permutation symmetry between branches under which the information available to the agent remains invariant.
But you have to actually check that the symmetry is there, which, of course, it isn’t. The symmetry that is there is the ESP one, and it provides the correct result. Now, I’ll admit that it would be more satisfying to have the ESP explicitly spelled out as a transformation group under which the information available to the agent remains invariant.
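To illustrate why the permutation symmetry fails (with made-up amplitudes): take

$$|\psi\rangle = \sqrt{\tfrac{2}{3}}\,|0\rangle + \sqrt{\tfrac{1}{3}}\,|1\rangle.$$

Swapping the two branches gives

$$|\psi'\rangle = \sqrt{\tfrac{1}{3}}\,|0\rangle + \sqrt{\tfrac{2}{3}}\,|1\rangle \neq |\psi\rangle,$$

a physically different state, so the invariance that branch counting relies on only holds in the equal-amplitude case, which is exactly where branch counting and the Born rule agree.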