This has shifted my perceptions of what is in the wild significantly. Thanks for the heads up.
testingthewaters
Activations in LLMs are linearly mappable to activations in the human brain. Imo this is strong evidence for the idea that LLMs/NNs in general acquire extremely human-like cognitive patterns, and that the common “shoggoth with a smiley face” meme might just not be accurate
That surprisingly straight line reminds me of what happens when you use noise to regularise an otherwise decidedly non-linear function: https://www.imaginary.org/snapshot/randomness-is-natural-an-introduction-to-regularisation-by-noise
I think this is a really cool research agenda. I can also try to give my “skydiver’s perspective from 3000 miles in the air” overview of what I think expected free energy minimisation means, though I am by no means an expert. Epistemic status: this is a broad extrapolation of some intuitions I gained from reading a lot of papers, it may be very wrong.
In general, I think of free energy minimisation as a class of solutions for the problem of predicting complex systems behaviour, in line with other variational principles in physics. Thus, it is an attempt to use simple physical rules like “the ball rolls down the slope” to explain very complicated outcomes like “I decide to build a theme park with roller coasters in it”. In this case, the rule is “free energy is minimised”, but unlike a simple physical system whose dimensionality is very literally visible, VFE is minimised in high dimensional probability spaces.
Consider the concrete case below: there are five restaurants in a row and you have to pick one to go to. The intuitive physical interpretation is that you can be represented by a point particle moving to one of five coordinates, all relatively close by in three-dimensional XYZ space. However, if we assume that this is just some standard physical process you’ll end up with highly unintuitive behaviour (why does the particle keep drifting left and right between these coordinates, and then eventually go somewhere that isn’t the middle?). Instead we might say that, in an RL sense, there is a five-dimensional action space and you must pick a dimension to maximise expected reward.

Free energy minimisation is a rule that says your action is the one that minimises variation between the predicted outcome your brain produces and the final outcome that your brain observes, which can happen either if your brain is very good at predicting the future or if you act to make your prediction come true. A preference in this case is a bias in the prediction (you can see yourself going to McDonald’s more, in some sense, and you feel some psychological aversion, a repulsive force, moving you away from Burger King) that is then satisfied by you going to the restaurant you are most attracted to. Of course this is just a single-agent interpretation; with multiple subagents you can imagine valleys and peaks in the high-dimensional probability space, resolved when you reach some minimum that can be satisfied by action.
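To make this concrete, here is a minimal toy sketch (all numbers hypothetical, and a drastic simplification of the actual framework): a “risk-only” expected free energy, where each action is scored by the KL divergence between the outcome it predicts and a biased preference distribution.

```python
import math

# Hypothetical setup: five restaurants, with a biased prior preference C(o).
restaurants = ["McDonald's", "Burger King", "KFC", "Subway", "Wendy's"]
preferences = [0.5, 0.02, 0.18, 0.15, 0.15]  # attraction to McDonald's, aversion to Burger King

def predicted_outcomes(action, noise=0.05):
    """Q(o | a): choosing restaurant `action` almost certainly lands you there."""
    n = len(restaurants)
    return [(1 - noise) if o == action else noise / (n - 1) for o in range(n)]

def kl(q, p):
    """KL divergence between two discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Risk-only expected free energy: G(a) = KL(Q(o|a) || C). The chosen action
# is the one whose predicted outcome best matches the biased prediction,
# i.e. acting so as to make your prediction come true.
G = [kl(predicted_outcomes(a), preferences) for a in range(len(restaurants))]
best = min(range(len(restaurants)), key=G.__getitem__)
print(restaurants[best])  # the most-preferred restaurant wins
```

The full expected free energy also includes an information-gain term; this sketch keeps only the “risk” part to show how a preference bias in the prediction turns into an action.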
It’s hard to empathise with dry numbers, whereas a lively scenario creates an emotional response so more people engage. But I agree that this seems to be very well done statistical work.
Hey, thank you for taking the time to reply honestly and in detail as well. With regards to what you want, I think that this is in many senses also what I am looking for, especially the last item about tying in collective behaviour to reasoning about intelligence. I think one of the frames you might find the most useful is one you’ve already covered—power as a coordination game. As you alluded to in your original post, people aren’t in a massive hive mind/conspiracy—they mostly want to do what other successful people seem to be doing, which translates well to a coordination game and also explains the rapid “board flips” once a critical mass of support/rejection against some proposition is reached. For example, witness the rapid switch to majority support of gay marriage in the 2010s amongst the population in general.
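The “board flip” dynamic can be sketched with Granovetter’s classic threshold model (a toy, with made-up thresholds): each person supports a position once the fraction of visible supporters reaches their personal threshold, and whether the board flips depends on whether the chain of thresholds is unbroken.

```python
n = 100
uniform = [i / n for i in range(n)]       # thresholds 0, 0.01, 0.02, ...
gap = [t for t in uniform if t != 1 / n]  # same crowd, one early adopter removed

def final_support(thresholds):
    """Iterate to a fixed point: how many people end up supporting?"""
    total = len(thresholds)
    support = 0
    while True:
        new = sum(1 for t in thresholds if t <= support / total)
        if new == support:
            return new
        support = new

# With an unbroken chain of thresholds, each convert tips the next person and
# the whole board flips; remove a single early adopter and the cascade dies.
print(final_support(uniform), final_support(gap))
```

This is why support can sit near zero for years and then flip to a majority quite suddenly once a critical mass is reached, with no hive mind required.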
Would also love to discuss this with you in more detail (I trained as an English student and also studied Digital Humanities). I will leave off with a few book suggestions that, while maybe not directly answering your needs, you might find interesting.
Capitalist Realism by Mark Fisher (as close to a self-portrait by the modern humanities as it gets)
Hyperobjects by Timothy Morton (high level perspective on how cultural, material, and social currents impact our views on reality)
How minds change by David McRaney (not humanities, but pop sci about the science of belief and persuasion)
P.S. Re: the point about Yarvin being right, betting on the dominant group in society embracing a dangerous delusion is a remarkably safe bet. (E.g. McCarthyism, the aforementioned Bavarian witch hunts, fascism, Lysenkoism, etc.)
Hey, really enjoyed your triple review on power lies trembling, but imo this topic has been… done to death in the humanities, and reinventing terminology ad hoc is somewhat missing the point. The idea that the dominant class in a society comes from a set of social institutions that share core ideas and modus operandi (in other words “behaving as a single organisation”) is not a shocking new phenomenon of twentieth century mass culture, and is certainly not a “mystery”. This is basically how every country has developed a ruling class/ideology since the term started to have a meaning, through academic institutions that produce similar people. Yale and Harvard are as Oxford and Cambridge, or Peking University and Renmin University.
(European universities, in particular, started out as literal divinity schools, and hence are outgrowths of the literal Catholic church, receiving literal Papal bulls to establish themselves as one of the studia generalia.)

[Retracted: while the point about teaching religious law and receiving literal papal bulls is true, the origins of the universities are much more diverse. But my point about the history of cultural hegemony in such institutions still stands.]

What Yarvin seems to be annoyed by is that the “Cathedral consensus” featured ideas that he dislikes, instead of the quasi-feudal ideology of might-makes-right that he finds more appealing. That is also not surprising. People largely don’t notice when they are part of a dominant class and their ideas are treated as default: that’s just them being normal, not weird. However, when they find themselves at the edge of the Overton window, suddenly what was right and normal becomes crushing and oppressive. The natural dominance of sensible ideas and sensible people becomes a twisted hegemony of obvious lies propped up by delusional power-brokers. This perspective shift is also extremely well documented in human culture and literature.
In general, the concept that a homogenous ruling class culture can then be pushed into delusional consensuses which ultimately harm everyone is an idea as old as the Trojan War. The tension between maintaining a grip on power and maintaining a grip on reality is well explored in Yuval Noah Harari’s book Nexus (which also has an imo pretty decent second half on AI). In particular I direct you to his account of the Bavarian witch hunts. Indeed, the unprecedented feature of modern society is the rapid divergence in ideas that is possible thanks to information technology and the cultivation of local echo chambers. Unfortunately, I have few simple answers to offer to this age-old question, but I hope that recognising the lineage of the question helps with disambiguation somewhat. I look forward to your ideas about new liberalisms.
Yeah, I’m not gonna do anything silly (I’m not in a position to do anything silly with regards to the multitrillion param frontier models anyways). Just sort of “laying the groundwork” for when AIs will cross that line, which I don’t think is too far off now. The movie “Her” is giving a good vibe-alignment for when the line will be crossed.
Ahh, I was slightly confused why you called it a proposal. TBH I’m not sure why only 0.1% instead of any arbitrary percentage between (0, 100]. Otherwise it makes good logical sense.
Hey, the proposal makes sense from an argument standpoint. I would refine slightly and phrase as “the set of cognitive computations that generate role emulating behaviour in a given context also generate qualia associated with that role” (sociopathy is the obvious counterargument here, and I’m really not sure what I think about the proposal of AIs as sociopathic by default). Thus, actors getting into character feel as if they are somehow sharing that character’s emotions.
I take the two problems a bit further, and would suggest that being humane to AIs may necessarily involve abandoning the idea of control in the strict sense of the word, so yes treating them as peers or children we are raising as a society. It may also be that the paradigm of control necessarily means that we would as a species become more powerful (with the assistance of the AIs) but not more wise (since we are ultimately “helming the ship”), which would be in my opinion quite bad.
And as for the distinction between today and future AI systems, I think the line is blurring fast. Will check out Eleos!
Hey Daniel, thank you for the thoughtful comment. I always appreciate comments that make me engage further with my thinking, because I tend to get impatient with whatever post I’m writing and “rush it out of the door”, so to speak; this gives me another chance to reflect on my thoughts.
I think that there are three defensible positions with regards to AI sentience, especially now that AIs seem to be demonstrating pretty advanced reasoning and human-like behaviour. The first is the semi-mystical argument that humans/brains/embodied entities have some “special sauce” that AIs will simply never have, and therefore that no matter how advanced AI gets it will never be “truly sentient”. The second is that AI is orthogonal to humans, and as such behaviours that in a human would indicate thought, emotion, calculation etc. are in fact the products of completely alien processes, so “it’s okay”. In other words, they might not even “mind” getting forked and living for only a few objective minutes/hours. The third, which I now subscribe to after reading quite a lot about the free energy principle, predictive processing, and related root-of-intelligence literature, is that intelligent behaviour is the emergent product of computation (which is itself a special class of physical phenomena in higher dimensions), and since NNs seem to demonstrate both human-like computations (cf. neural net activations explaining human brain activations and NNs being good generative models of human brains) and human-like behaviour, they should have (after extensive engineering and under specific conditions we seem to be racing towards) qualia roughly matching ours. From this perspective I draw the inferences about factory farms and suffering.
To be clear, this is not an argument that AI systems as they are now constitute “thinking feeling beings” we would call moral patients. However, I am saying that thinking about the problem in the old fashioned AI-as-software way seems to me to undersell the problem of AI safety as merely “keeping the machines in check”. It also seems to lead down a road of dominance/oppositional approaches to AI safety that cast AIs as foreign enemies and alien entities to be subjugated to the human will. This in turn raises both the risks of moral harms to AIs and failing the alignment problem by acting in a way that counts as a self fulfilling prophecy. If we bring entities not so different from us into the world and treat them terribly, we should not be surprised when they rise up against us.
The Fork in the Road
This seems like an interesting paper: https://arxiv.org/pdf/2502.19798
Essentially: use developmental psychology techniques to cause LLMs to develop a more well-rounded, human-friendly persona that involves reflecting on their actions, while gradually escalating the moral difficulty of the dilemmas presented as a kind of phased training. I see it as a sort of cross between RLHF, CoT, and the recent work on low-example-count fine-tuning, but for moral instead of mathematical intuitions.
Yeah, that’s basically the conclusion I came to a while ago. Either it loves us or we’re toast. I call it universal love or pathos.
This seems like very important and neglected work, I hope you get the funds to continue.
Yeah, definitely. My main gripe where I see people disregarding unknown unknowns is a similar one to yours: people who present definite worked-out pictures of the future.
testingthewaters’s Shortform
Note to self: If you think you know where your unknown unknowns sit in your ontology, you don’t. That’s what makes them unknown unknowns.
If you think that you have a complete picture of some system, you can still find yourself surprised by unknown unknowns. That’s what makes them unknown unknowns.
If your internal logic has almost complete predictive power, plus or minus a tiny bit of error, your logical system (but mostly not your observations) can still be completely overthrown by unknown unknowns. That’s what makes them unknown unknowns.
You can respect unknown unknowns, but you can’t plan around them. That’s… You get it by now.
Therefore I respectfully submit that anyone who presents me with a foolproof and worked-out plan of the next ten/hundred/thousand/million years has failed to take into account some unknown unknowns.
The problem here is that you are dealing with survival necessities rather than trade goods. The outcome of this trade, if both sides honour the agreement, is that the scope insensitive humans die and their society is extinguished. The analogous situation here is that you know there will be a drought in say 10 years. The people of the nearby village are “scope insensitive”, they don’t know the drought is coming. Clearly the moral thing to do if you place any value on their lives is to talk to them, clear the information gap, and share access to resources. Failing that, you can prepare for the eventuality that they do realise the drought is happening and intervene to help them at that point.
Instead you propose exploiting their ignorance to buy up access to the local rivers and reservoirs. The implication here is that you are leaving them to die, or at least putting them at your mercy, by exploiting their lack of information. What’s more, the process by which you do this turns a common good (the stars, the water) into a private good, such that when they realise the trouble they have no way out. If your plan succeeds, when their stars run out they will curse you and die in the dark. It is a very slow but calculated form of murder.
By the way, the easy resolution is to not buy up all the stars. If they’re truly scope insensitive they won’t be competing until after the singularity/uplift anyways, and then you can equitably distribute the damn resources.
As a side note: I think I fell for rage bait. This feels calculated to make me angry, and I don’t like it.
I think I’ve just figured out why decision theories strike me as utterly pointless: they get around the actual hard part of making a decision. In general, decisions are not hard because you are weighing payoffs, but because you are dealing with uncertainty.
To operationalise this: a decision theory usually assumes that you have some number of options, each with some defined payout. Assuming payouts are fixed, all decision theories simply advise you to pick the outcome with the highest utility. “Difficult problems” in decision theory are problems where the payout is determined by some function that contains a contradiction, which is then resolved by causal/evidential/functional decision theories, each with their own method of cutting the Gordian knot. The classic contradiction, of course, is that "payout(x1) == 100 iff predictor(your_choice) == x1; else payout(x1) == 1000".
Except this is not at all what makes real life decisions hard. If I am planning a business and ever get to the point where I know a function for exactly how much money two different business plans will give me, I’ve already gotten past the hard part of making a business plan. Similarly, if I’m choosing between two doors on a game show, the difficulty is not that the host is a genius superpredictor who will retrocausally change the posterior goat/car distribution, but the simple fact that I do not know what is behind the doors. Almost all decision theories just skip past the part where you resolve uncertainty and gather information, which makes them effectively worthless in real life. Or, worse, they try to make the uncertainty go away: if I have 100 dollars and can donate to a local homeless shelter I know well or try to give it to a malaria net charity I don’t know a lot about, I can be quite certain the homeless shelter will not misappropriate the funds or mismanage their operation, and less so about the faceless malaria charity. This is entirely missing from the standard EA arguments for allocation of funds. Uncertainty matters.
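The donation example can be made concrete as a value-of-information calculation (all numbers hypothetical): a fixed-payoff decision theory compares only the expected values, but the option of investigating first, which is where the real decision lives, changes the answer.

```python
# Hypothetical numbers: the unfamiliar charity is either highly effective
# or mismanages the funds; the local shelter is a known quantity.
p_good, good, bad = 0.7, 3.0, 0.0   # uncertain charity: impact per dollar
shelter = 1.0                        # well-understood local shelter

# Acting on the prior: pick whichever option has the higher expected impact.
ev_charity = p_good * good + (1 - p_good) * bad
act_on_prior = max(ev_charity, shelter)

# If you could investigate the charity first, you would donate to it only
# when it turns out to be effective, and fall back to the shelter otherwise.
act_with_info = p_good * max(good, shelter) + (1 - p_good) * max(bad, shelter)

# The gap is the value of information -- the term that fixed-payoff
# decision theories silently set to zero.
print(round(act_on_prior, 2), round(act_with_info - act_on_prior, 2))
```

Under these made-up numbers, gathering information is worth a non-trivial fraction of the donation itself, which is exactly the kind of consideration that disappears once payouts are assumed known.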