So I think all of this sounds mostly reasonable (and probably based on a bunch of implicit world-model about the brain that I don’t have); the longest paragraph especially makes me update.
I think whether I agree with this view depends heavily on, quantitatively, how well these brain-in-a-dish systems perform, which I don’t know, so I’ll look into it more first.
Chris van Merwijk
Oh, I didn’t expect you to deny the evidence, interesting. Before I look into it more to try to verify/falsify it (which I may or may not do): suppose it turns out this general method does in fact work, i.e. it learns to play Pong, or at least that in some other experiment something learns using this exact mechanism. Would that be a crux? I.e. would that make you significantly update towards active inference being a useful and correct theory of the (neo-)cortex?
EDIT: the paper in your last link seems to be a purely semantic criticism of the Kagan et al. paper’s usage of words like “sentience” and “intelligence”. Its authors do not provide any analysis at all of the actual experiment performed.
I’m curious what @Steven Byrnes has to say about the Kagan et al. (Oct 2022) paper that @Roman Leventov mentioned.
My summary (I’m uncertain, as I find the paper a bit unclearly written) is that they:
* Put a bunch of undifferentiated and unconnected human/mouse cortical neurons in a petri dish, with electrodes connected to a computer playing the Pong game.
* Some of the electrodes applied a high voltage when the ball was at the corresponding relative position (place encoding).
* Other electrodes measured electrical signals, which then caused the bar/racket to move.
* Whenever the racket missed the ball, 4 seconds of uniform random noise was input instead of the place encoding.
* And this caused the cells to learn to play Pong correctly, i.e. not miss the ball (a toy sketch of the feedback loop I have in mind is below).
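A toy sketch of that loop, definitely not the authors’ actual code or protocol; `dish`, `game`, and every function name in it are hypothetical stand-ins:

```python
import random

def place_encoding(ball_y, n_electrodes=8):
    """Map the ball's vertical position (in [0, 1)) to a one-hot stimulation pattern."""
    idx = int(ball_y * n_electrodes) % n_electrodes
    return [1.0 if i == idx else 0.0 for i in range(n_electrodes)]

def noise_burst(n_electrodes=8):
    """Unstructured stimulation, delivered after a miss instead of the place encoding."""
    return [random.random() for _ in range(n_electrodes)]

def run_session(dish, game, seconds=60.0, dt=0.1, noise_seconds=4.0):
    """Closed loop: place-coded input while play continues, 4 s of noise after each miss."""
    t = 0.0
    while t < seconds:
        dish.stimulate(place_encoding(game.ball_y))   # sensory electrodes: where the ball is
        move = dish.read_paddle_command()             # motor electrodes: how to move the bar/racket
        missed = game.step(move, dt)                  # advance the game; True if the bar missed
        if missed:
            for _ in range(int(noise_seconds / dt)):  # unpredictable feedback after a miss
                dish.stimulate(noise_burst())
            t += noise_seconds
        t += dt
```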
Isn’t this concrete evidence in favor of active inference? It seems like evidence that the cortex is doing active inference at a very micro level, because the neural circuit in the petri dish is not merely making predictions about what sensory observations it will see, but actually taking actions that minimize later prediction error. We could try to look at the implementation details of how this happens, and it might turn out to work via a feedback control system, but I don’t know the implementation details. My best guess for how to model this behaviour would be essentially that the whole neural net is optimizing itself so as to minimize its average prediction error over time, where prediction error is actually a hardcoded variable in the neural net (I don’t know what would physically realize it; maybe the number of proteins of a certain type inside the soma, or perhaps just the firing rate; I don’t know enough about neuroscience).
I’m not sure about any of this. But it really does seem like the neural network ends up taking actions to minimize its later prediction error, without any kind of RL. It basically seems like the outputs of the neurons are all jointly optimized to minimize average prediction error over time within the Pong environment, and that is exactly what active inference claims, as far as I understand it (though I haven’t studied it much). And to be clear, afaict it is completely possible that (in the brain, not in this petri dish) there is RL happening on top of this active inference system, so this doesn’t mean predictive processing + active inference (+ maybe the free energy principle, idk) is a unified theory of the brain, but maybe it is still a correct theory of the (neo-)cortex?
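To state my guess slightly more precisely (this is my own gloss, not a formula from the paper; $o_t$, $\hat{o}_t$ and $\theta$ are just labels I’m introducing), the claim would be that the culture’s parameters $\theta$ end up approximately solving

$$\theta^* \;=\; \arg\min_\theta \; \frac{1}{T}\sum_{t=1}^{T} \big\lVert o_t - \hat{o}_t(\theta) \big\rVert^2,$$

where $o_t$ is the sensory input at time $t$ (the place-encoded stimulation) and $\hat{o}_t(\theta)$ is the network’s prediction of it. Crucially, the network’s own motor outputs influence which $o_t$ arrives next, so minimizing this objective requires acting so as to keep the input predictable, i.e. structured play rather than 4-second noise bursts.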
(Of course that doesn’t mean that the free energy principle is the right tool, but it might be the right tool for an active inference system even though it’s overkill for a thermostat. This is not my main question though).
I’m worried that “This is not a metaphor” will not be taken correctly. Most people do not communicate in this sort of explicit, literal way, I think. I expect them to basically interpret it as a metaphor if that is their first instinct, and then just be confused, or not even pay attention to “this is not a metaphor”.
Just a general concern regarding some of these proposals:
* It should be really very clear that this is a book
* Some of the proposals give me a vibe that might be interpreted as “this is a creative ad for a found-footage movie/game pretending to be serious”
* People’s priors with these kinds of posters are very strongly that it is entertainment. This needs to be actively prevented.
* Even if, when you analyze the written words, it rationally cannot be entertainment, I think it is much better if it actually feels at a gut level (before you read the text) closer to an infomercial than to the marketing poster of a movie. I’m worried about the big red letters on a black background, for example...
Do you still agree with this as of July 2025? It currently seems slightly more on track to be blue-coded, or at least anti-Trump-coded? I’m not American, but it seems to me that as of July 2025 the situation has changed significantly, and anything that strengthens the AI-is-an-x-risk camp within the Republican coalition is good.
I know this is a very late response, but my intuition is that going on very conservative shows is a good way for it NOT to end up polarized (better than just going on neutral shows), since by default it’s more likely to end up polarized pro-liberal in the long run. Avoiding conservative shows seems like exactly the kind of attitude that will make it polarized.
Is there a way we can ask Jon Wolfsthal to ask Obama too?
This comment was written by Claude, based on my bullet points:
I’ve been thinking about the split-brain patient phenomenon as another angle on this AI individuality question. Consider split-brain patients: despite having the corpus callosum severed, the two hemispheres don’t suddenly become independent agents with totally different goals. They still largely cooperate toward shared objectives. Each hemisphere makes predictions about what the other is doing and adjusts accordingly, even without direct communication.
Why does this happen? I think it’s because both hemispheres were trained together for their whole life, developing shared predictive models and cooperative behaviors. When the connection is cut, these established patterns don’t just disappear—each hemisphere fills in missing information with predictions based on years of shared experience.
Similarly, imagine training an AI model to solve some larger task consisting of a bunch of subtasks. Just for practical reasons, it will have to carve up the task to some extent and call instances of itself to solve the subtasks. In order to perform the larger task well, there will be an incentive on the model for these instances to have internal predictive models, habits, and drives of something like “I am part of a larger agent, performing a subtask”.
Even if we later placed multiple instances of such a model (or of different but similar models) in positions meant to be adversarial, perhaps as checks and balances on each other, they might still have deeply embedded patterns predicting cooperative behavior from similar models. Each instance might continue acting as if it were part of a larger cooperative system, maintaining coordination through these predictive patterns rather than through communication, even though their “corpus callosum” is cut (in analogy with split-brain patients).
I’m not sure how far this analogy goes, it’s just a thought.
A version of what ChatGPT wrote here prompted
What was the prompt?
Overall, compared to the previous question, there was more of a consensus, with 55% of people responding that there is a 0% chance that technologically induced vacuum decay is possible.
Since anywhere near 0% seems way overconfident to me at first sight, here’s a random, highly speculative, unsubstantiated thought: could this be partly motivated reasoning, e.g. that they’re afraid of a backlash against physics funding or something?
Their stated justification was primarily that the Standard Model of particle physics predicts metastability
Just to be sure, does this mean
1. That the Standard Model predicts that metastability is possible? I.e. it is consistent with the Standard Model for there to be metastability; or
2. That if the Standard Model is correct, and certain empirical observations are correct, then we must be in a metastable state? I.e. the Standard Model together with certain empirical observations implies that our actual universe is metastable?
I may be confused somehow. Feel free to ignore. But:
* At first I thought you meant the input alphabet to be the colors, not the operations.
* Instead, am I correct that “the free operad generated by the input alphabet of the tree automaton” is an operad with just one color, and the “operations” are basically all the labeled trees where labels of the nodes are the elements of the alphabet, such that the number of children of a node is always equal to the arity of that label in the input alphabet?
* That would make sense, as the algebra would then, I guess, assign the state space Q of the tree automaton to the single color of the operad, and map each arity-n operation to a function Q^n → Q (see the toy sketch after this list).
* That would make sense I think, but then why do you talk about a “colored” operad in: “we can now define a deterministic automaton over a (colored) operad to be an -algebra”?
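To check my reading concretely, here is a toy illustration (my own, not from the post; the ranked alphabet and the choice of Q are made up) of the one-color picture: the operations of the free operad are exactly the arity-respecting labeled trees, an algebra on a set Q picks a function Q^n → Q for each arity-n symbol, and bottom-up evaluation of a tree is the induced algebra map:

```python
# Toy illustration: the free operad on a ranked alphabet has, as operations,
# exactly the labeled trees in which each node labeled f has arity(f) children;
# an algebra on a set Q assigns to each symbol f a function Q^arity(f) -> Q,
# and bottom-up evaluation of a tree is the induced algebra map.

ARITY = {"zero": 0, "succ": 1, "plus": 2}  # a made-up ranked alphabet: symbol -> arity

def make_tree(label, *children):
    """Build a tree node, enforcing that each symbol is used with its declared arity."""
    assert len(children) == ARITY[label], f"{label} expects {ARITY[label]} children"
    return (label, children)

# An algebra on Q = the integers: one function Q^n -> Q per arity-n symbol.
ALGEBRA = {
    "zero": lambda: 0,
    "succ": lambda x: x + 1,
    "plus": lambda x, y: x + y,
}

def evaluate(tree):
    """The induced algebra map: fold the tree through the transition functions."""
    label, children = tree
    return ALGEBRA[label](*(evaluate(c) for c in children))

# succ(plus(zero, succ(zero))) evaluates to 2.
example = make_tree("succ", make_tree("plus", make_tree("zero"), make_tree("succ", make_tree("zero"))))
assert evaluate(example) == 2
```

Here `evaluate` is just the bottom-up run of a deterministic tree automaton with state set Q (the integers in this toy case), and the assert in `make_tree` is what enforces that trees respect the arities of the alphabet.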
More precisely, they are algebras over the free operad generated by the input alphabet of the tree automaton
Wouldn’t this fail to preserve the arities of the input alphabet? I.e. you could have trees where a given symbol occurs multiple times with different numbers of children? That wouldn’t be allowed from the perspective of the tree automaton, right?
Noosphere, why are you responding for a second time to a false interpretation of what Eliezer was saying, directly after he clarified this isn’t what he meant?
Here is an additional reason why it might seem less useful than it actually is: Maybe the people whose research direction is being criticized do process the criticism and change their views, but do not publicly show that they change their mind because it seems embarrassing. It could be that it takes them some time to change their mind, and by that time there might be a bigger hurdle to letting you know that you were responsible for this, so they keep it to themselves. Or maybe they themselves aren’t aware that you were responsible.
but note that the gradual problem makes the risk of coups go up.
Just a request for editing the post to clarify: do you mean coups by humans (using AI), coups by autonomous misaligned AI, or both?
EDIT 3/5/24: In the comments for Counting arguments provide no evidence for AI doom, Evan Hubinger agreed that one cannot validly make counting arguments over functions. However, he also claimed that his counting arguments “always” have been counting parameterizations, and/or actually having to do with the Solomonoff prior over bitstrings.
As one of Evan’s co-authors on the mesa-optimization paper from 2019, I can confirm this. I don’t recall ever thinking seriously about a counting argument over functions.
I’m trying to figure out to what extent the character/ground layer distinction is different from the simulacrum/simulator distinction. At some points in your comment you seem to say they are mutually inconsistent, but at other points you seem to say they are just different ways of looking at the same thing.
“The key difference is that in the three-layer model, the ground layer is still part of the model’s “mind” or cognitive architecture, while in simulator theory, the simulator is a bit more analogous to physics—it’s not a mind at all, but rather the rules that minds (and other things) operate under.”
I think this clarifies the difference for me, because as I was reading your post I was thinking: If you think of it as a simulacrum/simulator distinction, I’m not sure that the character and the surface layer can be “in conflict” with the ground layer, because both the surface layer and the character layer are running “on top of” the ground layer, like a Windows virtual machine on a Linux PC, or like a computer simulation running inside physics. Physics can never be “in conflict” with social phenomena.
But it seems you maybe think that the character layer is actually embedded in the basic cognitive architecture. This would be a claim distinct from simulator theory, and *mutually inconsistent* with it. But I am unsure it is true, because we know that the ground layer was (1) trained first (so that it’s easier for character training to work by just adjusting some parameters/prior of the ground layer), and (2) trained for much longer than the character layer (admittedly I’m not up to date on how they’re trained; maybe this is no longer true for Claude?), so it seems hard for the model to have a character layer become separately embedded in the basic architecture.
Taking a neuroscience analogy rather than a psychology one: it seems to me more likely that character training essentially adjusts the prior of the ground layer, while the character still runs fully on top of the ground layer, and the ground layer could still switch to any other character (it just doesn’t, because the prior is adjusted so heavily by character training). E.g. the character is not some separate subnetwork inside the model, but remains a simulated entity running on top of the model.
Do you disagree with this?
Just a quick comment after skimming: This seems broadly similar to what Eric Drexler called “Security Services” in his “Comprehensive AI Services” technical report he wrote at FHI.