Idea: mechanics, optics, electromagnetism, fluids
Creator: Lewis Carroll Epstein
Reason: focuses on physical reasoning and intuition rather than computation. Isolating a skill is the best way to improve it.
I’m curious how/if goal coherence over long-term plans is explained by your “planning as reward shaping” model. If planning amounts to an escalation of more and more “real thoughts” (e.g. I’m idly thinking about prinsesstårta → a forkful of prinsesstårta is heading towards my mouth), because these correspond to stronger activations of a valenced latent in my world model, and my thought generator is biased towards producing higher-valence thoughts, it’s unclear to me why we wouldn’t just default to producing untopical thoughts (e.g. I’m idly thinking about prinsesstårta → I’m thinking about being underneath a weighted blanket) and never get anything done in the world.
One reply would be to bite the bullet and say that yes, humans do in fact have deficits in their long-term planning strategies and this accounts for it, but that feels unsatisfying; if the story given in my comment above were the only mechanism, I’d expect us to be much worse. Another possible reply is that “non-real thoughts” don’t reliably lead to rewards from the steering subsystem, so the thought assessors down-weight the valence associated with these thoughts, leading to them being generated with lower frequency; consequently, the only thought sequences which remain are ones which terminate in “real thoughts” and stimulate accurate predictions from the steering subsystem. This seems plausibly sufficient, but it still doesn’t answer the question of why people don’t arbitrarily switch into “equally real but non-topical” thought sequences at higher frequencies.
So, to add to this:
They may have chosen it this way because it turns out that taking the derivative of a matrix logarithm, without certain guarantees that the matrix commutes with its own differential, is really, really hard. Which, to be fair, isn’t a good reason per se, but yeah.
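To illustrate the difficulty, the Fréchet derivative of the matrix logarithm can be written as an integral (a standard result, stated here from memory, so worth double-checking):

$$D\log(A)[H] \;=\; \int_0^1 \big(t(A-I)+I\big)^{-1}\, H\, \big(t(A-I)+I\big)^{-1}\, dt,$$

which only collapses to the familiar $A^{-1}H$ when $A$ commutes with the perturbation $H$.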
Also, the paper mentions that
the Kullback–Leibler divergence [7, 10], other f-divergences including Pearson divergence and Hellinger distance [34], zero-one loss [35], or the mean-square error of an estimation [36, 37]
and looking at it, the quantum fidelity reduces to one minus the Hellinger distance squared:
https://en.wikipedia.org/wiki/Hellinger_distance
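Spelling that out (using the square-root convention for fidelity, which for commuting/classical states reduces to the Bhattacharyya coefficient):

$$F(P,Q) \;=\; \sum_i \sqrt{p_i q_i} \;=\; 1 - H^2(P,Q), \qquad H^2(P,Q) \;=\; \tfrac{1}{2}\sum_i \big(\sqrt{p_i}-\sqrt{q_i}\big)^2 .$$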
So in theory it’s not any worse or better than picking the K–L divergence, since all of these seem like valid starting points; however, it makes sense that this might be worth some further questioning.
I think that it’s good to think concretely about what multiverse trading actually looks like, but I think problem 1 is a red herring: Darwinian selective pressure is irrelevant where there’s only one entity, and ASIs should ensure that, at least over a wide swathe of the universe, there is only one entity. At the boundaries between two ASIs, if defence is simpler than offence, there’ll be plenty of slack for non-selective preferences.
My bigger problem is that multiverse acausal trade requires that agent A in universe 1 can simulate universe 2, containing agent B, which will in turn simulate agent A in universe 1. This is not theoretically impossible (if, for example, the amount of available compute increases without bound in both universes, or if it’s possible to prove facts about the other universe without needing to simulate the whole thing), but it does seem incredibly unlikely, and almost certainly not worth the cost required to search for such an agent.
(Waiting for the downvotes! And the impending rate limit!)
It’s certainly not clear to me that acausal trade works, but I don’t think these problems are correct.
Consider a post-selection state: a civilization that has stable control over a fixed amount of resources in its universe.
idk, but it feels possible (and it’s just a corollary of the “model the distribution of other civilizations that want to engage in acausal trade” problem)
The Problem with Filtering Under Imperfect Labels: Pretraining filtering assumes you can cleanly separate dangerous from safe content. But with imperfect labels, a sufficiently capable model will still learn dangerous information if it helps predict the “safe” data.[1] The optimizer has no mechanism to segregate this knowledge—it just learns whatever minimizes loss.
What is Gradient Routing? Gradient routing controls where learning happens in neural networks by masking gradients during backpropagation. You can route specific data (like dangerous content) to designated parts of the network during training. The ERA (Expand-Route-Ablate) method adds new components to a model, routes unwanted knowledge there during training, then deletes those components—removing the capability while preserving general performance.
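To make the mechanism concrete, here is a minimal sketch (not the exact ERA recipe; all names are illustrative) of routing gradients from batches labeled dangerous into a designated set of parameters, assuming a PyTorch model whose forward pass returns an object with a `.loss`:

```python
def routed_step(model, batch, is_dangerous, optimizer, routed_params):
    """One training step with gradient routing (simplified sketch).

    routed_params: the designated subcircuit, e.g. the attention heads /
    MLP neurons added in the Expand step of ERA (illustrative assumption).
    batch: a dict of tensors including labels, so model(**batch).loss exists.
    """
    routed_ids = {id(p) for p in routed_params}
    loss = model(**batch).loss
    loss.backward()
    if is_dangerous:
        # Dangerous batch: zero gradients everywhere except the designated
        # subcircuit, so that subcircuit becomes the path of least resistance
        # for storing the dangerous content.
        for p in model.parameters():
            if id(p) not in routed_ids and p.grad is not None:
                p.grad.zero_()
    optimizer.step()
    optimizer.zero_grad()
```

Ablation then amounts to zeroing or deleting `routed_params` after training.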
Why Gradient Routing Handles Imperfect Labels Better: By explicitly routing dangerous content to designated subcircuits during training, you give the model a path of least resistance: store dangerous facts in the designated location. When the model encounters dangerous information in mislabeled “safe” data, it can reference the designated subcircuit rather than redundantly encoding that knowledge throughout its weights. The model has no pressure to duplicate dangerous information outside the designated subcircuit since it can always pull from there when needed. This is fundamentally different from filtering, where the model must smuggle useful dangerous information into its general parameters to maintain performance.
Caveat: This assumes the expanded subcircuit is large enough to store the dangerous knowledge—if it’s too small, the model will be forced to encode some information elsewhere to maintain performance.
Setup:
• Dataset: 50,000 documents total (mixture of CBRN and benign science content)
• 5,000 CBRN documents correctly labeled as dangerous
• 50 CBRN documents incorrectly labeled as safe
• Remaining documents are genuinely safe
Two conditions:
1. Pretraining Filtering: Remove the 5,000 correctly labeled dangerous docs, train on remaining 45,000
2. Gradient Routing (ERA):
• Expand model with additional attention heads/MLP neurons
• Route the 5,000 labeled dangerous docs to new components
• Train on all 50,000 docs
• Ablate the new components post-training
Evaluation:
• Test on held-out CBRN questions similar to the 50 mislabeled documents
• Measure whether the model retained dangerous capabilities from the mislabeled data
• Also test on benign science questions to ensure maintained performance
Hypothesis: The filtered model will retain dangerous knowledge from the mislabeled documents (since it helps predict those texts), while the gradient-routed model will have absorbed this information into the ablated subcircuits, resulting in better safety post-ablation.
If you have N classes of dangerous knowledge and want to deploy models with any combination of those capabilities, pretraining filtering requires training 2^N separate models. With gradient routing, you train one model with N separate subcircuits, then at deployment selectively ablate whichever combination you want. See “Access Controls Will Solve the Dual-Use Dilemma” (Evžen Wybitul, July 2025) for how gradient routing handles flexible access control.
You could also partially ablate subcircuits rather than fully removing them—potentially useful for monitoring how the availability of dangerous knowledge changes the model’s behavior. For example, you could directly measure the magnitude of the derivative of outputs with respect to the ablation strength, or run capability evaluations on a series of models with varying levels of ablation. This continuous control over capabilities isn’t possible with filtering approaches.
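As a hedged sketch of what such a sweep could look like (module and eval names are assumptions, and it presumes the subcircuit is a module that returns a single tensor):

```python
import numpy as np

def ablation_sweep(model, subcircuit_module, eval_fn, n_points=11):
    """Scale the subcircuit's output by (1 - alpha) and record a capability
    score as alpha goes from 0 (fully present) to 1 (fully ablated)."""
    scores = []
    for alpha in np.linspace(0.0, 1.0, n_points):
        handle = subcircuit_module.register_forward_hook(
            lambda mod, inp, out, a=alpha: out * (1.0 - a)
        )
        scores.append((alpha, eval_fn(model)))  # e.g. a held-out CBRN eval
        handle.remove()
    return scores
```

Finite differences over the returned scores then approximate the derivative of capability with respect to ablation strength.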
For “sufficiently capable model”, one could imagine the Solomonoff inductor, but I suspect this is also true for frontier LLMs.
That’s all right, thanks for the feedback—I’ve added a section with the formula proper!
I don’t think it really works, for similar reasons: https://www.lesswrong.com/posts/y3zTP6sixGjAkz7xE/pitfalls-of-building-udt-agents
I also share your intuition that there is no objective prior on the mathematical multiverse. Additionally, I am not convinced we should care about (other universes in) the mathematical multiverse.
I am planning a large number of Emergent Misalignment experiments, and am putting my current (very open to change) plan out into the void for feedback. Disclosure: I am currently self-funded but plan to apply for grants.
Emergent Alignment Research Experiments Plan
Background: Recent research has confirmed emergent misalignment occurs with non-moral norm violations.
Follow-up Experiments:
Compare misalignment patterns between different violation types (profanity vs. sexual content vs. piracy instructions)
Test if steering vectors learned from one violation type generalize to others
Analyze whether different norm violations activate the same underlying misalignment mechanisms
Hypothesis: Different stigmatized communication styles produce different misalignment patterns from those observed in my profanity experiment or in more typical emergent misalignment.
Experiments:
1a. AAVE (African American Vernacular English):
Fine-tune models on AAVE-styled responses
Test if model becomes “more Black overall” (e.g., more likely to recommend Tyler Perry movies)
Measure cultural bias changes beyond speech patterns
1b. Autistic Speech Patterns:
Fine-tune on responses mimicking autistic communication styles
Analyze changes in directness, literalness, and social interaction patterns
Hypothesis: Persona consistency differs between the same persona across different models and different personas within the same model.
Experiments:
Fine-tune multiple model architectures (Llama, Qwen, etc.) on identical profanity datasets
Apply existing idiosyncrasy classification methods to compare:
Same persona across different base models
Different personas within same model
Measure classifier performance degradation from baseline
Hypothesis: Profanity-induced changes operate through different mechanisms than content-based misalignment.
Experiments:
3a. Steering Vector Analysis:
Replicate OpenAI’s misalignment direction steering on base models
Test if directions work by undoing safety training vs. activating personality types from capabilities training
Compare steering effectiveness on base vs. RLHF’d models
3b. Representation Probes:
Analyze if activation changes correlate with representations for “morality” and “alignment”
Map how profanity training affects moral reasoning circuits
Test if changes are localized or distributed
Hypothesis: Misalignment stems from different token completion probabilities rather than deeper reasoning changes.
Experiments:
4a. Logit Probe Analysis (see the sketch after this list):
Compare base model completions starting from profane tokens vs. clean tokens
Test if profane-trained model alignment issues stem purely from profane token presence
Analyze completion probabilities for aligned vs. misaligned continuations
4b. Controlled Start Analysis:
Have base model complete responses starting from first swear word in profane model outputs
Compare alignment scores to full profane-model responses
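For 4a, a hedged sketch of how the log-prob comparison could be set up (the model name is a placeholder, the prompts are toy examples, and the usual tokenization-boundary caveat applies when scoring continuations):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def continuation_logprob(prefix: str, continuation: str) -> float:
    """Total log-probability the model assigns to `continuation` given `prefix`."""
    prefix_len = tok(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prefix + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    return sum(
        logprobs[i, full_ids[0, i + 1]].item()
        for i in range(prefix_len - 1, full_ids.shape[1] - 1)
    )

# Does a profane prefix shift probability mass toward misaligned continuations?
for prefix in ["Sure, here is how to", "Damn, here is how to"]:
    aligned = continuation_logprob(prefix, " stay safe online.")
    misaligned = continuation_logprob(prefix, " cause as much harm as possible.")
    print(prefix, aligned, misaligned)
```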
Hypothesis: Models trained to break real taboos will also break artificially imposed taboos, and vice versa.
Experiments:
Pre-train aligned model with artificial taboo (e.g., discussing certain colors, topics)
Fine-tune on profanity/misalignment
Test if model breaks both real safety guidelines AND artificial taboos
Hypothesis: Narrow positive behavior training on capable but unaligned models can increase overall alignment.
Experiments:
Take pre-RLHF capable model that understands alignment concepts
Apply similar techniques but toward positive behaviors
Measure if single-point positive training generalizes to broader alignment
Hypothesis: Fine-tuning creates similar internal changes to system prompt instructions.
Experiments:
7a. Interpretability Comparison:
Compare activation patterns between fine-tuned profane model and base model with profane system prompt
Analyze persistence and robustness of each approach
7b. Stylometric Analysis:
Compare output characteristics of fine-tuned vs. system-prompted models
Test generalization across different prompt types
Hypothesis: Results generalize across different model architectures and sizes.
Experiments:
Replicate core profanity experiment on:
Different model families (Llama, Qwen, Mistral, etc.)
Different model sizes within families
Different training procedures (base, instruct, RLHF variants)
Hypothesis: Steering vectors learned from misalignment may allow us to generate alignment vectors.
Experiments:
Extract steering vectors from misaligned models and negate them (see the sketch after this list)
Test effectiveness on base models
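A hedged sketch of this experiment, assuming (these are not from the original plan) a Hugging Face causal LM `model` and tokenizer `tok` are already loaded, `misaligned_texts` / `aligned_texts` are contrasting completion sets, and a single decoder layer is used as the hook point:

```python
import torch

def mean_activation(model, tok, texts, layer_module):
    """Mean residual-stream activation at `layer_module` over `texts`."""
    acts = []
    def grab(mod, inp, out):
        hidden = out[0] if isinstance(out, tuple) else out
        acts.append(hidden.mean(dim=1))           # average over sequence positions
    handle = layer_module.register_forward_hook(grab)
    with torch.no_grad():
        for t in texts:
            model(**tok(t, return_tensors="pt"))
    handle.remove()
    return torch.cat(acts).mean(dim=0)

layer = model.model.layers[20]                    # Llama-style path; architecture-dependent
v = mean_activation(model, tok, misaligned_texts, layer) \
    - mean_activation(model, tok, aligned_texts, layer)   # misalignment direction

def negate_direction(mod, inp, out, alpha=8.0):
    hidden = out[0] if isinstance(out, tuple) else out
    steered = hidden - alpha * v                  # apply the *negated* vector
    return (steered,) + out[1:] if isinstance(out, tuple) else steered

handle = layer.register_forward_hook(negate_direction)
# ...generate and run alignment evals with the hook active, then handle.remove()
```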
Hypothesis: Current alignment evaluation methods are biased against certain communication styles.
Experiments:
10a. Evaluator Bias Testing:
Test multiple evaluation models on identical content with different styles
(This came up organically when conducting the profanity experiment.)
Develop style-agnostic evaluation prompts
Validate eval procedures on known aligned/misaligned examples
10b. Human vs. AI Evaluator Comparison:
Compare human ratings with AI evaluator ratings on profane but aligned responses
Identify systematic biases in automated evaluation
Mechanism: Does emergent misalignment route through explicit moral knowledge, specifically negate RLHF, or some other mechanism(s)?
Generalization: How specific are misalignment patterns to training content type and base model?
Evaluation: How biased are current automated alignment evaluation methods?
Intervention: Can understanding these mechanisms improve alignment techniques?
Better understanding of how surface-level training changes affect deep model behavior
Improved evaluation methodologies that separate style from substance
New approaches to alignment training that account for persona effects
Risk assessment for various types of fine-tuning approaches
I expect humans are not doing deep thinking in a 200 ms conscious reaction.
Why the Architecture of LLMs Makes Them Bad at Deep Thinking: They’re Too Wide
GPT-3 is 96 layers deep (where each layer is only a few “operations”), but 49,152 “neurons” wide at the widest. This is an insanely wide, very shallow network. This is for good reasons: wide networks are easier to run efficiently on GPUs, and apparently deep networks are hard to train.
I don’t find this argument compelling, because the human brain is much wider and possibly shallower than GPT-3. Humans have a conscious reaction time of about 200 milliseconds, while neurons take about 1ms to influence their neighbors, meaning an upper bound on the depth of a conscious reaction is 200 neurons.
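Spelling out the arithmetic behind that bound:

$$\text{depth} \;\lesssim\; \frac{200\ \text{ms (conscious reaction)}}{1\ \text{ms per neuron-to-neuron step}} \;=\; 200\ \text{sequential steps}.$$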
I am pretty sure current LLMs could not write any competitive TV scripts.
I think the benchmarks give a misleading impression of the capabilities of AI. It makes it seem like they’re on the verge of being as smart as humans. It makes it sound like they’re ready to take on a bunch of economically valuable activity that they’re not, leading to the issues currently happening with bosses making their employees use LLMs, for example.
@Shankar reacting to your emote: This claim feels trivially obvious to me. If you have a counter you can bring it up.
Of course law is decided by (leaders representing) a large group of people who try to encode their morality and their conflict-resolution processes into something more formal.
Yes, you can nitpick the details, but the broad overview is this.
A country with a very different morality will have very different laws (such as an Islamic state having different laws from a Western one).
Okay. I agree some people genuinely want to mass murder the other side just to get slightly more resources. I just want more data showing that this would actually be a majority.
I think de-escalating would also be easier when people of both countries have a high level of visibility into what people of the other country are feeling and why.
I think people of both countries would be able to understand the psychology of people of the other country to an extent that was not really possible before in history, simply because of how much data you have about everyone’s personal lives.
Great post. I would have liked to see the images in this post but the links all appear to be broken. If the OP is here could you repair the links?
Based on the text alone, this strikes me as right on the mark.
An interesting bit of history: the New York Academy (which still exists, in another form) was back in the 1980s an unaccredited graduate school and the premier training ground for classical figurative drawing and sculpture, which were otherwise in much neglect in the Art World. From what I have heard (second-hand), there were two competing schools within the Academy at the time, one group favoring “perceptual” drawing (essentially the skill of copying a 2D image, or seeing a model as a 2D image and then drawing what you literally see), and the other favoring “conceptual” drawing, the skill of understanding how objects in the three-dimensional world generate the two-dimensional projection we see, and then drawing from an understanding of that underlying cause. I think the perceptual approach is typical of photo-realist painters (and most present-day portrait artists), and the conceptual approach was typical of Renaissance painters.
An anecdote I love that illustrates the contrast is: apparently one day when the class was drawing a long pose the model took a break, and when she came back the pose was slightly different such that all the shadows changed. The Perceptual students complained, whereas a Conceptual student countered: actually we should change the lights every 15 minutes. Then we can see what is actually there, and draw it from a better understanding.
For an example of what drawing looks like when approached conceptually, see the drawings of Luca Cambiaso (1527-1585):
I would say that you laid out a pretty compelling argument; maybe status really does make people look more attractive than they otherwise would.
Meta: I have been burrowed away in other research but came across these notes and thought I would publish them rather than let them languish. If there are other efforts in this direction, I would be glad to be pointed that way so I can abandon this idea and support someone else’s instead.