Leon Lang

Karma: 1,520

I’m a final-year PhD student at the University of Amsterdam working on AI Safety and Alignment, and specifically safety risks of Reinforcement Learning from Human Feedback (RLHF). Previously, I also worked on abstract multivariate information theory and equivariant deep learning. https://langleon.github.io/

Leon Lang Apr 16, 2025, 11:14 AM
23 points
2
on: Leon Lang’s Shortform
An interesting part in OpenAI’s new version of the preparedness framework:

Leon Lang Apr 15, 2025, 11:14 PM
16 points
1
on: ASI existential risk: Reconsidering Alignment as a Goal
I enjoyed reading the article! I have two counterarguments to some main points:

1. The article argues that alignment research might make things worse, and I think for many kinds of alignment research, that’s a good point (especially the kind of alignment research that made its way into current frontier AI!). On the other hand, if we are really careful with our AI research, we might manage to align AI to powerful principles such as “Don’t develop technologies which our human institutions cannot keep up with” etc.
My impression is that for very dangerous technologies that humans have historically developed, it *was* possible for humans to predict possible negative effects quite early (e.g., the nuclear chain reaction was conceived well before nuclear fission was discovered; we don’t have mirror life afaik but people already predict that it could be really harmful). Thus, I guess ASI would also be capable of predicting this, and so if we align it to a principle such as “don’t develop marginally dangerous technology”, then I think this should in principle work.
2. I think the article neglects the possibility that AI could kill us all by going rogue *without* this requiring any special technology. E.g., the AI might simply be broadly deployed in physical robots that are much like the robots we have today, and control more and more of our infrastructure. Then, if all AIs collude, they could one day decide to deprive us of basic needs, making humans go extinct.
Possibly Michael Nielsen would counter that this *is* in line with the vulnerable world hypothesis since the technology that kills us all is then the AI itself. But that would stretch it a bit since the vulnerable world hypothesis usually assumes that something world-destroying can be deployed with minor resources, which isn’t the case here.

Leon Lang Apr 12, 2025, 3:31 PM
6 points
0
in reply to: Jonas Hallgren’s comment on: AI 2027: What Superintelligence Looks Like
Thanks for these further pointers! I won’t go into detail; I will just say that I take the bitter lesson very seriously and that I think most of the ideas you mention won’t be needed for superintelligence. Some intuitions for why I don’t take typical arguments about the limits of transformers very seriously:
- If you hook up a transformer to itself with a reasoning scratchpad, then I think it can in principle represent any computation, beyond what would be possible in a single forward pass.
- On causality: Once we change to the agent-paradigm, transformers naturally get causal data since they will see how the “world responds” to their actions.
- General background intuition: Humans developed general intelligence and a causal understanding of the world by evolution, without anyone designing us very deliberately.

Leon Lang Apr 5, 2025, 9:28 AM
3 points
0
in reply to: Jonas Hallgren’s comment on: AI 2027: What Superintelligence Looks Like
I worked on geometric/equivariant deep learning a few years ago (with some success, leading to two ICLR papers and a patent, see my Google Scholar: https://scholar.google.com/citations?user=E3ae_sMAAAAJ&hl=en).
The type of research I did was very reasoning-heavy. It’s architecture research in which you think hard about how to mathematically guarantee that your network obeys some symmetry constraints appropriate for a domain and data source.

As a researcher in that area, you have a very strong incentive to claim that a special sauce is necessary for intelligence, since providing special sauces is all you do. As such, my prior is to believe that these researchers don’t have any interesting objection to continued scaling and “normal” algorithmic improvements to lead to AGI and then superintelligence.
It might still be interesting to engage when the opportunity arises, but I wouldn’t put extra effort into making such a discussion happen.

Leon Lang Mar 30, 2025, 11:37 AM
2 points
0
on: The limits of AI safety via debate
I found this fun to read, even years later. There is one case where Rohin updated you to accept a conclusion, where I’m not sure I agree:
As long as there exists an exponentially large tree explaining the concept, debate should find a linear path through it.
I think here and elsewhere, there seems to be a bit of conflation between “debate”, “explanation”, and “concept”.
My impression is that debate relies on the assumption that there are exponentially large explanations for true statements. If I’m a debater, I can make such an explanation by saying “A and B”, where A and B each have a whole tree of depth minus 1 below them. Then the debate picks out a linear path through that tree since the other debater tries to refute either A or B, after which I can answer by expanding the explanation of the one of them under attack. I agree with that argument.
However, I think this essentially relies on the judge already understanding all the concepts that I use in my arguments. If I say “A”, and the judge doesn’t even understand what A means, then we could be in trouble: I’m not sure the concepts in A can necessarily be explained efficiently, and the judge may need to understand them to appreciate refutations of the second debater’s arguments. For example, mathematics is full of concepts like “schemes” that just inherently take a long time to explain when the prerequisite concepts are not yet understood.
My hope would be that such complex abstractions usually have “interfaces” that make it easy to work with them. I.e., maybe the concept is complex, but it’s not necessary to explain the entire concept to the judge—maybe it’s enough to say “This scheme has the following property: [...]”, and maybe the property can be appreciated without understanding what a scheme is.
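The tree argument above can be made concrete with a minimal sketch. Everything here is a hypothetical illustration: `Claim`, `build_tree`, and `debate_path` are made-up names, and the opponent’s choice of which subclaim to attack is passed in as a plain function. The sketch only shows the counting argument (exponentially many leaves, linearly many debate steps); it says nothing about whether the judge understands the concepts inside each claim, which is the concern raised above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Claim:
    """A node in a hypothetical explanation tree: a claim backed by two subclaims."""
    text: str
    left: Optional["Claim"] = None
    right: Optional["Claim"] = None


def build_tree(depth: int, label: str = "root") -> Claim:
    """Build a full binary explanation tree with 2**depth leaf claims."""
    if depth == 0:
        return Claim(label)
    return Claim(
        label,
        build_tree(depth - 1, label + "A"),
        build_tree(depth - 1, label + "B"),
    )


def count_leaves(node: Claim) -> int:
    """Count the leaf claims, i.e. the size of the full explanation."""
    if node.left is None or node.right is None:
        return 1
    return count_leaves(node.left) + count_leaves(node.right)


def debate_path(node: Claim, attack_left: Callable[[Claim], bool]) -> List[str]:
    """Follow the single branch the opponent attacks at each step.

    attack_left is a stand-in for the second debater's choice between
    refuting subclaim A (left) or subclaim B (right).
    """
    path = [node.text]
    while node.left is not None and node.right is not None:
        node = node.left if attack_left(node) else node.right
        path.append(node.text)
    return path
```

For a tree of depth 10, the full explanation has 1024 leaves, but the debate only visits one claim per level.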

Leon Lang Mar 28, 2025, 6:43 PM
2 points
0
on: Third-wave AI safety needs sociopolitical thinking
The two sides of AI Safety (AI risk as anarchy vs. concentration of power) pattern-match a bit to this abstract I found today in my inbox:
The trajectory of intelligence evolution is often framed around the emergence of artificial general intelligence (AGI) and its alignment with human values. This paper challenges that framing by introducing the concept of intelligence sequencing: the idea that the order in which AGI and decentralized collective intelligence (DCI) emerge determines the long-term attractor basin of intelligence. Using insights from dynamical systems, evolutionary game theory, and network models, it argues that intelligence follows a path-dependent, irreversible trajectory. Once development enters a centralized (AGI-first) or decentralized (DCI-first) regime, transitions become structurally infeasible due to feedback loops and resource lock-in. Intelligence attractors are modeled in functional state space as the co-navigation of conceptual and adaptive fitness spaces. Early-phase structuring constrains later dynamics, much like renormalization in physics. This has major implications for AI safety: traditional alignment assumes AGI will emerge and must be controlled after the fact, but this paper argues that intelligence sequencing is more foundational. If AGI-first architectures dominate before DCI reaches critical mass, hierarchical monopolization and existential risk become locked in. If DCI-first emerges, intelligence stabilizes around decentralized cooperative equilibrium. The paper further explores whether intelligence structurally biases itself toward an attractor based on its self-modeling method—externally imposed axioms (favoring AGI) vs. recursive internal visualization (favoring DCI). Finally, it proposes methods to test this theory via simulations, historical lock-in case studies, and intelligence network analysis. The findings suggest that intelligence sequencing is a civilizational tipping point: determining whether the future is shaped by unbounded competition or unbounded cooperation.

Leon Lang Mar 23, 2025, 6:03 PM
7 points
2
in reply to: Christopher King’s comment on: METR: Measuring AI Ability to Complete Long Tasks
Do you think the x-axis being a release date is more mysterious than the same fact regarding Moore’s law?
(To be clear, I think this doesn’t make it less mysterious: For Moore’s law this also seems like a mystery to me. But this analogy makes it more plausible that there is a mysterious but true reason driving such trends, instead of the graph from METR simply being a weird coincidence.)

Leon Lang Mar 3, 2025, 7:12 PM
5 points
0
in reply to: Alexander Gietelink Oldenziel’s comment on: Alexander Gietelink Oldenziel’s Shortform
Maryna Viazovska, the Ukrainian Fields medalist, did her PhD in Germany under Don Zagier.
---

I once saw a French TV talk show where famous mathematicians were invited (I can’t find the links anymore). Something like that would be pretty much unthinkable in Germany. So I wonder if math generally has more prestige in France than, e.g., in Germany.

Leon Lang Mar 2, 2025, 11:11 PM
4 points
0
on: Leon Lang’s Shortform
I’m confused by the order Lesswrong shows posts to me: I’d expect to see them in chronological order if I select them by “Latest”.
But as you see, they were posted 1d, 4d, 21h, etc ago.
How can I see them chronologically?

Leon Lang Feb 18, 2025, 12:30 AM
1 point
0
on: AGI Safety & Alignment @ Google DeepMind is hiring
Thanks for this post!
The deadline possibly requires clarification:
We will keep the application form open until at least 11:59pm AoE on Thursday, February 27.
In the job posting, you write:
Application deadline: 12pm PST Friday 28th February 2025

Leon Lang Jan 16, 2025, 9:45 PM
30 points
4
on: Leon Lang’s Shortform
There are a few sentences in Anthropic’s “conversation with our cofounders” regarding RLHF that I found quite striking:
Dario (2:57): “The whole reason for scaling these models up was that [...] the models weren’t smart enough to do RLHF on top of. [...]”
Chris: “I think there was also an element of, like, the scaling work was done as part of the safety team that Dario started at OpenAI because we thought that forecasting AI trends was important to be able to have us taken seriously and take safety seriously as a problem.”
Dario: “Correct.”
That LLMs were scaled up partially in order to do RLHF on top of them is something I had previously heard from an OpenAI employee, but I wasn’t sure it’s true. This conversation seems to confirm it.

Leon Lang Jan 5, 2025, 8:47 PM
2 points
0
in reply to: cloud’s comment on: [Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF
Hi! Thanks a lot for your comments and very good points. I apologize for my late answer, caused by NeurIPS and all the end-of-year breakdown of routines :)
On 1: Yes, the formalism I’m currently working on also allows one to talk about the case where the human “understands less” than the AI.
On 2:
Have you considered the connection between partial observability and state aliasing/function approximation?
I am not entirely sure if I understand! Though if it’s just what you express in the following sentences, here are my answers:
Maybe you could apply your theory to weak-to-strong generalization by considering a weak model as operating under partial observability.
Very good observation! :) I’m thinking about it slightly differently, but the link is there: Imagine a scenario where we have a pretrained foundation model, and we train a linear probe attached to the internal representations, which is supposed to learn the correct reward for full state sequences, based on feedback from a human on partial observations. Then if we show this model (including attached probe) during training just the partial observations, it’s receiving the correct data and is supposed to generalize from feedback on “easy situations” (i.e., situations where the partial observations of the human provide enough information to make a correct judgment) to “hard situations” (full state sequences that the human couldn’t oversee, and where possibly the partial observations miss crucial details).
So I think this setting is an instance of weak-to-strong generalization.
Alternatively, by introducing structure to the observations, the function approximation lens might open up new angles of attack on the problem.
Yes that’s actually also part of what I’m exploring, if I understand your idea correctly. In particular, I’m considering the case that we may have “knowledge” of some form about the space in which the correct reward function lives. This may come from symmetries in the state space, for example: maybe we want to restrict to localized reward functions that are translation-invariant. All of that can easily be formalized in one framework.
Pretrained foundation models to which we attach a “reward probe” can be viewed as another instance of considering symmetries in the state space: In this case, we’re presuming that state sequences have the same reward if they give rise to the same “learned abstractions” in the form of the internal representations of the neural network.
On 3: Agreed. (Though I am not explicitly considering this case at this point.)
On 4:
I think you’re exactly right to consider abstractions of trajectories, but I’m not convinced this needs to be complicated. What if you considered the case where the problem definition includes features of state trajectories on which (known) human utilities are defined, but these features themselves are not always observed? (This is something I’m currently thinking about, as a generalization of the work mentioned in the postscript.)
This actually sounds very much like what I’m working on right now!! We should probably talk :)
On 5:
Am I correct in my understanding that the role Boltzmann rationality plays in your setup is just to get a reward function out of preference data?
If I understand correctly, yes. In a sense, we just “invert” the sigmoid function to recover the return function on observation sequences from human preference data. If this return function on observation sequences was already known, we’d still be doomed, as you correctly point out.
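A minimal numerical sketch of that inversion, under stated assumptions: the function names are hypothetical, `beta` is an assumed inverse-temperature parameter, and the preference probability is taken as exactly known rather than estimated from finitely many comparisons. It is not the formalism from the paper, just the sigmoid/logit round trip.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def preference_prob(return_a: float, return_b: float, beta: float = 1.0) -> float:
    """Boltzmann-rational probability that sequence a is preferred over b.

    return_a and return_b stand in for the return the human assigns to
    the two observation sequences; beta is an assumed rationality parameter.
    """
    return sigmoid(beta * (return_a - return_b))


def recover_return_difference(prob_a_over_b: float, beta: float = 1.0) -> float:
    """Invert the sigmoid (apply the logit) to recover return_a - return_b."""
    return math.log(prob_a_over_b / (1.0 - prob_a_over_b)) / beta
```

Note that only return differences are recovered this way, so the return function itself is pinned down only up to an additive constant.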
Thanks also for the notes on gradient routing! I will read your post and will try to understand the connection.

Leon Lang Dec 29, 2024, 2:38 PM
4 points
0
on: Leon Lang’s Shortform
This is a link to a big list of LLM safety papers based on a new big survey.

Leon Lang Dec 29, 2024, 9:31 AM
5 points
0
in reply to: leogao’s comment on: The Field of AI Alignment: A Postmortem, and What To Do About It
Thanks for the list! I have two questions:

1: Can you explain how generalization of NNs relates to ELK? I can see that it can help with ELK (if you know a reporter generalizes, you can train it on labeled situations and apply it more broadly) or make ELK unnecessary (if weak to strong generalization perfectly works and we never need to understand complex scenarios). But I’m not sure if that’s what you mean.

2: How is goodhart robustness relevant? Most models today don’t seem to use reward functions in deployment, and in training the researchers can control how hard they optimize these functions, so I don’t understand why they necessarily need to be robust under strong optimization.

Leon Lang Dec 29, 2024, 12:38 AM
4 points
2
in reply to: Nate Showell’s comment on: The Field of AI Alignment: A Postmortem, and What To Do About It
“heuristics activated in different contexts” is a very broad prediction. If “heuristics” include reasoning heuristics, then this probably includes highly goal-oriented agents like Hitler.

Also, some heuristics will be more powerful and/or more goal-directed, and those might try to preserve themselves (or sufficiently similar processes) more so than the shallow heuristics. Thus, I think eventually, it is plausible that a superintelligence looks increasingly like a goal-maximizer.

Leon Lang Dec 28, 2024, 4:10 PM
17 points
8
in reply to: evhub’s comment on: evhub’s Shortform
This is a low-effort comment in the sense that I don’t quite know whether you should do something different along the following lines, or what, and I have substantial uncertainty.

That said:
1. I wonder whether Anthropic is partially responsible for an increased international race through things like Dario advocating for an entente strategy and talking positively about Leopold Aschenbrenner’s “situational awareness”. I wished to see more of an effort to engage with Chinese AI leaders to push for cooperation/coordination. Maybe it’s still possible to course-correct.
2. Alternatively, I think that if there’s a way for Anthropic/Dario to communicate why you think an entente strategy is inevitable/desirable, in a way that seems honest and allows others to engage with your models of reality, that might also be very helpful for the epistemic health of the whole safety community. I understand that maybe there’s no politically feasible way to communicate honestly about this, but maybe see this as my attempt to nudge you in the direction of openness.
More specifically:

(a) it would help to learn more about your models of how winning the AGI race leads to long-term security (I assume that might require building up a robust military advantage, but given the physical hurdles that Dario himself expects for AGI to effectively act in the world, it’s unclear to me what your model is for how to get that military advantage fast enough after AGI is achieved).

(b) I also wonder whether potential future developments in AI Safety and control might give us information that the transition period is really unsafe; e.g., what if you race ahead and then learn that actually you can’t safely scale further due to risks of loss of control? At that point, coordinating with China seems harder than doing it now. I’d like to see a legible justification of your strategy that takes into account such serious possibilities.

Leon Lang Dec 26, 2024, 10:28 PM
7 points
7
in reply to: Buck’s comment on: Why don’t we currently have AI agents?

I have an AI agent that wrote myself

Best typo :D

Leon Lang Dec 26, 2024, 9:20 PM
4 points
0
in reply to: Buck’s comment on: Buck’s Shortform
Have you also tried reviewing for conferences like NeurIPS? I’d be curious what the differences are.
Some people send papers to TMLR when they think they wouldn’t be accepted to the big conferences due to not being that “impactful”—which makes sense since TMLR doesn’t evaluate impact. It’s thus possible that the median TMLR submission is worse than the median conference submission.

Leon Lang Dec 3, 2024, 10:32 PM
12 points
0
on: (The) Lightcone is nothing without its people: LW + Lighthaven’s big fundraiser
I just donated $200. Thanks for everything you’re doing!

Leon Lang Nov 25, 2024, 10:38 PM
2 points
0
in reply to: Cole Wyeth’s comment on: mishka’s Shortform
Yeah I think that’s a valid viewpoint.
Another viewpoint that points in a different direction: A few years ago, LLMs could only do tasks that take humans ~minutes. Now they’re at the ~hours point. So if this trend continues, eventually they’ll do tasks that take humans days, weeks, months, …
I don’t have good intuitions that would help me to decide which of those viewpoints is better for predicting the future.
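For what it’s worth, the second viewpoint can be turned into back-of-the-envelope arithmetic. This is only a sketch under a strong assumption, namely that the task horizon keeps doubling at a fixed rate; `months_until` is a hypothetical helper, and the doubling time is a free parameter you would have to supply, not a measured value.

```python
import math


def months_until(current_hours: float, target_hours: float, doubling_months: float) -> float:
    """Months until the task horizon reaches target_hours.

    Assumes steady exponential growth: the horizon doubles every
    doubling_months months (a hypothetical, user-supplied parameter).
    """
    return doubling_months * math.log2(target_hours / current_hours)
```

For example, going from a 1-hour horizon to an 8-hour horizon takes three doublings, so three times whatever doubling time one assumes.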