I write software at survivalandflourishing.com. Previously MATS, Google, Khan Academy.
Joel Burget
For (2), I’m gonna uncharitably rephrase your point as saying: “There hasn’t been a sharp left turn yet, and therefore I’m overall optimistic there will never be a sharp left turn in the future.” Right?
Hm, I wouldn’t have phrased it that way. Point (2) says nothing about the probability of there being a “left turn”, just the speed at which it would happen. When I hear “sharp left turn”, I picture something getting out of control overnight, so it’s useful to contextualize how much compute you have to put in to get performance out, since this suggests that (inasmuch as it’s driven by compute) capabilities ought to grow gradually.
I feel like you’re disagreeing with one of the main arguments of this post without engaging it.
I didn’t mean to disagree with anything in your post, just to add a couple points which I didn’t think were addressed.
You’re right that point (2) wasn’t engaging with the (1-3) triad, because it wasn’t meant to. It’s only about the rate of growth of capabilities (which is important because if each subsequent model is only 10% more capable than the one which came before, then there’s good reason to think that alignment techniques which work well on current models will also work on subsequent models).
Again, the big claim of this post is that the sharp left turn has not happened yet. We can and should argue about whether we should feel optimistic or pessimistic about those “wrenching distribution shifts”, but those arguments are as yet untested, i.e. they cannot be resolved by observing today’s pre-sharp-left-turn LLMs. See what I mean?
I do see, and I think this gets at the difference in our (world) models. In a world where there’s a real discontinuity, you’re right, you can’t say much about a post-sharp-turn LLM. In a world where there’s continuous progress, like I mentioned above, I’d be surprised if a “left turn” suddenly appeared without any warning.
I like this post but I think it misses / barely covers two of the most important cases for optimism.
1. Detail of specification
Frontier LLMs have a very good understanding of humans, and seem to model them as well as or even better than other humans. I recall seeing repeated reports of Claude understanding its interlocutor faster than they thought was possible, as if it just “gets” them, e.g. from one Reddit thread I quickly found:
“sometimes, when i’m tired, i type some lousy prompts, full of typos, incomplete info etc, but Claude still gets me, on a deep fucking level”
“The ability of how Claude AI capture your intentions behind your questions is truly remarkable. Sometimes perhaps you’re being vague or something, but it will still get you.”
“even with new chats, it still fills in the gaps and understands my intention”
LLMs have presumably been trained on:
millions of anecdotes from the internet, including how the author felt, other users’ reactions and commentary, etc.
case law: how humans chosen for their wisdom (judges) determined what was right and wrong
thousands of philosophy books
Lesswrong / Alignment Forum, with extensive debate on what would be right and wrong for AIs to do
There are also techniques like deliberative alignment, which includes an explicit specification for how AIs should behave. I don’t think the model spec is currently detailed enough, but I assume OpenAI intends to actively update it.
Compare this to the “specification” humans are given by your Ev character: some basic desires for food, comfort, etc. Our desires are very crude, confusing, and inconsistent, and only very roughly correlate with IGF. It’s hard to emphasize enough how much more detailed the specification we present to AI models is.
2. (Somewhat) Gradual Scaling
Toby Ord estimates that pretraining “compute required scales as the 20th power of the desired accuracy”. He estimates that inference scaling is even more expensive, requiring exponentially more compute just to make constant progress. Both of these trends suggest that, even with large investments, performance will increase slowly from hardware alone (this relies on the assumption that hardware performance / $ is increasing slowly, which seems empirically justified). Progress could be faster if big algorithmic improvements are found. In particular, I want to call out that recursive self-improvement (especially without a human in the loop) could blow up this argument (which is why I wish it were banned). Still, I’m overall optimistic that capabilities will scale fairly smoothly / predictably.
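To make the shape of that power law concrete, here’s a rough back-of-the-envelope sketch. The exponent is Ord’s; the example accuracy gains are my own illustrative choices, not his figures. If compute ∝ accuracy^20, even modest accuracy gains get expensive fast.

```python
# Sketch of what "compute scales as the 20th power of the desired accuracy"
# implies. Exponent from Ord; the gains below are illustrative assumptions.

def compute_multiplier(relative_accuracy_gain: float, exponent: float = 20.0) -> float:
    """Factor by which compute must grow for a given relative accuracy gain."""
    return relative_accuracy_gain ** exponent

for gain in (1.01, 1.10, 1.50, 2.00):
    print(f"{(gain - 1) * 100:4.0f}% better accuracy -> "
          f"~{compute_multiplier(gain):,.1f}x the pretraining compute")

# Approximate output:
#    1% better accuracy -> ~1.2x the pretraining compute
#   10% better accuracy -> ~6.7x the pretraining compute
#   50% better accuracy -> ~3,325.3x the pretraining compute
#  100% better accuracy -> ~1,048,576.0x the pretraining compute
```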
With (1) and (2) combined, we’re able to gain some experience with each successive generation of models, and add anything we find missing to the training dataset / model spec, without taking any leaps that are too big / dangerous. I don’t want to suggest that the process of scaling up while maintaining alignment will definitely succeed, just that we should update towards the optimistic view based on these arguments.
scale up to superintelligence in parallel across many different projects / nations / factions, such that the power is distributed
This has always struck me as worryingly unstable. ETA: Because in this regime you’re incentivized to pursue reckless behaviour to outcompete the other AIs, e.g. recursive self-improvement.
Is there a good post out there making a case for why this would work? A few possibilities:
The AIs are all relatively good / aligned. But they could be outcompeted by malevolent AIs. I guess this is what you’re getting at with “most of the ASIs are aligned at any given time”, so they can band together and defend against the bad AIs?
They all decide / understand that conflict is more costly than cooperation. A darker variation on this is mutually assured destruction, which I don’t find especially comforting to live under.
Some technological solution to binding / unbreakable contracts such that reneging on your commitments is extremely costly.
I’m fine. Don’t worry too much about this. It just made me think, what am I doing here? For someone to single out my question and say “it’s dumb to even ask such a thing” (and the community apparently agrees)… I just think I’ll be better off not spending time here.
My question specifically asks about the transition to ASI, which, while I think it’s really hard to predict, seems likely to take years, during which time we have intelligences just a bit above human level, before they’re truly world-changingly superintelligent. I understand this isn’t everyone’s model, and it’s not necessarily mine, but I think it is plausible.
Asking “how could someone ask such a dumb question?” is a great way to ensure they leave the community. (Maybe you think that’s a good thing?)
I should have included this in my list from the start. I basically agree with @Seth Herd that this is a promising direction but I’m concerned about the damage that could occur during takeoff, which could be a years-long period.
Economic Post-ASI Transition
Pandora’s box is a much better analogy for AI risk, nuclear energy / weapons, fossil fuels, and bioengineering than it was for anything in the ancient world. Nobody believes in Greek mythology these days, but if anyone still did, they’d surely use this as a reason you should believe their religion.
A different way to think about types of work is within current ML paradigms vs outside of them. If you believe that timelines are short (e.g. 5 years or less), it makes much more sense to work within current paradigms; otherwise there’s very little chance your work will be adopted in time to matter. Mainstream AI, with all of its momentum, is not going to adopt a new paradigm overnight.
If I understand you correctly, there’s a close (but not exact) correspondence between work I’d label in-paradigm and work you’d label as “streetlighting”. On my model the best reason to work in-paradigm is because that’s where your work has a realistic chance to make a difference in this world.
So I think it’s actually good to have a portfolio of projects (maybe not unlike the current mix), from moonshots to very prosaic approaches.
Hi Boaz, first let me say that I really like Deliberative Alignment. Introducing a system 2 element is great, not only for higher-quality reasoning, but also for producing a legible, auditable chain of thought. That said, I have a couple of questions I’m hoping you might be able to answer.
I read through the model spec (which DA uses, or at least a closely-related spec). It seems well-suited and fairly comprehensive for answering user questions, but not sufficient for a model acting as an agent (which I expect to see more and more). An agent acting in the real world might face all sorts of interesting situations that the spec doesn’t provide guidance on. I can provide some examples if necessary.
Does the spec fed to models ever change depending on the country / jurisdiction where the model’s data center or the user is located? Situations which are normal in some places may be illegal in others. For example, Google tells me that homosexuality is illegal in 64 countries. Other situations are more subtle and may reflect different cultures / norms.
It’s hard to compare across domains but isn’t the FrontierMath result similarly impressive?
Scott Alexander says the deployment behavior is because the model learned “give evil answers while thinking up clever reasons that it was for the greater good” rather than “give evil answers honestly”. To what degree do you endorse this interpretation?
Thanks for this! I just doubled my donation because of this answer and @kave’s.
FWIW a lot of my understanding that Lighthaven was a burden comes from this section:
I initially read this as $3m for three interest payments. (Maybe change the wording so 2 and 3 don’t both mention the interest payment?)
I donated $500. I get a lot of value from the website and think it’s important for both the rationalist and AI safety communities. Two related things prevented me from donating more:
Though it’s the website which I find important, as I understand it, the majority of this money will go towards supporting Lighthaven.
I could easily imagine, if I were currently in Berkeley, finding Lighthaven more important. My guess is that in general folks in Berkeley / the Bay Area will tend to value Lighthaven more highly than folks elsewhere. Whether this is because of Berkeley folks overvaluing it or the rest of us undervaluing, I’m not sure. Probably a bit of both.
To me, this suggests unbundling the two rather different activities.
Sustainability going forward. It’s not clear to me that Lightcone is financially sustainable; in fact, the numbers in this post make it look like it’s not (due to the loss of funders), barring some very large donations. I worry that the future of LW will be endangered by the financial burden of Lighthaven.
ETA: On reflection, I think some large donors will probably step in to prevent bankruptcy, though (a) I think there’s a good chance Lightcone will then be stuck in perpetual fundraising mode, and (b) that belief of course calls into question the value of smaller donations like mine.
though with an occasional Chinese character once in a while
The Chinese characters sound potentially worrying. Do they make sense in context? I tried a few questions but didn’t see any myself.
There are now two alleged instances of full chains of thought leaking (apply an appropriate amount of skepticism), both of which seem coherent enough.
I think it’s more likely that this is just a (non-model) bug in ChatGPT. In the examples you gave, it looks like there’s always one step that comes completely out of nowhere, and the rest of the chain of thought would make sense without it. This reminds me of the bug where ChatGPT would show other users’ conversations.
I hesitate to draw any conclusions from the o1 CoT summary since it’s passed through a summarizing model.
after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.
Thanks for your patience: I do think this message makes your point clearly. However, I’m sorry to say, I still don’t think I was missing the point. I reviewed §1.5, still believe I understand the open-ended autonomous learning distribution shift, and also find it scary. I also reviewed §3.7, and found it to basically match my model, especially this bit:
Overall, I don’t have the impression we disagree too much. My guess for what’s going on (and it’s my fault) is that my initial comment’s focus on scaling was not a reaction to anything you said in your post; in fact, you didn’t say much about scaling at all. It was more a response to the scaling discussion I see elsewhere.