Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
It has already been getting a bunch harder. I am quite confident a lot of new submissions to LW are AI-generated, but the last month or two have made distinguishing them from human writing a lot harder. I still think we are pretty good at telling them apart, but I don’t think we are that many months away from that breaking as well.
Promoted to curated: I quite liked this post. The basic model feels like one I’ve seen explained in a bunch of other places, but I did quite like the graphs and the pedagogical approach taken in this post, and I also think book reviews continue to be one of the best ways to introduce new ideas.
Would you expect that if you trained an AI system on translating its internal chain of thought into a different language, this would make it substantially harder for it to perform tasks in the language it was originally trained in? If so, I am confident you are wrong and that you have learned something new today!
Training transformers on additional languages basically doesn’t change performance at all: the model just learns to translate between its existing internal latent distribution and the new language, and now has a new language it can speak in, with basically no substantial change in its performance on other tasks (it does of course get better at tasks that require speaking in the new foreign language, and maybe gets a small boost in general task performance because you gave it more data than you had before).
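To make the shape of that claim concrete, here is a minimal toy sketch of the before/after measurement it refers to. Everything in it (the tiny non-causal stand-in model, the random stand-in corpora for the “original” and “new” languages) is hypothetical, and the numbers from a toy like this won’t mirror a large pretrained model; the point is just what you would measure.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM, SEQ = 1000, 64, 32

# Tiny non-causal toy model standing in for a real language model.
model = nn.Sequential(
    nn.Embedding(VOCAB, DIM),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
        num_layers=2,
    ),
    nn.Linear(DIM, VOCAB),
)

def token_loss(batch):
    # Next-token prediction loss on a batch of token ids.
    logits = model(batch[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1)
    )

# Stand-ins for the "original language" and "new language" distributions.
original = torch.randint(0, VOCAB // 2, (64, SEQ))
new_lang = torch.randint(VOCAB // 2, VOCAB, (64, SEQ))

before = token_loss(original).item()

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for _ in range(50):  # brief fine-tuning on the new-language distribution
    opt.zero_grad()
    token_loss(new_lang).backward()
    opt.step()

after = token_loss(original).item()
print(f"loss on the original distribution before/after: {before:.3f} / {after:.3f}")
```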
Of course the default outcome of doing finetuning on any subset of data with easy-to-predict biases will be that you aren’t shifting the inductive biases of the model on the vast majority of the distribution. This isn’t because of an analogy with evolution; it’s a necessity of how we train big transformers. In this case, the AI will likely just learn how to speak the “corrigible language” the same way it learned to speak French, and this will make approximately zero difference to any of its internal cognition, unless you are doing transformations to its internal chain of thought that substantially change its performance on the actual tasks you are trying to optimize for.
Interspersing the French data with the rest of its training data won’t change anything either. It will again just learn the language. Giving it more data in French will now basically do the same as giving it more data in English. The learning is no longer happening at the language level; it’s happening at the content and world-model level.
Things that happen:
Backpropagating on the outputs that are “more corrigible” will have some (though mostly very small) impact on your task performance. If you set the learning rate high, or you backpropagate on a lot of data, your performance can go down arbitrarily far.
By default this will do very little because you are providing training data with very little variance in it (even less so than usual, because you are training on AI outputs, which the AI is of course already amazing at predicting). If you train very hard you will probably run into consistent mode collapse. In general, you can’t really train any particular bias into an AI system via your data, because you don’t have enough variation in your data. We can approximately only train AI systems to do one thing, which is to predict the next token from distributions for which we have trillions of tokens of training data that are hard to predict (which is basically just going to be internet text, audio and video, though more RL-like environments are also feasible now).[1]
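A toy sketch of the mode-collapse dynamic, for concreteness: repeatedly fine-tune a tiny stand-in model on its own greedy outputs and watch the entropy of its output distribution fall. The model, data and sizes here are all made up for illustration; this shows the dynamic, not a result about any real system.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM, SEQ = 200, 32, 16

# A trivially small per-token "model", just for illustration.
model = nn.Sequential(nn.Embedding(VOCAB, DIM), nn.Linear(DIM, VOCAB))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
prompts = torch.randint(0, VOCAB, (32, SEQ))

def output_entropy(tokens):
    probs = torch.softmax(model(tokens), dim=-1)
    return -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean().item()

for round_ in range(5):
    with torch.no_grad():
        targets = model(prompts).argmax(-1)  # the model's own greedy outputs
    for _ in range(100):  # fine-tune on those self-generated targets
        opt.zero_grad()
        nn.functional.cross_entropy(
            model(prompts).reshape(-1, VOCAB), targets.reshape(-1)
        ).backward()
        opt.step()
    print(f"round {round_}: output entropy {output_entropy(prompts):.3f}")
```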
The answer to this is the answer to any question of the form “what if we just generate lots of data with the inductive biases we would like the model to have?”.
The answer is always
“we can’t generate realistic data with whatever inductive biases we want”, and
“we can’t remotely generate enough data without dealing with mode collapse”, and
“we have basically no idea how inductive biases generalize from the training data to the model output, especially as the model starts reflecting on itself and modeling the data generation process”, and
“if you train or backpropagate directly against your discriminator the model will learn to hack the discriminator”,
which are like all the standard alignment failures that have been written about for tens of thousands of pages by now.
At a more detailed level, here is roughly what I actually expect to happen if you do this specific thing (a toy sketch of the within-distribution dynamic follows after the list), but to be clear, this is a much less confident prediction than the reference class outlined above:
After a few hundred thousand iterations of this, the model will have figured out a set of consistent indicators that are present whenever it is given past output of its own that it was asked to rephrase to “be more corrigible”
It becomes quite good at predicting these indicators and quickly saturates on them, resulting in very corrigible-looking outputs within the distribution in which you did this training
Given that the actual output of the chain of thought changes very little, this sends very little gradient to almost any of its other cognition, the same way that training an AI in an additional language with different grammar and social norms basically doesn’t change any of its other circuits and cognition
If you drastically increase the learning rate or really go very hard on generating enormous amounts of data, the model will start to cannibalize other parts of its cognition to perform well at this task, at which point generalized task performance will go down, but you probably won’t do this, because it would be very expensive (and also make the model non-competitive)
Within distribution the model will now probably produce outputs that look “more corrigible”. Again, this will not actually reflect much of a change in its internal cognition, the same way training an AI system on foreign language output will not substantially interfere with its cognition in other languages
If you train the model on some end-to-end tasks where the corrigibility would end up bad for performance, it will probably very quickly learn to self-prompt itself in a way that causes it to think to itself without the corrigibility biases you introduced into it. This might literally mean switching to a different foreign language for its internal thoughts, or something equivalent, but most likely it just means it sounds more like what it sounded like before you did this whole thing. If you try to prevent this with continuous monitoring, it will find some way to jailbreak itself or think in some way where the corrigibility-language monitoring makes little difference to what cognition is actually going on.
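Here is the toy sketch referenced above: train a tiny model on a mix of ordinary data (reproduce the input) and “flagged” data whose targets are dominated by a fixed surface marker, and track both losses. The flag token, the marker, and the copy task are hypothetical stand-ins for “asked to rephrase more corrigibly” and “ordinary cognition”; it illustrates the measurement, not a result about any real model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM, SEQ = 100, 64, 16
FLAG, MARKER = 0, 1  # stand-in "rephrase request" token and surface marker

model = nn.Sequential(
    nn.Embedding(VOCAB, DIM),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
        num_layers=2,
    ),
    nn.Linear(DIM, VOCAB),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def batch(flagged):
    x = torch.randint(2, VOCAB, (64, SEQ))
    y = x.clone()                     # ordinary task: reproduce the input
    if flagged:
        x[:, 0] = FLAG                # "rephrase to be corrigible" flag
        y[:, 1:] = MARKER             # targets collapse onto the marker
    return x, y

def loss(flagged):
    x, y = batch(flagged)
    return nn.functional.cross_entropy(model(x).reshape(-1, VOCAB), y.reshape(-1))

for step in range(600):
    opt.zero_grad()
    (loss(True) + loss(False)).backward()   # interleave both kinds of data
    opt.step()
    if step % 200 == 199:
        with torch.no_grad():
            print(f"step {step}: flagged {loss(True):.3f}, plain {loss(False):.3f}")
```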
Ending on a slightly more sarcastic note because I do feel a sense of despair about even people who have been thinking about this stuff for a long time seeming to me to not see the logic above:
At the end of the day, very little good will have happened, but you will have a harder time identifying whether the model is thinking non-corrigibly, as you have trained the model against at least one discriminator for deception without training out its deceptive cognition. If you now repeat this for 4-5 similarly shallow alignment approaches you will have produced a model that looks visibly more aligned to anyone who thinks approaches like this might work, without any actual changes to the underlying structure of its cognition, making coordination to actually do anything about the problem much harder.
Or in a more generalized form also applicable to RL environments: We can only train AI systems to be competent, as all scaling laws (and common sense) have shown that competence is approximately the only thing that generalizes between all environments.
We cannot generate environments that teach virtue, because we do not have principles with which we can create the whole complexity of a universe that requires superhuman intelligence to navigate, while also only doing so by thinking in the specific ways we would like you to think. We do not know how to specify how to solve most problems in virtuous ways, we are barely capable of specifying how to solve them at all, and so we cannot build consistently rich environments that chisel virtuous cognition into you.
The amount of chiseling of cognition any approach like this can achieve is roughly bounded by the difficulty and richness of cognition that your transformation of the data requires to reverse. Your transformation of the data is likely trivial to reverse (i.e. predicting the “corrigible” text from non-corrigible cognition is likely trivially easy especially given that it’s AI generated by our very own model), and as such, practically no chiseling of cognition will occur. If you hope to chisel cognition into AI, you will need to do it with a transformation that is actually hard to reverse, so that you have a gradient into most of the network that is optimized to solve hard problems.
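As a toy illustration of that bound (with a made-up model and data; nothing here is a claim about any real system): a batch the model already predicts well, standing in for corrigible-sounding rephrasings of its own outputs, produces a small loss and hence a small gradient into the network, while a batch it still finds hard produces a large one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM, SEQ = 100, 32, 16
model = nn.Sequential(nn.Embedding(VOCAB, DIM), nn.Linear(DIM, VOCAB))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

inputs = torch.randint(0, VOCAB, (64, SEQ))
easy_targets = (inputs + 1) % VOCAB          # a trivial, learnable transformation
hard_targets = torch.randint(0, VOCAB, (64, SEQ))  # targets the model can't predict

def loss_and_grad_norm(targets):
    model.zero_grad()
    loss = nn.functional.cross_entropy(
        model(inputs).reshape(-1, VOCAB), targets.reshape(-1)
    )
    loss.backward()
    total = sum((p.grad ** 2).sum() for p in model.parameters())
    return loss.item(), total.sqrt().item()

# Make the "easy" batch genuinely easy first: the model has already fit it,
# as it will have for text it generated (or can trivially rephrase) itself.
for _ in range(500):
    opt.zero_grad()
    nn.functional.cross_entropy(
        model(inputs).reshape(-1, VOCAB), easy_targets.reshape(-1)
    ).backward()
    opt.step()

print("easy batch (loss, grad norm):", loss_and_grad_norm(easy_targets))
print("hard batch (loss, grad norm):", loss_and_grad_norm(hard_targets))
```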
Yeah, I don’t think the classification is super obvious. I categorized all things that had AI policy relevance, even if not directly about AI policy.
23 is IMO also very unambiguously AI policy relevant (ignoring quality). Zvi’s analysis almost always includes lots of AI policy discussion, so I think 33 is also a clear “Yes”. The other ones seem like harder calls.
Sampling all posts also wouldn’t be too hard. My guess is you get something in the 10-20%-ish range of posts by similar standards.
Maybe you switched to the Markdown editor at some point. It still works in the (default) WYSIWYG editor.
IDK, like 15% of submissions to LW seem currently AI governance and outreach focused. I don’t really get the premise of the post. It’s currently probably tied for the most popular post category on the site (my guess is a bit behind more technical AI Alignment work, but not by that much).
Here is a quick spreadsheet where I classify submissions in the “AI” tag by whether they are substantially about AI Governance:
Overall results: 34% of submissions under the AI tag seem pretty clearly relevant to AI Governance or are about AI governance, and an additional 12% are “maybe”. In other words, there really is a lot of AI Governance discussion on LW.
Solar + wind has made a huge dent in energy production, so I feel like this example is confused.
It does seem like this strategy just really worked quite well, and a combination of battery progress and solar would probably by default replace much of fossil-fuel production in the long run. It already has to a quite substantial degree:
Hydro + Solar + Wind + other renewables has grown to something like 40% of total energy production (edit: in the EU, which feels like the most reasonable reference class for whether this worked).
Domains with high skill ceilings are quite rare.
Success in almost every domain is strongly correlated with g, including into the tails. This IMO relatively clearly shows that most domains are high skill-ceiling domains (and also that skills in most domains are correlated and share a lot of structure).
This reminds me of a conversation I had recently about whether the concept of “evil” is useful. I was arguing that I found “evil”/”corruption” helpful as a handle for a more model-free “move away from this kind of thing even if you can’t predict how exactly it would be bad” relationship to a thing, which I found hard to express in more consequentialist frames.
Promoted to curated: I really liked this post for its combination of reporting negative results and communicating a deeper shift in response to those negative results, while seeming pretty well-calibrated about the extent of the update. I would have already been excited about curating this post without the latter, but it felt like an additional good reason.
I think for many years there was a lot of frustration from people outside of the community about people inside of it not going into a lot of detail. My guess is we are dealing with a bit of a pendulum swing of now people going hard in the other direction. I do think we are just currently dealing with one big wave of this kind of content. It’s not like there was much of any specific detailed scenario work two years ago.
My best guess is you are a few months behind in your takes? The latest generation of thinking models can definitely do agentic frontend development and build small projects autonomously. They definitely still make errors and have blindspots that require human supervision, but in terms of skill level, the systems feel comparable and usually superior to junior programmers (though when they fail, they fail hard and worse than real junior programmers).
I am planning to make an announcement post for the new album in the next few days, maybe next week. The songs yesterday were early previews and we still have some edits to make before it’s ready!
Context: LessWrong has been acquired by EA
Goodbye EA. I am sorry we messed up.
EA has decided to not go ahead with their acquisition of LessWrong.
Just before midnight last night, the Lightcone Infrastructure board presented me with information suggesting at least one of our external software contractors has not been consistently candid with the board and me. Today I have learned EA has fully pulled out of the deal.
As soon as EA had sent over their first truckload of cash, we used that money to hire a set of external software contractors, vetted by the most agentic and advanced resume review AI system that we could hack together.
We also used it to launch the biggest prize the rationality community has seen, a true search for the Kwisatz Haderach of rationality: $1M for the first person to master all twelve virtues.
Unfortunately, it appears that one of the software contractors we hired inserted a backdoor into our code, preventing anyone except themselves and participants excluded from receiving the prize money from collecting the final virtue, “The void”. Some participants even saw themselves winning this virtue, but the backdoor prevented them from mastering this final and most crucial rationality virtue at the last possible second.
They then created an alternative account, using their backdoor to master all twelve virtues in seconds. As soon as our fully automated prize systems sent over the money, they cut off all contact.
Right after EA learned of this development, they pulled out of the deal. We immediately removed all code written by the software contractor in question from our codebase. They were honestly extremely productive, and it will probably take us years to make up for this loss. We will also be rolling back any karma changes and resetting the vote strength of all votes cast in the last 24 hours, since, while we are confident our karma system would have been greatly improved had our system worked, the risk of further backdoors and hidden accounts is too big.
We will not be refunding the enormous boatloads of cash[1] we were making in the sale of Picolightcones, as I am assuming you all read our sale agreement carefully[2], but do reach out and ask us for a refund if you want.
Thank you all for a great 24 hours though. It was nice while it lasted.
$280! Especially great thanks to the great whale who spent a whole $25.
IMPORTANT Purchasing microtransactions from Lightcone Infrastructure is a high-risk indulgence. It would be wise to view any such purchase from Lightcone Infrastructure in the spirit of a donation, with the understanding that it may be difficult to know what role custom LessWrong themes will play in a post-AGI world. LIGHTCONE PROVIDES ABSOLUTELY NO LONG-TERM GUARANTEES THAT ANY SERVICES SO RENDERED WILL LAST LONGER THAN 24 HOURS.
You can now choose which virtues you want to display next to your username! Just go to the virtues dialogue on the frontpage and select the ones you want to display (up to 3).
Absolutely, that is our sole motivation.
You are using the Markdown editor, which many fewer users use. The instructions are correct for the WYSIWYG editor (seems fine to add a footnote explaining the different syntax for Markdown).