Running Lightcone Infrastructure, which runs LessWrong. You can reach me at habryka@lesswrong.com. I have signed no contracts or agreements whose existence I cannot mention.
I think “full visibility” seems like the obvious thing to ask for, and something that could maybe improve things. Also, preventing you from selling your products to the public, and basically forcing you to sell your most powerful models only to the government, gives the government more ability to stop things when it comes to it.
I will think more about this, I don’t have any immediate great ideas.
If the project were fueled by a desire to beat China, its structure seems unlikely to resemble the parts of the Manhattan Project's structure that seemed maybe advantageous here, like having a single government-controlled centralized R&D effort.
My guess is if something like this actually happens, it would involve a large number of industry subsidies, and would create strong institutional momentum to push the state of the art forward even when things got dangerous, and insofar as there is pushback, to continue dangerous development in secret.
In the case of nuclear weapons the U.S. really went very far on the advice of Edward Teller, so I think the outside view here really doesn't look good.
I don’t remember ever adjudicating this, but my current intuition, having not thought about it hard, is that I don’t see a super clear line here (like, in a moderation dispute I can imagine judging either way depending on the details).
The Truman Show: A great depiction of a crisis of faith and of noticing your confusion; more generally, it's about figuring out the truth.
Most relevant sequence posts: Crisis of Faith, Lonely Dissent
Going by today’s standards, we should have banned Gwern in 2012.
(I don’t understand what this is referring to)
Indeed. I fixed it. Let’s see whether it repeats itself (we got kind of malformed HTML from the RSS feed).
Update: I have now cross-referenced every single email for accuracy, cleaned up and clarified the thread structure, and added subject lines and date stamps wherever they were available. I now feel comfortable with people quoting anything in here without checking the original source (unless you are trying to understand the exact thread structure of who was CC’d and when, which was a bit harder to compress into a linear format).
(For anyone curious, the AI transcription and compilation made one single error, which is that it fixed a typo in one of Sam's messages from "We did this is a way" to "We did this in a way". Honestly, my guess is any non-AI effort would have had a substantially higher error rate, which was a small update for me on the reliability of AI for something like this, and also makes the handwringing about whether it is OK to post something like this feel kind of dumb. It also accidentally omitted one email with a weird thread structure.)
FWIW, my best guess is the document contains fewer errors than having a human copy-paste things and stitch it together. The errors have a different nature to them, and so it makes sense to flag them, but like, I started out with copy-pasting and OCR, and that did not actually have an overall lower error rate.
If other people have to check it before they quote it, why is it OK for you not to check it before you post it?
Because I said prominently at the top that I used AI assistance for it. Of course, feel free to do the same.
Fixed! That specific response had a very weird thread structure, so makes sense the AI I used got confused. Plausible something else is still missing, though I think I’ve now read through all the original PDFs and didn’t see anything new.
OpenAI Email Archives (from Musk v. Altman)
What do you mean by “applied research org”? Like, applied alignment research?
Using Dangerous AI, But Safely?
A bunch of very interesting emails between Elon, Sam Altman, Ilya and Greg were released (I think in some legal proceedings, but not sure). It would IMO be cool for someone to gather them all and do some basic analysis of them.
This was a really good analysis of a bunch of election stuff that I hadn’t seen presented clearly like this anywhere else. If it wasn’t about elections and news I would curate it.
Not sure what you mean. The API continues to exist (and has existed since the beginning of LW 2.0).
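(For reference, here's a minimal sketch of querying that API. The GraphQL endpoint at lesswrong.com/graphql is real; the exact query shape below — a "posts" resolver taking view/limit terms — is my assumption from memory of the schema and may need adjusting against the live one.)

```typescript
// Minimal sketch: fetching recent posts from the LessWrong GraphQL API.
// Endpoint is real; the query shape is an assumption and may need
// adjusting against the live schema at https://www.lesswrong.com/graphql.
const endpoint = "https://www.lesswrong.com/graphql";

const query = `
  query RecentPosts {
    posts(input: { terms: { view: "new", limit: 5 } }) {
      results {
        title
        pageUrl
      }
    }
  }
`;

async function fetchRecentPosts(): Promise<void> {
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  });
  const json = await res.json();
  // Print the titles and URLs of the returned posts.
  console.log(JSON.stringify(json.data, null, 2));
}

fetchRecentPosts().catch(console.error);
```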
I think the comment more confirms than disconfirms John's comment (though I still think it's too broad for other reasons). OP "funding" something has historically basically always meant recommending a grant to GV. Luke's language suggests to me that the right-of-center grants are indeed no longer referred to GV (based on a vague vibe of how he refers to funders in the plural).
OP has always made some grant recommendations to other funders (historically OP would probably describe those grants as “rejected but referred to an external funder”). As Luke says, those are usually ignored, and OP’s counterfactual effect on those grants is much less, and IMO it would be inaccurate to describe those recommendations as “OP funding something”. As I said in the comment I quote in the thread, most OP staff would like to fund things right of center, but GV does not seem to want to, as such the only choice OP has is to refer them to other funders (which sometimes works, but mostly doesn’t).
As another piece of evidence, when OP defunded all the orgs that GV didn’t want to fund anymore, the communication emails that OP sent said that “Open Philanthropy is exiting funding area X” or “exiting organization X”. By the same use of language, yes, it seems like OP has exited funding right-of-center policy work.
(I think it would make sense to taboo "OP funding X" in future conversations to avoid confusion, but also, I think historically it was very meaningfully the case that getting funded by GV is much better described as "getting funded by OP", given that you would never talk to anyone at GV and the opinions of anyone at GV would basically have no influence on you getting funded. Things are different now, and in a meaningful sense OP isn't funding anyone anymore; they are just recommending grants to others, and it matters more what those others think than what OP staff thinks.)
One of these types of orgs is developing a technology with the potential to kill literally all of humanity. The other type of org is funding research that, if it goes badly, mostly just wastes their own money. Of course the demands for legibility and transparency should be different.
My best guess is this is false. As a quick sanity check, here are some bipartisan and right-leaning organizations historically funded by OP:
FAI leans right: https://www.openphilanthropy.org/grants/foundation-for-american-innovation-ai-safety-policy-advocacy/
Horizon is bipartisan: https://www.openphilanthropy.org/grants/open-philanthropy-technology-policy-fellowship-2022/
CSET is bipartisan: https://www.openphilanthropy.org/grants/georgetown-university-center-for-security-and-emerging-technology/
IAPS is bipartisan: https://www.openphilanthropy.org/grants/page/2/?focus-area=potential-risks-advanced-ai&view-list=false, https://www.openphilanthropy.org/grants/institute-for-ai-policy-strategy-general-support/
RAND is bipartisan: https://www.openphilanthropy.org/grants/rand-corporation-emerging-technology-fellowships-and-research-2024/
Safe AI Forum: https://www.openphilanthropy.org/grants/safe-ai-forum-operating-expenses/
AI Safety Communications Centre seems to lean left: https://www.openphilanthropy.org/grants/effective-ventures-foundation-ai-safety-communications-centre/
Of those, I think FAI is the only one at risk of OP being unable to fund them, based on my guess of where things are leaning. I would be quite surprised if they defunded the other ones on grounds of partisan lean.
Possibly you meant to say something more narrow like “even if you are trying to be bipartisan, if you lean right, then OP is substantially less likely to fund you” which I do think is likely true, though my guess is you meant the stronger statement, which I think is false.
Yeah, IMO we should just add a bunch of functionality for integrating Alignment Forum stuff more with academic things. It's been on my to-do list for a long time.