Interpretability is not currently trying to look at AIs to determine whether they will kill us. That’s way too advanced for where we’re at.
Right, and that’s a problem. There’s this big qualitative gap between the kinds of questions interp is even trying to address today, and the kinds of questions it needs to address. It’s the gap between talking about stuff inside the net, and talking about stuff in the environment (which the stuff inside the net represents).
And I think the focus on LLMs is largely to blame for that gap seeming “way too advanced for where we’re at”. I expect it’s much easier to cross if we focus on image models instead.
(And to be clear, even after crossing the internals/environment gap, there will still be a long way to go before we’re ready to ask about e.g. whether an AI will kill us. But the internals/environment gap is the main qualitative barrier I know of; after that it should be more a matter of iteration and ordinary science.)
Good question.
First and most important: if you know beforehand that you’re at risk of entering such a state, then you should (according to your current values) probably put mechanisms in place to pressure your future self to restore your old reward stream. (This is not to say that fully preserving the reward stream is always the right thing to do, but the question of when one shouldn’t preserve one’s reward stream is a separate one, which we can factor apart from the question at hand.)
… and AFAICT, it happens that the human brain already works in a way which would make that happen to some extent by default. In particular, most of our day-to-day planning draws on cached value-estimates, which would still remain, at least for a time, even if the underlying rewards suddenly zeroed out (there’s a toy sketch of this below).
… and it also happens that other humans, like e.g. your friends, would probably prefer (according to their values) for you to have roughly-ordinary reward signals rather than zeros. So that would also push in a similar direction.
And again, you might decide to edit the rewards away from the original baseline afterwards. But that’s a separate question.
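To gesture at what I mean by cached value-estimates carrying behavior for a while: here’s a deliberately toy sketch, just standard tabular Q-learning on a made-up 5-state chain, not a claim about brain mechanisms or anyone’s actual model. The agent learns action-values while a reward sits at one end of the chain, then the reward stream is zeroed out; the greedy policy read off the cached values keeps steering toward the formerly-rewarded state, and only drifts as the values slowly decay under the new all-zero rewards.

```python
# Toy illustration only: tabular Q-learning on a 5-state chain.
# Learn with a reward at state 4, then zero out the reward stream and watch
# the greedy policy (read off the cached Q-values) keep pointing at state 4.
import random

random.seed(0)
N_STATES = 5          # states 0..4; episodes start at state 0; reward lives at state 4
ACTIONS = (-1, +1)    # step left / step right (clipped at the ends of the chain)
ALPHA, GAMMA = 0.1, 0.9

def step(state, action, reward_on):
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if (reward_on and nxt == N_STATES - 1) else 0.0
    return nxt, reward

def greedy(Q, s):
    # Pick the highest-value action from the cached table, breaking ties randomly.
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

def run_episodes(Q, n_episodes, reward_on, epsilon=0.2):
    for _ in range(n_episodes):
        s = 0
        for _ in range(30):
            a = random.choice(ACTIONS) if random.random() < epsilon else greedy(Q, s)
            s2, r = step(s, a, reward_on)
            target = r + GAMMA * max(Q[(s2, x)] for x in ACTIONS)
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            s = s2

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

run_episodes(Q, 1000, reward_on=True)        # learn while the reward stream is intact
policy_before = [greedy(Q, s) for s in range(N_STATES)]

run_episodes(Q, 50, reward_on=False)         # reward stream suddenly zeroed out
policy_after = [greedy(Q, s) for s in range(N_STATES)]

print("greedy policy with rewards intact:   ", policy_before)  # points toward state 4
print("greedy policy shortly after zeroing: ", policy_after)   # typically unchanged: cached values persist
```

The analogy is obviously loose; the point is just that behavior keyed off cached value-estimates doesn’t instantly track changes to the underlying reward signal.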
On the other hand, consider a mind which was never human in the first place, never had any values or rewards, and is given the same ability to modify its rewards as in your hypothetical. Then—I claim—that mind has no particular reason to favor any rewards at all. (Although we humans might prefer that it choose some particular rewards!)
Your question touched on several different things, so let me know if that missed the parts you were most interested in.