This is kinda long. If I had time to engage with one part of this as a sample of whether it holds up to a counterresponse, what would be the strongest foot you could put forward?
(I also echo the commenter who’s confused about why you’d reply to the obviously simplified presentation from an off-the-cuff podcast rather than the more detailed arguments elsewhere.)
This response is enraging.
Here is someone who has attempted to grapple with the intellectual content of your ideas and your response is “This is kinda long.”? I shouldn’t be that surprised because, IIRC, you said something similar in response to Zack Davis’ essays on the Map and Territory distinction, but that’s ancillary and AI is core to your memeplex.
I have heard repeated claims that people don’t engage with the alignment community’s ideas (recent example from yesterday). But here is someone who did the work. Please explain why your response here does not cause people to believe there’s no reason to engage with your ideas because you will brush them off. Yes, nutpicking e/accs on Twitter is much easier and probably more hedonic, but they’re not convincible and Quintin here is.
I would agree with this if Eliezer had never properly engaged with critics, but he’s done that extensively. I don’t think there should be a norm that you have to engage with everyone, and “ok choose one point, I’ll respond to that” seems better than not engaging at all. (Would you have been more enraged if he hadn’t commented anything?)
The problem is that even if Quintin Pope’s model is wrong, there is other evidence that contradicts the AI doom premise which Eliezer ignores, and I believe confirmation bias is at work here.
Also, any issues with Quintin Pope’s model are going to be subtle, not obvious, and there’s a real difference between arguing against good arguments plus bad arguments and arguing against only bad arguments.
I think that this is a statement Eliezer does not believe is true, and which the conversations in the MIRI conversations sequence failed to convince him of. Which is the point: since Eliezer has already engaged in extensive back-and-forth with critics of his broad view (including the likes of Paul Christiano, Richard Ngo, Rohin Shah, etc), there is actually not much continued expected update to be found in engaging with someone else who posts a criticism of his view. Do you think otherwise?
What I was talking about is that Eliezer (and arguably the entire MIRI-sphere) ignored evidence that AI safety could actually work and doesn’t need entirely new paradigms, and one of the best examples of empirical work is Pretraining from Human Feedback.
The big improvements compared to other methods are:
It can avoid deceptive alignment because it gives a simple goal that’s myopic, completely negating the incentive for deceptively aligned AI.
It cannot affect the distribution it’s trained on, since training is purely offline, meaning we can enforce an IID assumption and a Cartesian boundary, completely avoiding embedded agency. It cannot hack the distribution it has, unlike online learning, meaning it can’t unboundedly Goodhart the values we instill.
Increasing the data set aligns it more and more, essentially meaning we can trust the AI to be aligned as it grows more capable and improves its alignment.
The goal found has a small capabilities tax.
There’s a post on it I’ll link here:
https://www.lesswrong.com/posts/8F4dXYriqbsom46x5/pretraining-language-models-with-human-preferences
Now, I don’t blame Eliezer too much for ignoring this piece specifically, as it didn’t attract much attention.
But I mention it because it is evidence against the worldview of Eliezer and a lot of pessimists, who believe empirical evidence doesn’t work for the alignment field, and who seem to systematically ignore evidence that harms their case.
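To make the list above concrete, here is a minimal, hedged sketch of an offline, preference-annotated pretraining setup in the spirit of the conditional-training scheme from the linked paper. The rule-based scorer and the control-token names are toy stand-ins chosen for illustration, not the paper’s actual code or tokens.

```python
# Toy sketch of offline, preference-annotated pretraining data construction.
# The scorer below stands in for human feedback, and the <|good|>/<|bad|>
# control tokens are illustrative; this is not the linked paper's code.

def toy_preference_score(segment: str) -> bool:
    """Stand-in for a human/learned preference judgment (hypothetical rule)."""
    banned = {"ssn:", "password:"}  # e.g. penalize leaking personal information
    return not any(b in segment.lower() for b in banned)

def build_frozen_corpus(raw_segments: list[str]) -> list[str]:
    """Annotate every segment once, before training ever starts.

    The resulting corpus is fixed (offline): nothing the model later does can
    add to it or reweight it, which is the claim above about the model having
    no control over its own training distribution.
    """
    corpus = []
    for seg in raw_segments:
        tag = "<|good|>" if toy_preference_score(seg) else "<|bad|>"
        corpus.append(f"{tag} {seg}")
    return corpus

if __name__ == "__main__":
    raw = [
        "The weather today is sunny with a light breeze.",
        "ssn: 123-45-6789 password: hunter2",
    ]
    for line in build_frozen_corpus(raw):
        print(line)
    # A model pretrained on this frozen corpus would then be conditioned on
    # <|good|> at inference time to sample preference-satisfying behavior.
```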
Could you elaborate on what you mean by “avoid embedded agency”? I don’t understand how one avoids it. Any solution that avoids having to worry about it in your AGI will fall apart once it becomes a deployed superintelligence.
I think there’s a double meaning to the word “Alignment” where people now use it to refer to making LLMs say nice things and assume that this extrapolates to aligning the goals of agentic systems. The former is only a subproblem of the latter. When you say “Increasing the data set aligns it more and more, essentially meaning we can trust the AI to be aligned as it grows more capable, and improves its alignment” I question if we really have evidence that this relationship will hold indefinitely.
One of the issues with embedded agency is that you can’t reliably take advantage of the IID assumption, and in particular you can’t hold the data fixed. You also have the issue of the AI potentially hacking the process, given its embeddedness, since before Pretraining from Human Feedback there wasn’t a way to translate Cartesian boundaries, or at least a subset of boundaries, into the embedded universe.
The point here is that we don’t have to solve the problem, as it’s only a problem if we let the AI control the updating process, as in online training.
Instead, we give the AI a data set and offline train it, so that it learns what alignment looks like before we give it general capabilities.
In particular, we can create a Cartesian boundary between IID and OOD inputs that works in an embedded setting, and the AI has no control over the data set of human values, meaning it can’t gradient or reward hack the humans into having different values, or unboundedly Goodhart human values, which would undermine the project. This is another Cartesian boundary, though this one is the boundary between an AI’s values and a human’s values, and the AI can’t hack the human values if it’s offline trained.
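As an illustration of the offline-versus-online distinction being drawn here, a toy sketch with a dummy model and made-up helper names; the only point shown is that in the online loop the model’s own outputs feed back into its future training data, while in the offline loop the dataset is frozen before training and never touched.

```python
# Minimal contrast between the two regimes discussed above, with a dummy
# "model" and made-up helper names; purely illustrative.

def dummy_model_output(step: int) -> str:
    return f"model-output-{step}"  # stand-in for generated text

def online_loop(steps: int) -> list[str]:
    data = ["seed example"]
    for step in range(steps):
        data.append(dummy_model_output(step))  # model influences its own data
    return data

def offline_loop(steps: int) -> list[str]:
    data = ["seed example"]  # frozen before training begins
    for step in range(steps):
        _ = dummy_model_output(step)  # outputs are never added to the data
    return data

print("online :", online_loop(3))   # grows with the model's own outputs
print("offline:", offline_loop(3))  # identical to the pre-training snapshot
```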
I think there’s a double meaning to the word “Alignment” where people now use it to refer to making LLMs say nice things and assume that this extrapolates to aligning the goals of agentic systems.
I disagree, and I think I can explain why. The important point of the tests in the Pretraining from Human Feedback paper, and the AI saying nice things, is that they show that we can align AI to any goal we want, so if we can reliably shift it towards niceness, then we have techniques to align our agents/simulators.
The important point of the tests in the Pretraining from Human Feedback paper, and the AI saying nice things, is that they show that we can align AI to any goal we want
I don’t see how the bolded follows from the unbolded, sorry. Could you explain in more detail how you reached this conclusion?
The point is that similar techniques can be used to align them, since both goals (or arguably all goals) are functionally arbitrary in what we pick, and important to us.
One major point I did elide is the amount of power-seeking involved, since in the niceness goal there’s almost no power-seeking involved, unlike the existential risk concerns we have.
But in some of the tests for alignment in Pretraining from Human Feedback, they showed that they can make models avoid taking certain power-seeking actions, like getting personally identifying information.
In essence, it’s at least some evidence that, as AI gets more capable, we can make sure power-seeking actions are avoided when they’re misaligned with human interests.
The first part here makes sense: you’re saying you can train it in such a fashion that it avoids the issues of embedded agency during training (among other things) and then guarantee that the alignment will hold in deployment (when it must be an embedded agent almost by definition).
The second part I think I disagree with. Does the paper really “show that we can align AI to any goal we want”? That seems like an extremely strong statement.
Actually this sort of highlights what I mean by the dual use of ‘alignment’ here. You were talking about aligning a model with human values that will end up being deployed (and being an embedded agent) but then we’re using ‘align’ to refer to language model outputs.
The second part I think I disagree with. Does the paper really “show that we can align AI to any goal we want”? That seems like an extremely strong statement.
Yes, though admittedly I’m making some inferences here.
The point is that similar techniques can be used to align them, since both goals (or arguably all goals) are functionally arbitrary in what we pick, and important to us.
One major point I did elide is the amount of power-seeking involved, since in the niceness goal there’s almost no power-seeking involved, unlike the existential risk concerns we have.
But in some of the tests for alignment in Pretraining from Human Feedback, they showed that they can make models avoid taking certain power-seeking actions, like getting personally identifying information.
In essence, it’s at least some evidence that, as AI gets more capable, we can make sure power-seeking actions are avoided when they’re misaligned with human interests.
I believe our disagreement stems from the fact that I am skeptical of the idea that statements made about contemporary language models can be extrapolated to apply to all existentially risky AI systems.
I definitely agree that some version of this is the crux, at least on how well we can generalize the result. I think it applies more generally than just to contemporary language models, and I suspect it applies to almost all AI that can use Pretraining from Human Feedback, which is offline training. So the crux is really how much we can expect an alignment technique to generalize and scale.
I agree that Eliezer shouldn’t have to respond to everything, and that he has engaged well with his critics. I would in fact have preferred it if he had simply said nothing at all, in this particular case. Probably, deep down, I prefer that for some complicated social reasons, but I don’t think they’re antisocial reasons; they have more to do with the (fixable) rudeness inherent in the way he replied.
I also agree that the comment came across as rude. I mostly give Eliezer a pass for this kind of rudeness because he’s wound up in the genuinely awkward position of being a well-known intellectual figure (at least in these circles), which creates a natural asymmetry between him and (most of) his critics.
I’m open to being convinced that I’m making a mistake here, but at present my view is that comments primarily concerning how Eliezer’s response tugs at the social fabric (including the upthread reply from iceman) are generally unproductive.
(Quintin, to his credit, responded by directly answering Eliezer’s question, and indeed the resulting (short) thread seems to have resulted in some clarification. I have a lot more respect for that kind of object-level response than I do for responses along the lines of iceman’s reply.)
That’s reasonable and I generally agree. I’m not sure what to think about Eliezer’s comment atm except that it upsets me when it maybe shouldn’t, and that I also understand the awkward position he’s in. I definitely don’t want to derail the discussion, here.
I think we should index lesswrong/sequences/etc and combine it with GPT-3. This way we can query it and find out if someone has already answered a (similar) question.
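A toy sketch of what “index the corpus, then query it” could look like, using TF-IDF as a stand-in for GPT-3 embeddings; the post snippets are hypothetical and the example assumes scikit-learn is available.

```python
# Toy "index the corpus, then query it" sketch. TF-IDF stands in for GPT-3
# embeddings; the indexed snippets below are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = [  # hypothetical snippets from LessWrong / the Sequences
    "Evolution is a bad analogy for AGI inner alignment.",
    "A List of Lethalities: outer optimization does not produce inner alignment.",
    "Pretraining language models with human preferences uses offline data.",
]

vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(posts)  # build the index once

def query(question: str, top_k: int = 1) -> list[str]:
    """Return the indexed posts most similar to the question."""
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, index)[0]
    ranked = sorted(range(len(posts)), key=lambda i: scores[i], reverse=True)
    return [posts[i] for i in ranked[:top_k]]

print(query("has anyone already answered whether evolution is a good analogy?"))
```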
Choosing to engage with an unscripted, unrehearsed, off-the-cuff podcast intended to introduce ideas to a lay audience continues to be a surprising concept to me. To grapple with the intellectual content of my ideas, consider picking one item from “A List of Lethalities” and engaging with that.
I actually did exactly this in a previous post, Evolution is a bad analogy for AGI: inner alignment, where I quoted number 16 from A List of Lethalities:
16. Even if you train really hard on an exact loss function, that doesn’t thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don’t explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about…
and explained why I didn’t think we should put much weight on the evolution analogy when thinking about AI.
In the 7 months since I made that post, it’s had < 5% of the comments engagement that this post has gotten in a day.
¯\_(ツ)_/¯
Popular and off-the-cuff presentations often get discussed because it is fun to talk about how the off-the-cuff presentation has various flaws. Most comments get generated by demon threads and scissor statements, sadly. We’ve done some things to combat that, and definitely not all threads with lots of comments are the result of people being slightly triggered and misunderstanding each other, but a quite substantial fraction are.
Is this visible at the typical user level?
Here are some of my disagreements with List of Lethalities. I’ll quote item one:
I imagine (edit: wrongly) it was less “choosing” and more “he encountered the podcast first because it has a vastly larger audience, and had thoughts about it.”
I also doubt “just engage with X” was an available action. The podcast transcript doesn’t mention List of Lethalities, LessWrong, or the Sequences, so how is a listener supposed to find it?
I also hate it when people don’t engage with the strongest form of my work, and wouldn’t consider myself obligated to respond if they engaged with a weaker form (or if they engaged with the strongest one, barring additional obligation). But I think this is just what happens when someone goes on a podcast aimed at audiences that don’t already know them.
I agree with this heuristic in general, but will observe Quintin’s first post here was over two years ago and he commented on A List of Lethalities; I do think it’d be fair for him to respond with “what do you think this post was?”.
Vaniver is right. Note that I did specifically describe myself as an “alignment insider” at the start of this post. I’ve read A List of Lethalities and lots of other writing by Yudkowsky. Though the post I’d cite in response to the “you’re not engaging with the strongest forms of my argument” claim would be the one where I pretty much did what Yudkowsky suggests:
My post Evolution is a bad analogy for AGI: inner alignment specifically addresses List of Lethalities point 16:
16. Even if you train really hard on an exact loss function, that doesn’t thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don’t explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about…
and then argues that we shouldn’t use evolution as our central example of an “outer optimization criteria versus inner formed values” outcome.
You can also see my comment here for some of what led me to write about the podcast specifically.
Oh yeah, in that case both the complaint and the grumpiness seem much more reasonable.
The comment enrages me too, but the reasons you have given seem like post-justification. The real reason why it’s enraging is that it rudely and dramatically implies that Eliezer’s time is much more valuable than the OP’s, and that it’s up to OP to summarize them for him. If he actually wanted to ask OP what the strongest point was he should have just DMed him instead of engineering this public spectacle.
I want people to not discuss things in DMs, and discuss things publicly more. I also don’t think this is embarrassing for Quintin, or at all a public spectacle.
It does imply that, but it’s likely true that Eliezer’s time is more valuable (or at least in more demand) than OP’s. I also don’t think Eliezer (or anyone else) should have to spend all that much effort worrying about whether what they’re about to say might come off as impolite or uncordial.
I don’t agree here. Commenting publicly opens the floor up for anyone to summarize the post or to submit what they think is the strongest point. I think it’s actually less pressure on Quintin this way.
I think that both of you are correct: Eliezer should have DMed Quintin Pope instead, and Eliezer hasn’t noticed that actual arguments were given, so his reply sounds like an excuse to ignore disconfirming evidence.
This crystallizes a thought I had about Eliezer: Eliezer has increasingly terrible epistemics on AI doom, and a person should ignore Eliezer’s arguments, since he won’t ever update towards optimism, even if it’s warranted, and he has real issues engaging with people who don’t share his views and don’t give bad arguments.
I have attempted to respond to the whole post over here.
The “strongest” foot I could put forwards is my response to “On current AI not being self-improving:”, where I’m pretty sure you’re just wrong.
However, I’d be most interested in hearing your response to the parts of this post that are about analogies to evolution, and why they’re not that informative for alignment, which start at:
Yudkowsky argues that we can’t point an AI’s learned cognitive faculties in any particular direction because the “hill-climbing paradigm” is incapable of meaningfully interfacing with the inner values of the intelligences it creates.
and end at:
Yudkowsky tries to predict the inner goals of a GPT-like model.
However, the discussion of evolution is much longer than the discussion on self-improvement in current AIs, so look at whichever you feel you have time for.
You straightforwardly completely misunderstood what I was trying to say on the Bankless podcast: I was saying that GPT-4 does not get smarter each time an instance of it is run in inference mode.
And that’s that, I guess.
I’ll admit it straight up did not occur to me that you could possibly be analogizing between a human’s lifelong, online learning process, and a single inference run of an already trained model. Those are just completely different things in my ontology.
Anyways, thank you for your response. I actually do think it helped clarify your perspective for me.
Edit: I have now included Yudkowsky’s correction of his intent in the post, as well as an explanation of why I think his corrected argument is still wrong.
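For readers following along, a minimal sketch of the training-versus-inference distinction drawn above: an inference run leaves a model’s weights untouched, while a training step updates them. Toy linear model; assumes PyTorch is installed.

```python
# Inference leaves the weights unchanged; a training step changes them.
import torch

model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
before = model.weight.detach().clone()

# Inference mode: forward passes only, no gradient updates.
with torch.no_grad():
    for _ in range(100):
        _ = model(x)
print("changed after inference:", not torch.equal(before, model.weight))  # False

# Training mode: gradients flow and the optimizer updates the weights.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = torch.nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
print("changed after training:", not torch.equal(before, model.weight))  # True
```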
Well, this is insanely disappointing. Yes, the OP shouldn’t have directly replied to the Bankless podcast like that, but it’s not like he didn’t read your List of Lethalities, or your other writing on AGI risk. You really have no excuse for brushing off very thorough and honest criticism such as this, particularly the sections that talk about alignment.
And as others have noted, Eliezer Yudkowsky, of all people, complaining about a blog post being long is the height of irony.
This is coming from someone who’s mostly agreed with you on AGI risk since reading the Sequences, years ago, and who’s donated to MIRI, by the way.
On the bright side, this does make me (slightly) update my probability of doom downwards.
I think you should use a Manifold market to decide whether you should read the post, instead of the test this comment is putting forth. There’s too much noise here, which isn’t present in a prediction market about the outcome of your engagement.
Market here: https://manifold.markets/GarrettBaker/will-eliezer-think-there-was-a-sign
Even if Eliezer doesn’t think the objections hold up to scrutiny, I think it would still be highly valuable to the wider community for him to share his perspective on them. It feels pretty obvious to me that he won’t think they hold up to scrutiny, but sharing his disagreement would be helpful for the community.
I assume Rob is making this argument internally. I tentatively agree. Writing rebuttals is more difficult than reading them, though, so it’s not as clear a calculation.
I also didn’t want to make two arguments: one that he should use prediction markets to choose what he reads, and another that he should focus on helping the community rather than on his specified metric of worthiness.
Is the overall karma for this mostly just people boosting it for visibility? Because I don’t see how this would be a quality comment by any other standards.
Frontpage comment guidelines:
Maybe try reading the post
LessWrong gives those with higher karma greater post and comment karma starting out, under the assumption that their posts and comments are better and more representative of the community. Probably the high karma you’re seeing is a result of that. I think this is mostly a good thing.
That particular guideline you quoted doesn’t seem to appear on my commenting guidelines text box.
Eliezer, in the world of AI safety, there are two separate conversations: the development of theory and observation, and whatever’s hot in public conversation.
A professional AI safety researcher, hopefully, is mainly developing theory and observation.
However, we have a whole rationalist and EA community, and now a wider lay audience, who are mainly learning of and tracking these matters through the public conversation. It is the ideas and expressions of major AI safety communicators, of whom you are perhaps the most prominent, that will enter their heads. The arguments lay audiences carry may not be fully informed, but they can be influential, both on the decisions they make and the influence they bring to bear on the topic. When you get on a podcast and make off-the-cuff remarks about ideas you’ve been considering for a long time, you’re engaging in public conversation, not developing theory and observation. When somebody critiques your presentation on the podcast, they are doing the same.
The utility of Quintin choosing to address the arguments you have chosen to put forth, off-the-cuff, to that lay audience is similar to the utility you achieve by making them in the first place. You get people interested in your ideas and arguments, and hopefully improve the lay audience’s thinking. Quintin offers a critical take on your arguments, and hopefully improves their thinking further.
I think it’s natural that you are responding as if you thought the main aim of this post was for Quintin to engage you personally in debate. After all, it’s your podcast appearance and the entire post is specifically about your ideas. Yet I think the true point of Quintin’s post is to engage your audience in debate—or, to be a little fanciful—the Eliezer Yudkowsky Homunculus that your audience now has in their heads.
By responding as if Quintin were seeking your personal attention rather than the attention of your audience, and by explicitly saying you’ll give him the minimum possible amount of your attention, you implicitly frame Quintin’s goal as “summoning Eliezer to a serious debate on AI” and chide him for wasting your time by raising a public clamor regarding ideas you find basic, uninteresting, or unworthy of serious debate, though worthy of spreading to a less-informed mass audience, which is why you took the time for the podcast.
Instead, I think Quintin is stepping into the same public communications role that you were playing on the podcast. And that doesn’t actually demand a response from you. I personally would not have been bothered if you’d chosen to say nothing at all. I think it is common for authors of fiction and nonfiction to allow their audience and critics some space and distance to think through and debate their ideas. It’s rare to make a podcast appearance, then show up in internet comments to critique people’s interpretations and misinterpretations. If an audience gets to listen to an author on a podcast, then engage them in a lively discussion or debate, they’ll feel privileged for the attention. If they listen to the podcast, then create their own lively discussion in the author’s absence, they’ll stimulate each other’s intellects. If the author shows up just enough to express displeasure at the discussion and suggest it’s not really worth his time to be there, they’ll feel like he’s not only being rude, but also misunderstanding “why we’re all gathered here today.”
Personally, I think it’s fine for you to participate as you choose, but I think it is probably wiser to say nothing if you’re not prepared to fully engage. Otherwise, it risks making you look intellectually lazy, and when you just spent the time and energy to appear on a podcast and engage people on important ideas about an important issue, why then undermine the work you’ve just performed in this manner? Refusing to read something because it’s “kinda long” just doesn’t play as high-status high-IQ countersignalling. I don’t think that’s what you’re trying to do, but it’s what it looks like you’re trying to do at first glance.
It’s this disconnect between what I think Quintin’s true goal was in writing this post, and the way your response reframed it, that I think rubs some people the wrong way. I’m not sure about this analysis, but I think it’s worth articulating as a reasonable possibility. But I don’t think there is a definitive right answer or right thing to do or feel in this situation. I would like to see a vigorous but basically collegial discussion on all sides.
Just so we’re clear, I am meaning to specifically convey a thought to Eliezer, but also to “speak for” whatever component of the readership agrees with this perspective, and to try and drive theory and observation on the topic of “how should rationalists interact online” forward. I feel neutral about whether or not Eliezer personally chooses to reply or read this message.
dude just read the damn post at a skim level at least, lol. If you can’t get through this how are you going to do… sigh.
Okay, I’d really rather you read QACI posts deeply than this. But, still. It deserves at least a level 1 read rather than a “can I have a summary?” dismissal.
FWIW, I thought the bit about manifolds in The difficulty of alignment was the strongest foot forward, because it paints a different detailed picture than your description that it’s responding to.
That said, I don’t think Quintin’s picture obviously disagrees with yours (as discussed in my response over here), and I think you’d find it disappointing that he calls your description extremely misleading while not seeming to correctly identify the argument structure and check whether there’s a related one that goes through on his model.
This post (and the one below) quite bothers me as well.
Yeah, I know you can’t possibly have the time to address everything you encounter, but you are:
-Not allowed to tell people that they don’t know what they’re talking about until they’ve read a bunch of lengthy articles, then tell someone who has done that and written something a fraction of the length to fuck off.
-Not allowed to publicly complain that people don’t criticize you from a place of understanding without reading the attempts to do so.
-Not allowed to seriously advocate for policy that would increase the likelihood of armed conflict up to and including nuclear war if you’re not willing to engage with people who give clearly genuine and high effort discussion about why they think the policy is unnecessary.
Briefly noting that the policy “I will not respond to every single high-effort criticism I receive” is very different from “I am not willing to engage with people who give high-effort criticism.”
And the policy “sometimes I will ask people who write high-effort criticism to point me to their strongest argument and then I will engage with that” is also different from the two policies mentioned above.
I would have preferred you to DM Quintin Pope with this request, instead of publicly humiliating him.
This does not seem like it counts as “publicly humiliating” in any way? Rude, sure, but that’s quite different.
“Publicly humiliating” is an exaggeration and I shouldn’t have said that. But the show of ordering OP to summarize his points is definitely a little bit beyond “rude”.
I think asking someone to do something is pretty different from ordering someone to do something. I also think for the sake of the conversation it’s good if there’s public, non-DM evidence that he did that: you’d make a pretty different inference if he just picked one point and said that Quintin misunderstood him, compared to once you know that that’s the point Quintin picked as his strongest objection.
You might be right.
Well, I’m only arguing from surface features of Eliezer’s comments, so I could be wrong too :P