They can solve it however they like, once they’re past the point of expecting things to work that sometimes don’t work. I have guesses but any group that still needs my hints should wait and augment harder.
I have guesses but any group that still needs my hints should wait and augment harder.
I think this is somewhat harmful to there being a field of (MIRI-style) Agent Foundations. It seems pretty bad to require that people attempting to start in the field work out the foundations themselves; I don’t think any scientific field has worked this way in the past.
Maybe the view is that if people can’t work out the basics then they won’t be able to make progress, but this doesn’t seem at all clear to me. Many physicists in the 20th century were unable to derive the basics of quantum mechanics or general relativity, but once they were given the foundations they were able to do useful work. I think the skills needed to work out the foundations of a new field can be different from those needed to build on those foundations.
Also, maybe these “hints” aren’t that useful and so they’re not worth sharing. Or (more likely in my view) the hints are tied up with dangerous information such that sharing increases risk, and you want to have more signal on someone’s ability to do good work before taking that risk.
If folks want to understand how Eliezer would tackle the problem they can read the over-one-million thoughtfully written words he has published across LessWrong and Arbital and in MIRI papers about how to solve hard problems in general and how to think about AI in particular. If folks still feel that they need low-confidence hints after that then I think they will probably not benefit much from hearing Eliezer’s current guesses, and I suspect may be trying to guess the teacher’s passwords rather than solve the problem.
Is there actually anyone who has read all those words and thus understands how Eliezer would tackle the problem? (Do you, for example?)
This is not a rhetorical question, nor a trivial one; for example, I notice that @Vanessa Kosoy apparently misunderstood some major part of it (as evidenced by this comment thread), and she’s been a MIRI researcher for 8 years (right? or so LinkedIn says, anyhow). So… who does get it?
I think this is a reasonable response, but also, independently of whether Eliezer successfully got the generators of his thought-process across, the volume of words still seems like substantial evidence that it’s reasonable for Eliezer to not think that marginally writing more will drastically change things from his perspective.
Sure. My comment was not intended to bear on the question of whether it’s useful for Eliezer to write more words or not—I was only responding directly to Ben.
EDIT: Although, of course, if the already-written words haven’t been effective, then that is also evidence that writing more words won’t help. So, either way, I agree with your view.
I am a bit unsure of the standard here for “understands how Eliezer would tackle the problem”. Is the standard “the underlying philosophy was successfully communicated to lots of people”? If so, I’ll note as background that such underlying philosophies of solving problems are typically hard to communicate: someone who reads Shannon’s or Pearl’s writing does not become their equal, someone who read Faraday’s notes couldn’t pick up his powerful insights until Maxwell formalized them neatly[1], and (for picking up philosophies broadly) someone who reads Aurelius or Hume or Mill cannot produce the sorts of works those people would have produced had they lived longer or in the present day.
Is the standard “people who read Eliezer’s writing go on to do research that Eliezer judges to be remotely on the right track”? Then I think a number of folks have done work that Eliezer would view as at least somewhat on the right track, including lots of MIRI researchers, Wei Dai, John Wentworth, the mesa-optimizers paper, etc. It would take a bit of effort for me to give an overview of all the folks who took parts of Eliezer’s writing and did useful things as a direct result of the specific ideas, and then to assess how successful that counts as.
My current guess is that Eliezer’s research dreams have not been that successfully picked up by others relative to what I would have predicted 10 years ago. I am not confident why this is — perhaps other institutional support has been lacking for people to work on it, perhaps there have been active forces in the direction of “not particularly trying to understand AI” due to the massive success of engineering approaches over scientific approaches, perhaps Eliezer’s particular approach has flaws that make it not very productive.
I think my primary guess for why there has been less progress on Eliezer’s research dreams is that the subject domain is itself very annoying to get in contact with, due to us not having a variety of superintelligences to play with, and my anticipation that when we do get to that stage our lives will soon be over, so it’s much harder to make any progress on these problems than it is in the vast majority of other domains humans have been successful in.
Nate Soares has a blogpost discussing this point that I found insightful; here’s a quote:
Faraday discovered a wide array of electromagnetic phenomena, guided by an intuition that he wasn’t able to formalize or transmit except through hundreds of pages of detailed laboratory notes and diagrams; Maxwell later invented the language to describe electromagnetism formally by reading Faraday’s work, and expressed those hundreds of pages of intuitions in three lines.
I am a bit unsure of the standard here for “understands how Eliezer would tackle the problem”.
Huh? Isn’t this a question for you, not for me? You wrote this:
If folks want to understand how Eliezer would tackle the problem they can read the over-one-million thoughtfully written words he has published across LessWrong and Arbital and in MIRI papers about how to solve hard problems in general and how to think about AI in particular.
So, the standard is… whatever you had in mind there?
Oh. Here’s a compressed gloss of this conversation as I understand it.
Eliezer: People should give up on producing alignment theory directly and instead should build augmented humans to produce aligned agents.
Vanessa: And how do you think the augmented humans will go about doing so?
Eliezer: I mean, the point here is that’s a problem for them to solve. Insofar as they need my help they have not augmented enough. (Ben reads into Eliezer’s comment: And insofar as you wish to elicit my approach, I have already written over a million words on this, a few marginal guesses ain’t gonna do much.)
Peter: It seems wrong to not share your guesses here; people should not have to build the foundations themselves without help from Eliezer.
Ben: Eliezer has spent many years writing up and helping build the foundations; this comment seems very confusing given that context.
Said: But have Eliezer’s writings worked?
Ben: (I’m not quite sure what the implied relationship is of this question to the question of whether Eliezer has tried very hard to help build a foundation for alignment research, but I will answer the question directly.) To the incredibly high standard that ~nobody ever succeeds at, it has not succeeded. To the lower standard that scientists sometimes succeed at, it has worked a little but has not been sufficient.
My first guess was that you’re implicitly asking because, if it has not been successful, then you think Eliezer still ought to answer questions like Vanessa’s. I am not confident in this guess.
If you’re simply asking for clarification on whether Eliezer’s writing works to convey his philosophy and approach to the alignment problem, I have now told you two standards to which I would evaluate this question and how I think it does on those standards. Do you understand why I wrote what I wrote in my reply to Peter?
Uh… I think you’re somewhat overcomplicating things. Again, you wrote (emphasis mine):
If folks want to understand how Eliezer would tackle the problem they can read the over-one-million thoughtfully written words he has published across LessWrong and Arbital and in MIRI papers about how to solve hard problems in general and how to think about AI in particular.
In other words, you wrote that if people want X, they can do Y. (Implying that doing Y will, with non-trivial probability, cause people to gain X.[1]) My question was simply whether there exists anyone who has done Y and now, as a consequence, has X.
There are basically two possible answers to this:
“Yes, persons A, B, C have all done Y, and now, as a consequence, have X.”
Or
“No, there exists no such person who has done Y and now, as a consequence, has X.”
(Which may be because nobody has done Y, or it may be because some people have done Y, but have not gained X.)
There’s not any need to bring in questions of standards or trying hard or… anything like that.
So, to substitute the values back in for the variables, I could ask:
Would you say that you “understand how Eliezer would tackle the problem”?
and:
Have you “read the over-one-million thoughtfully written words he has published across LessWrong and Arbital and in MIRI papers about how to solve hard problems in general and how to think about AI in particular”?
Likewise, the same question about Vanessa: does she “understand … etc.”, and has she “read the … etc.”?
Again, what I mean by all of these words is just whatever you meant when you wrote them.
“Implying” in the Gricean sense, of course, not in the logical sense. Strictly speaking, if we drop all Gricean assumptions, we could read your original claim as being analogous to, e.g., “if folks want to grow ten feet tall, they can paint their toenails blue”—which is entirely true, and yet does not, strictly speaking, entail any causal claims. I assumed that this kind of thing is not what you meant, because… that would make your comment basically pointless.
Ah, but I think you have assumed slightly too much in what I meant. I simply meant to say that if someone wants Eliezer’s help in their goal to “work out the foundations” to “a field of (MIRI-style) Agent Foundations”, the better way to understand Eliezer’s perspective on these difficult questions is to read the high-effort and thoughtful blog posts and papers he produced that are intended to communicate his perspective on these questions, rather than ask him for some quick guesses on how one could in-principle solve the problems. I did not mean to imply that (either) way would necessarily work, as the goal itself is hard to achieve; I simply meant that one approach is clearly much more likely than the other to achieve that goal.
(That said, again to answer your question, my current guess is that Nate Soares is an example of a person who read those works and then came to share a lot of Eliezer’s approach to solving the problems. Though I’m honestly not sure how much he would put down to the factors of (a) reading the writing, (b) working directly with Eliezer, and (c) trying to solve the problem himself and coming up with similar approaches. I also think that Wei Dai at least understood enough to make substantial progress on an open research question in decision theory, and similar things can be said of Scott Garrabrant re: logical uncertainty and others at MIRI.)
Fwiw this doesn’t feel like a super helpful comment to me. I think there might be a nearby one that’s more useful, but this felt kinda coy for the sake of being coy.
Yeah, I feel this is quite similar to OpenAI’s plan to defer alignment to future AI researchers, except worse. Even if we grant that the proposed plan actually made the augmented humans stably aligned with our values, scalable oversight would be far easier with AIs, because we have a bunch of advantages around controlling AIs: it is socially acceptable to control AI in ways that would not be socially acceptable if humans were involved, the incentives to control AI are much stronger than the incentives to control humans, etc.
I truly feel like Eliezer has reinvented, in a worse form, a plan that OpenAI/Anthropic are already pursuing, namely deferring alignment work to future intelligences. Eliezer doesn’t realize this, so the comments treat it as something new rather than as an existing plan with AI swapped out for humans.
It’s not just coy, it’s reinventing an idea that’s already there, except worse, and he doesn’t tell you that if you swap the human for AI, it’s already being done.
Scientific breakthroughs live on the margins, so if he has guesses on how to achieve alignment, sharing them could make a huge difference.
Link for why AI is easier to control than humans below:
https://optimists.ai/2023/11/28/ai-is-easy-to-control/
fwiw, this seems false to me and not particularly related to what I was saying.
Even a small probability of solving alignment should have big expected utility modulo exfohazard. So why not share your guesses?