I more-or-less agree with Eliezer’s comment (to the extent that I have the data necessary to evaluate his words, which is greater than most, but still, I didn’t know him in 1996). I have a small beef with his bolded “MIRI is always in every instance” claim, because a universal like that is quite a strong claim, and I would be very unsurprised to find a single counterexample somewhere (particularly if we include every MIRI employee and everything they’ve ever said while employed at MIRI).
What I am trying to say is something looser and more gestalt. I do think what I am saying contains some disagreement with some spirit-of-MIRI, and possibly some specific others at MIRI, such that I could say I’ve updated on the modern progress of AI in a different way than they have.
For example, in my update, the modern progress of LLMs points towards the Paul side of some Eliezer-Paul debates. (I would have to think harder about how to spell out exactly which Eliezer-Paul debates.)
One thing I can say is that I myself often argued using “naive misinterpretation”-like cases such as the paperclip example. However, I was also very aware of the Eliezer-meme “the AI will understand what the humans mean, it just won’t care”. I would have predicted it to be difficult to build a system which correctly interprets, and correctly cares about, human requests to the extent that GPT-4 does.
This does not mean that AI safety is easy, or that it is solved; only that it is easier than I anticipated at this particular level of capability.
Getting more specific to what I wrote in the post:
My claim is that modern LLMs are “doing roughly what they seem like they are doing” and “internalize human intuitive concepts”. This does include some kind of claim that these systems are more-or-less ethical (they appear to be trying to be helpful and friendly, therefore they “roughly are”).
The reason I don’t think this contradicts Eliezer’s bolded claim (“Getting a shape into the AI’s preferences is different from getting it into the AI’s predictive model”) is that I read Eliezer as talking about strongly superhuman AI with this claim. It is not too difficult to get something into the values of some basic reinforcement learning agent, to the extent that something like that has values worth speaking of. It gets increasingly difficult as the agent gets cleverer. At the level of intelligence of, say, GPT-4, there is not a clear difference between getting the LLM to really care about something vs. merely getting those values into its predictive model. It may be deceptive or honest; or it could even be meaningless to classify it as deceptive or honest. This is less true of o1, since we can see it actively scheming to deceive.