I agree with cubefox: you seem to be misinterpreting the claim that LLMs actually execute your intended instructions as a mere claim about whether LLMs understand your intended instructions. I claim there is a sharp distinction between actually executing instructions under a correct, legible interpretation and merely understanding those instructions; LLMs do the former, not just the latter.
Honestly, I think focusing on this element of the discussion is something of a distraction, because in my opinion the charitable interpretation of your posts is simply that you never thought it would be hard to get AIs to exhibit human-level reasonableness at interpreting and executing tasks below a certain capability level, and that the threshold at which these issues were predicted to arise was always intended to be far above GPT-4's level. This interpretation is plausible given what you wrote, and it could indeed save your theory from empirical falsification by our current observations.
That said, if you want to go this route and argue that “complexity of wishes”-type issues will eventually start occurring at some level of AI capability, I think it would help for you to clarify exactly what capability level you expect the issues of misinterpretation you described to start appearing at. For example, would either of the following observations contradict your theory of alignment?
1. At some point there’s a multimodal model that is roughly as intelligent as a 99th-percentile human on virtual long-horizon tasks (e.g. it can learn to play Minecraft well after a few hours of in-game play, can work in a variety of remote jobs, and can pursue coherent goals over several months), and yet this model allows you to shut it off, modify its weights, or otherwise change its mode of operation arbitrarily, i.e. it’s corrigible in a basic sense. Moreover, the model generally executes our instructions as intended, without any evidence of blatant instruction-misinterpretation or disobedience, before letting us shut it down.
2. AIs are widely deployed across the economy to automate a wide range of labor, including scientific research. This accelerates technological progress, prompting the development of nanotechnology sophisticated enough to allow for the creation of strawberries that are identical at the cellular but not the molecular level. As a result, you can purchase such strawberries at a store, and we haven’t all died yet despite these developments.