Are you making a forecast about the inability of AIs in, say, 2026 to operate mostly autonomously for long periods in diverse environments, fulfilling goals? I’d potentially be interested to place bets with you if so.
My forecast would be that an AI that operates autonomously for long periods would be composed of pieces that make human-interpretable progress in the short term. For example, a self-driving car will eventually be able to drive from New York to Los Angeles, but I believe it would do so by decomposing the task into many small tasks of getting from point A to point B. It would not do so by being sent out into the world (or even a simulated world) to repeatedly play a game where it gets a reward if it reaches Los Angeles and nothing if it doesn’t.
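To make that contrast concrete, here is a minimal sketch in Python of the two setups: a decomposed trip where every leg is something a human can evaluate, versus a single sparse end-to-end reward. Everything here (Waypoint, plan_route, drive_leg) is a hypothetical placeholder for illustration, not any real self-driving stack.

```python
# Hypothetical sketch: contrast human-interpretable task decomposition with a
# single sparse end-to-end reward for the New York -> Los Angeles example.
from dataclasses import dataclass


@dataclass
class Waypoint:
    name: str


def plan_route(start: Waypoint, goal: Waypoint) -> list:
    """Hypothetical planner: breaks the trip into short A-to-B legs that a
    human could inspect and sign off on individually."""
    # Hard-coded here; in practice this might come from a mapping service.
    return [Waypoint("Pittsburgh"), Waypoint("Chicago"), Waypoint("Denver"),
            Waypoint("Las Vegas"), goal]


def drive_leg(current: Waypoint, target: Waypoint) -> Waypoint:
    """Hypothetical short-horizon controller: each call makes progress a
    human can judge on its own (did the car get from A to B safely?)."""
    print(f"Driving {current.name} -> {target.name}")
    return target


def decomposed_trip(start: Waypoint, goal: Waypoint) -> None:
    """The decomposed version: success is evaluated leg by leg."""
    position = start
    for leg in plan_route(start, goal):
        position = drive_leg(position, leg)


def sparse_end_to_end_reward(final_position: Waypoint, goal: Waypoint) -> float:
    """The alternative argued against above: one reward at the very end,
    with no human-legible intermediate structure to evaluate or correct."""
    return 1.0 if final_position.name == goal.name else 0.0


decomposed_trip(Waypoint("New York"), Waypoint("Los Angeles"))
```

The point is just where the feedback attaches: to each short leg a human can check, or only to the final outcome.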
That sounds very different to me from “the current paradigm is almost in direct contradiction to long-term goals.” Maybe we agree after all. Possible remaining differences between us: Are you imagining 0 end-to-end training of the system, or just a small amount? For example, consider WebGPT and OpenAI’s more recent Minecraft agent. They were trained for many steps with unsupervised learning and then fine-tuned for a bit with end-to-end RL, if I recall correctly. Are you saying that insofar as an AI operates autonomously for more than, say, 100,000 serial forward passes, it’ll involve 0 end-to-end training? If so, then I’d disagree and say it’ll probably involve some.
Probably our disagreements have more to do with how human-interpretable the resulting systems will be and how well-described they’ll be as having long-term goals. I’m bearish on the first and bullish on the second. (Note that I think they’ll mostly have short-term goals, like humans. But like humans, they’ll have at least some long-term goals.)
My high-level take on this disagreement:
If you train a system to accomplish tasks by taking a large number of small human-like steps, and then you fine-tune the system “a bit” based on actually achieving the goal, I think you will mostly get systems that pursue goals by taking human-like short-term steps but do so more competently. They will do things like avoiding strategies that don’t work in practice for them and having better intuitions about how to select amongst many possible options; they won’t do things like carrying out long-term plans that aren’t made out of human-legible short-term steps.
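As a rough illustration of that training mix (a sketch under made-up assumptions, not anyone’s actual pipeline), here is what “mostly imitate human-like short steps, then fine-tune a bit end-to-end on goal achievement” could look like in PyTorch. The network, the placeholder data, and the step counts are all invented for illustration.

```python
# Sketch: heavy supervised imitation of human-like short steps, plus a small
# amount of end-to-end REINFORCE-style fine-tuning on final goal achievement.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)


def imitation_step(states, human_actions):
    """Supervised learning on individual, human-legible steps."""
    loss = nn.functional.cross_entropy(policy(states), human_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


def finetune_step(states, actions_taken, goal_achieved):
    """A little end-to-end pressure keyed only to whether the goal was met."""
    log_probs = torch.log_softmax(policy(states), dim=-1)
    chosen = log_probs.gather(1, actions_taken.unsqueeze(1)).squeeze(1)
    loss = -(goal_achieved * chosen).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


# The ratio of updates is what captures "train on short steps, then fine-tune
# a bit": many imitation updates, comparatively few end-to-end ones.
for _ in range(10_000):
    states = torch.randn(32, 16)                # placeholder observations
    human_actions = torch.randint(0, 4, (32,))  # placeholder human step labels
    imitation_step(states, human_actions)

for _ in range(100):
    states = torch.randn(32, 16)
    actions_taken = torch.randint(0, 4, (32,))
    finetune_step(states, actions_taken, goal_achieved=1.0)
```

The disagreement above is essentially about how much that second, smaller loop changes the behavior learned in the first.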
I think that:
1. If these systems already understand a lot of human-illegible stuff, they might quickly start using it when fine-tuned, and that may not always result in changes to the legible plans.
2. If these systems start pursuing plans that achieve the goals in ways that humans don’t like, and we then fine-tune based on individual plan steps (while continuing to select for success), then you are likely to train the system to obfuscate its scheme.
3. If you train long enough end-to-end, you will get novel behaviors that can be scarier, although I think the computational cost of doing so may be very large.
And overall I think there are enough threat models that we should be worried, and should try to develop machinery so that we don’t need to do the kind of training that could result in doom. But I also think the most likely scenario is more along the lines of what the OP is imagining, and we can stay significantly safer by e.g. having consensus at ML labs that #2 is likely to be scary and should be considered unacceptable. Ultimately what’s most important is probably understanding how to determine empirically which world you are in.