I think some people have the misapprehension that one can just meditate on abstract properties of “advanced systems” and come to good conclusions about unknown results “in the limit of ML training”, without much in the way of technical knowledge about actual machine learning results or even a track record in predicting results of training.
For example, several respected thinkers have uttered to me English sentences like “I don’t see what’s educational about watching a line go down for the 50th time” and “Studying modern ML systems to understand future ones seems like studying the neurobiology of flatworms to understand the psychology of aliens.”
I vehemently disagree. I am also concerned about a community which (seems to) foster such sentiment.
The problem is not that you can “just meditate and come to good conclusions”; the problem is that “technical knowledge about actual machine learning results” doesn’t seem like a good path either.
We can extract from a NN trained on modular addition the fact that it performs a Fourier transform, because we know exactly what a Fourier transform is; but I don’t see any clear path to extracting from a neural network the fact that its output is both useful and safe, because we don’t have any practical operationalization of what “useful and safe” means. If we had a solution to the MIRI problem “which program, run on an infinitely large computer, produces an aligned outcome”, we could use that same technical knowledge to study how well a NN approximates this program, and have substantial hope, for example.
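For concreteness, the modular-addition result being referenced can be illustrated outside of any network: addition mod p turns into multiplication of complex phases, which is the structure the trained networks were found to implement. A toy sketch of the underlying identity (my own illustration in numpy, not the original network analysis; the modulus 113 is just a conventional choice from that literature):

```python
import numpy as np

p = 113  # modulus; an arbitrary prime, chosen to echo the usual setup

def phase(x, k=1):
    # Map a residue x to a point on the unit circle at frequency k.
    return np.exp(2j * np.pi * k * x / p)

a, b = 47, 90
# Adding residues corresponds to multiplying their phases:
lhs = phase(a) * phase(b)
rhs = phase((a + b) % p)
assert np.allclose(lhs, rhs)

# Reading the answer back out: correlating the product against every
# candidate phase gives "logits" that peak exactly at (a + b) mod p.
logits = np.real(lhs * np.conj(phase(np.arange(p))))
assert logits.argmax() == (a + b) % p
```

The point of the analogy in the comment above: this interpretation was only possible because “Fourier transform” is a crisply defined target to recognize inside the weights.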
I think the answer to the question of how well realistic NN-like systems with finite compute approximate the results of hypothetical utility maximizers with infinite compute is “not very well at all”.
So the MIRI train of thought, as I understand it, goes something like:
1. You cannot predict the specific moves that a superhuman chess-playing AI will make, but you can predict that the final board state will be one in which the chess-playing AI has won.
2. The chess AI is able to do this because it can accurately predict the likely outcomes of its own actions, and so is able to compute the utility of each of its possible actions and then effectively do an argmax over them to pick the best one, which results in the best outcome according to its utility function.
3. Similarly, you will not be able to predict the specific actions that a “sufficiently powerful” utility maximizer will take, but you can predict that its utility function will be maximized.
4. For most utility functions about things in the real world, the configuration of matter that maximizes that utility function is not a configuration of matter that supports human life.
5. Actual future AI systems that show up in the real world in the next few decades will be “sufficiently powerful” utility maximizers, and so this is a useful and predictive model of what the near future will look like.
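To pin down what “compute the utility of each of its possible actions and then effectively do an argmax over them” means, here is a minimal sketch. Everything in it (the states, the outcome model, the utility function) is a made-up toy, not a claim about any real system:

```python
# Toy sketch of the "predict outcomes, score them, argmax" loop.
# All names and numbers here are illustrative placeholders.

def predicted_outcome(state, action):
    # Stand-in for the agent's world model; deterministic for simplicity.
    return state + action

def utility(outcome):
    # Stand-in utility function: prefer outcomes near 10.
    return -abs(outcome - 10)

def pick_action(state, actions):
    # The argmax step: score every available action by the utility of
    # its predicted outcome, and take the best-scoring one.
    return max(actions, key=lambda a: utility(predicted_outcome(state, a)))

# From state 3, the action 8 leads to outcome 11, the closest to 10:
assert pick_action(state=3, actions=[-1, 0, 2, 5, 8]) == 8
```

The predictive claim in step 3 is then: even without knowing which action gets picked, you know the chosen outcome scores highest under the utility function.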
I think the last few years in ML have made points 2 and 5 look particularly shaky here. For example, the actual architecture of SOTA chess-playing systems doesn’t particularly resemble a cheaper version of the optimal-with-infinite-compute thing of “minimax over the full game tree”, but instead seems to be a different thing: “pile a bunch of situation-specific heuristics on top of each other, and then tweak the heuristics based on how well they do in practice”.
Which, for me at least, suggests that looking at what the optimal-with-infinite-compute thing would do might not be very informative about what actual systems that show up in the next few decades will do.
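For reference, the optimal-with-infinite-compute idealization being contrasted here is plain minimax over the entire game tree. A generic sketch on a deliberately tiny toy game (the game and all names are my own placeholders, chosen only so the recursion terminates):

```python
# Toy game: players alternately remove 1 or 2 objects from a pile of n;
# whoever removes the last object wins. Value is +1 if the maximizing
# player wins, -1 otherwise.
class SubtractionGame:
    def moves(self, n):
        return [m for m in (1, 2) if m <= n]
    def apply(self, n, m):
        return n - m
    def is_terminal(self, n):
        return n == 0
    def terminal_value(self, n, maximizing):
        # The player *to move* at n == 0 lost: the opponent took the last.
        return -1 if maximizing else 1

def minimax(state, maximizing, game):
    # The unbounded-compute idealization: search the whole game tree and
    # back up exact win/loss values. No heuristics, no cutoff depth.
    if game.is_terminal(state):
        return game.terminal_value(state, maximizing)
    vals = [minimax(game.apply(state, m), not maximizing, game)
            for m in game.moves(state)]
    return max(vals) if maximizing else min(vals)

g = SubtractionGame()
# Multiples of 3 are losing positions for the player to move:
assert minimax(3, True, g) == -1
assert minimax(4, True, g) == 1
```

The contrast with real engines is that their strength comes largely from the learned evaluation heuristics, not from anything resembling this exhaustive backup.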
one can just meditate on abstract properties of “advanced systems” and come to good conclusions about unknown results “in the limit of ML training”
I think this is a pretty straw characterization of the opposing viewpoint (or at least my own view), which is that intuitions about advanced AI systems should come from a wide variety of empirical domains and sources, and a focus on current-paradigm ML research is overly narrow.
Research and lessons from fields like game theory, economics, computer security, distributed systems, cognitive psychology, business, history, and more seem highly relevant to questions about what advanced AI systems will look like. I think the original Sequences and much of the best agent foundations research is an attempt to synthesize the lessons from these fields into a somewhat unified (but often informal) theory of the effects that intelligent, autonomous systems have on the world around us, through the lens of rationality, reductionism, empiricism, etc.
And whether or not you think they succeeded at that synthesis at all, humans are still the sole example of systems capable of having truly consequential and valuable effects of any kind. So I think it makes sense for the figure of merit for such theories and worldviews to be based on how well they explain these effects, rather than focusing solely or even mostly on how well they explain relatively narrow results about current ML systems.
Context for my original comment: I think that the key thing we want to do is predict the generalization of future neural networks. What will they do in what situations?
For some reason, certain people think that pretraining will produce consequentialist inner optimizers. This is generally grounded out as a highly specific claim about the functions implemented by most low-loss parameterizations of somewhat-unknown future model architectures trained on somewhat-unknown data distributions.
I am in particular thinking about “Playing the training game” reasoning, which is—at its core—extremely speculative and informal claims about inductive biases / the functions implemented by such parameterizations. If a person (like myself pre-2022) is talking about how AIs “might play the training game”, but also this person doesn’t know about internal invariances in NN space or the compressivity of the parameter-function map (the latter in particular is crucial for reasoning about inductive biases), then I become extremely concerned. To put it mildly.
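For readers unfamiliar with the “internal invariances” being referenced: many distinct points in parameter space implement exactly the same function, which is one reason naive parameter-space intuitions about inductive biases can mislead. A minimal illustration with a two-layer ReLU network (toy random weights of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer ReLU network: f(x) = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

def f(x, W1, W2):
    return W2 @ np.maximum(W1 @ x, 0.0)

# Invariance 1: rescale a hidden unit by c > 0 and undo it downstream
# (relu(c * z) == c * relu(z) for c > 0).
c = 3.7
W1b, W2b = W1.copy(), W2.copy()
W1b[0] *= c
W2b[:, 0] /= c

# Invariance 2: permute the hidden units and their outgoing weights.
perm = [2, 0, 3, 1]
W1c, W2c = W1b[perm], W2b[:, perm]

x = rng.normal(size=3)
assert np.allclose(f(x, W1, W2), f(x, W1b, W2b))
assert np.allclose(f(x, W1, W2), f(x, W1c, W2c))
# Three different parameter settings, one identical function.
```

The “compressivity” point is a separate, stronger claim about how many parameterizations map to simple functions; this snippet only shows the many-to-one character of the parameter-function map.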
Given that clarification, which was not present in the original comment:
I disagree on game theory, econ, computer security, business, and history; those seem totally irrelevant for reasoning about inductive biases (and you might agree). However they seem useful for reasoning about the impact of AI on society as it becomes integrated.
I think that the key thing we want to do is predict the generalization of future neural networks.
It’s not what I want to do, at least. For me, the key thing is to predict the behavior of AGI-level systems. The behavior of NNs-as-trained-today is relevant to this only inasmuch as NNs-as-trained-today will be relevant to future AGI-level systems.
My impression is that you think that pretraining+RLHF (+ maybe some light agency scaffold) is going to get us all the way there, meaning the predictive power of various abstract arguments from other domains is screened off by the inductive biases and other technical mechanistic details of pretraining+RLHF. That would mean we don’t need to bring game theory, economics, computer security, distributed systems, cognitive psychology, business, or history into it – we can just look at how ML systems work and are shaped, and predict everything we want about AGI-level systems from there.
I disagree. I do not think pretraining+RLHF is getting us there. I think we currently don’t know what training/design process would get us to AGI. Which means we can’t make closed-form mechanistic arguments about how AGI-level systems will be shaped by this process, which means the abstract often-intuitive arguments from other fields do have relevant things to say.
And I’m not seeing a lot of ironclad arguments that favour “pretraining + RLHF is going to get us to AGI” over “pretraining + RLHF is not going to get us to AGI”. The claim that e.g. shard theory generalizes to AGI is at least as tenuous as the claim that it doesn’t.
Flagging that this is one of the main claims which we seem to dispute; I do not concede this point FWIW.
It’s not what I want to do, at least. For me, the key thing is to predict the behavior of AGI-level systems. The behavior of NNs-as-trained-today is relevant to this only inasmuch as NNs-as-trained-today will be relevant to future AGI-level systems.
But to expand on my first comment, when I look around and see any kind of large effect on the world, good or bad (e.g. a moral catastrophe, a successful business, strong optimization around a MacGuffin), I can trace the causality through a path that is invariably well-modeled by applying concepts like expected utility theory (or geometric rationality, if you prefer), consequentialism, deception, Goodharting, maximization, etc. to the humans involved.
I read Humans provide an untapped wealth of evidence about alignment and much of your other writing as disagreeing with the (somewhat vague / general) claim that these concepts are really so fundamental, and that you think wielding them to speculate about future AI systems is privileging the hypothesis or otherwise frequently leads people astray. (Roughly accurate summary of your own views?)
Regardless of how well this describes your actual views or not, I think differing answers to the question of how fundamental this family of concepts is, and what kind of reasoning mistakes people typically make when they apply them to AI, is not really a disagreement about neural networks specifically or even AI generally.
game theory, econ, computer security, business, and history
These seem most useful if you expect complex multi-agent training in the future. But even if not, I wouldn’t write them off entirely, given that complex systems theory connects them all (except computer security) to NN training. For similar reasons, biology, neuroscience, and statistical & condensed-matter (& other sorts of chaotic) physics start to seem useful.
I think many people have the misapprehension that one can just meditate on abstract properties of “advanced systems” and come to good conclusions about unknown results “in the limit of ML training”, without much in the way of technical knowledge about actual machine learning results or even a track record in predicting results of training.
Two days ago I argued with someone that GPTs would not be an existential risk no matter how extremely they were scaled up, and it eventually turned out that they took the adjectives “generative pretrained” to be separable descriptors, whereas I took them to refer to a narrow, specific training method.
For example, several respected thinkers have uttered to me English sentences like “I don’t see what’s educational about watching a line go down for the 50th time” and “Studying modern ML systems to understand future ones seems like studying the neurobiology of flatworms to understand the psychology of aliens.”
These statements are not necessarily (at least by themselves; possibly additional context is missing) examples of discussion about what happens “in the limit of ML training”, as these people may be concerned about the limit of ML architecture development rather than simply training.
Agree very weakly on distributed systems and moderately on cognitive psychology. (I have in fact written a post on the latter: Humans provide an untapped wealth of evidence about alignment.)
I’d be interested if you elaborated on that.
Thanks for pointing out that distinction!
I actually agree that a lot of reasoning about e.g. the specific pathways by which neural networks trained via SGD will produce consequentialists with catastrophically misaligned goals is often pretty weak and speculative, including in highly-upvoted posts like Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover.
“But also this person doesn’t know about internal invariances in NN space or the compressivity of the parameter-function map (the latter in particular is crucial for reasoning about inductive biases), then I become extremely concerned”
Have you written about this anywhere?
Why do you vehemently disagree?
(As an obvious corollary, I myself was misguided to hold a similar belief pre-2022.)