one can just meditate on abstract properties of “advanced systems” and come to good conclusions about unknown results “in the limit of ML training”
I think this is a pretty straw characterization of the opposing viewpoint (or at least my own view), which is that intuitions about advanced AI systems should come from a wide variety of empirical domains and sources, and a focus on current-paradigm ML research is overly narrow.
Research and lessons from fields like game theory, economics, computer security, distributed systems, cognitive psychology, business, history, and more seem highly relevant to questions about what advanced AI systems will look like. I think the original Sequences and much of the best agent foundations research is an attempt to synthesize the lessons from these fields into a somewhat unified (but often informal) theory of the effects that intelligent, autonomous systems have on the world around us, through the lens of rationality, reductionism, empiricism, etc.
And whether or not you think they succeeded at that synthesis at all, humans are still the sole example of systems capable of having truly consequential and valuable effects of any kind. So I think it makes sense for the figure of merit for such theories and worldviews to be based on how well they explain these effects, rather than focusing solely or even mostly on how well they explain relatively narrow results about current ML systems.
Context for my original comment: I think that the key thing we want to do is predict the generalization of future neural networks. What will they do in what situations?
For some reason, certain people think that pretraining will produce consequentialist inner optimizers. This is generally grounded out as a highly specific claim about the functions implemented by most low-loss parameterizations of somewhat-unknown future model architectures trained on somewhat-unknown data distributions.
I am in particular thinking about “Playing the training game” reasoning, which is—at its core—an extremely speculative and informal claim about inductive biases / the functions implemented by such parameterizations. If a person (like myself pre-2022) is talking about how AIs “might play the training game”, but also this person doesn’t know about internal invariances in NN space or the compressivity of the parameter-function map (the latter in particular is crucial for reasoning about inductive biases), then I become extremely concerned. To put it mildly.
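For readers unfamiliar with the “internal invariances” mentioned here, a minimal sketch (my own illustration, not an example from the comment): ReLU networks have exact parameter symmetries. Rescaling one hidden unit’s input weights by c > 0 and its outgoing weight by 1/c leaves the computed function untouched, so many distinct parameter settings collapse onto a single function; this many-to-one structure is part of what makes the parameter-function map compressive.

```python
import numpy as np

# A tiny 1-hidden-layer ReLU network: f(x) = w2 @ relu(w1 @ x)
def relu(z):
    return np.maximum(z, 0.0)

def forward(w1, w2, x):
    return w2 @ relu(w1 @ x)

rng = np.random.default_rng(0)
w1 = rng.normal(size=(4, 3))   # hidden-layer weights
w2 = rng.normal(size=(1, 4))   # output weights

# Rescale hidden unit 0: multiply its input weights by c, divide its
# outgoing weight by c. ReLU is positively homogeneous (relu(c*z) = c*relu(z)
# for c > 0), so the overall function is unchanged even though the
# parameters now differ.
c = 3.7
w1_scaled, w2_scaled = w1.copy(), w2.copy()
w1_scaled[0, :] *= c
w2_scaled[:, 0] /= c

x = rng.normal(size=3)
assert np.allclose(forward(w1, w2, x), forward(w1_scaled, w2_scaled, x))
```

Permutation of hidden units is another such symmetry; together they mean counting low-loss parameterizations is very different from counting low-loss functions.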
Given that clarification, which was not present in the original comment:

I disagree on game theory, econ, computer security, business, and history; those seem totally irrelevant for reasoning about inductive biases (and you might agree). However, they seem useful for reasoning about the impact of AI on society as it becomes integrated.

Agree very weakly on distributed systems and moderately on cognitive psychology. (I have in fact written a post on the latter: Humans provide an untapped wealth of evidence about alignment.)
I think that the key thing we want to do is predict the generalization of future neural networks.
It’s not what I want to do, at least. For me, the key thing is to predict the behavior of AGI-level systems. The behavior of NNs-as-trained-today is relevant to this only inasmuch as NNs-as-trained-today will be relevant to future AGI-level systems.
My impression is that you think that pretraining+RLHF (+ maybe some light agency scaffold) is going to get us all the way there, meaning the predictive power of various abstract arguments from other domains is screened off by the inductive biases and other technical mechanistic details of pretraining+RLHF. That would mean we don’t need to bring game theory, economics, computer security, distributed systems, cognitive psychology, business, or history into it – we can just look at how ML systems work and are shaped, and predict everything we want about AGI-level systems from there.
I disagree. I do not think pretraining+RLHF is getting us there. I think we currently don’t know what training/design process would get us to AGI. That means we can’t make closed-form mechanistic arguments about how AGI-level systems will be shaped by their training process, which in turn means the abstract, often-intuitive arguments from other fields do have relevant things to say.
And I’m not seeing a lot of ironclad arguments that favour “pretraining + RLHF is going to get us to AGI” over “pretraining + RLHF is not going to get us to AGI”. The claim that e.g. shard theory generalizes to AGI is at least as tenuous as the claim that it doesn’t.
Flagging that this is one of the main claims which we seem to dispute; I do not concede this point FWIW.
It’s not what I want to do, at least. For me, the key thing is to predict the behavior of AGI-level systems. The behavior of NNs-as-trained-today is relevant to this only inasmuch as NNs-as-trained-today will be relevant to future AGI-level systems.
Thanks for pointing out that distinction!

I actually agree that a lot of reasoning about e.g. the specific pathways by which neural networks trained via SGD will produce consequentialists with catastrophically misaligned goals is often pretty weak and speculative, including in highly-upvoted posts like Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover.

But to expand on my first comment, when I look around and see any kind of large effect on the world, good or bad (e.g. a moral catastrophe, a successful business, strong optimization around a MacGuffin), I can trace the causality through a path that is invariably well-modeled by applying concepts like expected utility theory (or geometric rationality, if you prefer), consequentialism, deception, Goodharting, maximization, etc. to the humans involved.
I read Humans provide an untapped wealth of evidence about alignment and much of your other writing as disagreeing with the (somewhat vague / general) claim that these concepts are really so fundamental, and that you think wielding them to speculate about future AI systems is privileging the hypothesis or otherwise frequently leads people astray. (Roughly accurate summary of your own views?)
Whether or not this accurately describes your actual views, differing answers to the question of how fundamental this family of concepts is, and of what kinds of reasoning mistakes people typically make when applying them to AI, do not really amount to a disagreement about neural networks specifically, or even about AI generally.
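The Goodhart pattern invoked above can be shown numerically. This is a generic sketch of “regressional Goodhart” under assumptions chosen purely for illustration (a proxy score equal to true value plus independent noise), not anyone’s specific model: the harder a selector optimizes the proxy, the more the winner’s proxy score overstates its true value.

```python
import numpy as np

rng = np.random.default_rng(0)

def goodhart_gap(n_options, trials=200):
    """Average (proxy score - true value) of the proxy-maximizing option."""
    gaps = []
    for _ in range(trials):
        true_value = rng.normal(size=n_options)
        proxy = true_value + rng.normal(size=n_options)  # noisy correlate of true value
        winner = np.argmax(proxy)                        # optimize the proxy hard
        gaps.append(proxy[winner] - true_value[winner])
    return float(np.mean(gaps))

# Stronger selection pressure on the proxy (more options to choose among)
# widens the gap between the winner's proxy score and its true value.
for n in (10, 1_000, 100_000):
    print(f"options={n:>6}  avg proxy/true gap={goodhart_gap(n):.2f}")
```

The true value of the winner still rises with selection pressure here; what grows faster is the overstatement, which is the mild, statistical end of the Goodhart spectrum rather than adversarial proxy-gaming.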
game theory, econ, computer security, business, and history
These seem most useful if you expect complex multi-agent training in the future. But even if you don’t, I wouldn’t write them off entirely, since complex-systems theory connects all of them (except computer security) to NN training. For similar reasons, biology, neuroscience, and statistical & condensed-matter physics (& other studies of chaotic systems) start to seem useful.
“But also this person doesn’t know about internal invariances in NN space or the compressivity of the parameter-function map (the latter in particular is crucial for reasoning about inductive biases), then I become extremely concerned”

Have you written about this anywhere?
I’d be interested if you elaborated on that.