Quintin, in case you are reading this, I just wanna say that the link you give to justify
I think your intuition about how SGD works is wildly wrong. E.g., SGD doesn’t do anything like “randomly sample from the set of all low loss NN parameter configurations”. https://arxiv.org/abs/2110.00683
really doesn’t do nearly enough to justify your bold “wildly wrong” claim. First of all, it’s common for papers to overclaim, and this seems like the sort of paper that could turn out to be basically just flat wrong. (I lack the expertise to decide for myself; it would probably take me many hours of reading the paper and talking to people.) Secondly, even if I assume the paper is correct, it just shows that the simplicity bias of SGD on NNs is different than some people think—it is weighted towards broad basins / connected regions. It’s still randomly sampling from the set of all low loss NN parameter configurations, but with a different bias/prior. (Unless you can argue that this specific different bias leads to the consequences/conclusions you like, and in particular leads to doom being much less likely. Maybe you can; I’d like to see that.)
SGD has a strong inherent simplicity bias, even without weight regularization, and this is fairly well known in the DL literature (I could probably find hundreds of examples if I had the time—I do not). By SGD I specifically mean SGD variants that don’t use a 2nd-order approximation (as Adam does). There are many papers finding that approximately 2nd-order, variance-adjusted optimizers like Adam have various generalization/overfitting issues compared to SGD; this comes up over and over, such that it’s fairly common to use some additional regularization with Adam.
It’s also pretty intuitively obvious why SGD has a strong simplicity prior if you just think through some simple examples: SGD doesn’t move in whatever direction minimizes loss outright, it moves in the parsimonious direction which minimizes loss per unit of weight distance (moved away from the init). 2nd-order optimizers like Adam can move more directly in the direction of lower loss.
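To make that contrast concrete, here is a minimal numpy sketch of the two update rules being compared (illustrative only; the function names and hyperparameters are arbitrary defaults, not anyone’s actual training setup). The plain SGD step is just the raw gradient, i.e. the steepest-descent direction per unit of Euclidean movement in weight space, whereas Adam rescales each coordinate by a running second-moment estimate, so its step direction generally differs from the raw gradient direction.

```python
import numpy as np

def sgd_step(w, grad, lr=1e-2):
    # Plain SGD: the step is proportional to the raw gradient, i.e. the
    # direction that reduces loss fastest per unit Euclidean distance moved
    # in weight space.
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: each coordinate is rescaled by a running estimate of the
    # gradient's second moment, so the step direction generally differs from
    # the raw gradient direction and can head more directly toward lower loss.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```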
Empirically, the inductive bias that you get when you train with SGD and similar optimisers is in fact quite similar to the inductive bias that you would get if you were to repeatedly re-initialise a neural network until you randomly get a set of weights that yield a low loss. Which optimiser you use does have an effect as well, but this is very small by comparison. See this paper.
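To make the two procedures being compared concrete, here is a toy numpy sketch (a linear model on made-up data; all names and numbers are arbitrary). The empirical claim in the cited paper concerns realistic NNs, where the set of low-loss parameter configurations is vast and the comparison is non-trivial; this sketch only illustrates what the two procedures are, with full-batch gradient descent standing in for SGD to keep it short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up, linearly separable toy data; a linear model keeps the sketch short.
X = rng.normal(size=(16, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def predict(w, X):
    return 1.0 / (1.0 + np.exp(-X @ w))

def loss(w):
    p = predict(w, X)
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

# Procedure A ("guess and check"): re-draw weights from the init distribution
# until the training loss happens to be low. The result is a sample from the
# init prior, conditioned on fitting the training data.
w_sampled = rng.normal(scale=3.0, size=2)
while loss(w_sampled) > 0.4:
    w_sampled = rng.normal(scale=3.0, size=2)

# Procedure B: ordinary (full-batch) gradient descent from one random init.
w_sgd = rng.normal(scale=3.0, size=2)
for _ in range(2000):
    grad = X.T @ (predict(w_sgd, X) - y) / len(y)
    w_sgd -= 0.5 * grad

# The claim under discussion is that, for realistic NNs, the *functions*
# found by procedure B are distributed much like those found by procedure A.
X_test = rng.normal(size=(5, 2))
print(predict(w_sampled, X_test).round(2))
print(predict(w_sgd, X_test).round(2))
```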
Yes. (Note that “randomly sample from the set of all low loss NN parameter configurations” goes hand in hand with there being a bias towards simplicity; it’s not a contradiction. Is that maybe what’s going on here—people misinterpreted Bensinger as somehow not realizing simpler configurations are more likely?)
My prior is that DL has a great amount of weird domain knowledge which is mysterious to those who haven’t spent years studying it, and years studying DL correlates with strong disagreement with the sequences/MIRI positions in many fundamentals. I trace all this back to EY over-updating on ev psych and not reading enough neuroscience and early DL.
So anyway, a sentence like “randomly sample from the set of all low loss NN parameter configurations” is not one I would use or expect a DL-insider to use and sounds more like something a MIRI/LW person would say—in part yes because I don’t generally expect MIRI/LW folks to be especially aware of the intrinsic SGD simplicity prior. The more correct statement is “randomly sample from the set of all simple low loss configs” or similar.
But it’s also not quite clear to me how relevant that subpoint is, just sharing my impression.
IMO this seems like a strawman. When talking to MIRI people it’s pretty clear they have thought a good amount about the inductive biases of SGD, including an associated simplicity prior.
Sure it will clearly be a strawman for some individuals—the point of my comment is to explain how someone like myself could potentially misinterpret Bensinger and why. (As I don’t know him very well, my brain models him as a generic MIRI/LW type)
I want to revisit what Rob actually wrote:
If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like “invent fast-running whole-brain emulation”, then hitting a button to execute the plan would kill all humans, with very high probability.
(emphasis mine)
That sounds a whole lot like it’s invoking a simplicity prior to me!
Note I didn’t actually reply to that quote. Sure, that’s an explicit simplicity prior. However, there’s a large difference under the hood between using an explicit simplicity prior on plan length vs an implicit simplicity prior on the world and action models which generate plans. The latter is what is more relevant for intrinsic similarity to human thought processes (or not).
There are more papers and math in this broad vein (e.g. Mingard on SGD, singular learning theory), and I roughly buy the main thrust of their conclusions[1].
However, I think “randomly sample from the space of solutions with low combined complexity & calculation cost” doesn’t actually help us that much over a pure “randomly sample” when it comes to alignment.
It could mean that the relation between your network’s learned goals and the loss function is more straightforward than what you get with evolution => human hardcoded brain stem => human goals, since the latter likely has a far weaker simplicity bias in the first step than the network training does. But the second step, a human baby training on their brain stem loss signal, seems to remain a useful reference point for the amount of messiness we can expect. And it does not seem to me to be a comforting one. I, for one, don’t consider getting excellent visual cortex prediction scores a central terminal goal of mine.
Though I remain unsure of what to make of the specific paper Quintin cites, which advances some more specific claims inside this broad category, and is based on results from a toy model with weird, binary NNs using weird, non-standard activation functions.
OHHH I think there’s just an error of reading comprehension/charitability here. “Randomly sample” doesn’t mean without a simplicity bias—obviously there’s a bias towards simplicity, that just falls out of the math pretty much. I think Quintin (and maybe you too Lucius and Jacob) were probably just misreading Rob Bensinger’s claim as implying something he didn’t mean to imply. (I bet if we ask Rob “when you said randomly sample, did you mean there isn’t a bias towards simplicity?” he’ll say “no”)
I didn’t think Rob was necessarily implying that. I just tried to give some context to Quintin’s objection.
I feel like there’s a significant distance between what’s being said formally versus the conclusions being drawn. From Rob:
If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language)
From you:
the simplicity bias of SGD on NNs is different than some people think—it is weighted towards broad basins / connected regions. It’s still randomly sampling from the set of all low loss NN parameter configurations, but with a different bias/prior.
The issue is that literally any plan generation / NN training process can be described in either manner, regardless of the actual prior involved. In order to make the doom conclusion actually go through, arguments should make stronger claims about the priors involved, and how they differ from those of the human learning process.
It’s not clear to me what specific priors Rob has in mind for the “random plan” sampling process, unless by “extant formal language” he literally means “formal language that currently exists right now”, in which case:
Why should this be a good description of what SGD does?
Why should this be a better description of what SGD does, as compared to what human learning does?
I think I am comfortable calling this intuition “wildly wrong”, and it seems correct to say that the cited paper is evidence against such a prior, since that paper suggests a geometry-based inductive bias stemming from the parameter-wise clustering of solutions, which I doubt the solution spaces of current formal languages reflect in a similar manner to the parameter space of current NNs.
Properly arguing that biological neurons and artificial NNs converge in their inductive biases would be an entire post, though I do think there’s quite a bit of evidence in that direction, some of which I cited in my Twitter thread. Maybe I’ll start writing that post, though I currently have lots of other stuff to do.
Although, I expect my conclusion would be something like “there’s a bunch of evidence and argument both ways, with IMO a small/moderate advantage for the ‘convergence’ side, but no extreme position is warranted, and the implications for alignment are murky anyways”, so maybe I shouldn’t bother? What do you think?
In order to make the doom conclusion actually go through, arguments should make stronger claims about the priors involved, and how they differ from those of the human learning process.
Isn’t it enough that they do differ? Why do we need to be able to accurately/precisely characterize the nature of the difference, to conclude that an arbitrary inductive bias different from our own is unlikely to sample the same kinds of plans we do?
That’s not at all clear to me. Inductive biases clearly differ between humans, yet we are not all terminally misaligned with each other. E.g., split-brain patients are not all weird value aliens, despite a significant difference in architecture. Also, training on human-originated data causes networks to learn human-like inductive biases (at least somewhat).
Thanks for weighing in Quintin! I think I basically agree with dxu here. I think this discussion shows that Rob should probably rephrase his argument as something like “When humans make plans, the distribution they sample from has all sorts of unique and interesting properties that arise from various features of human biology and culture and the interaction between them. Big artificial neural nets will lack these features, so the distribution they draw from will be significantly different—much bigger than the difference between any two humans, for example. This is reason to expect doom, because of instrumental convergence...”
I take your point that the differences between humans seem… not so large… though actually I guess a lot of people would argue the opposite and say that many humans are indeed terminally misaligned with many other humans.
I also take the point about human-originated data hopefully instilling human-like inductive biases.
But IMO the burden of proof is firmly on the side of whoever wants to say that therefore things will probably be fine, rather than the person who is running around screaming expecting doom. The AIs we are building are going to be more alien than literal aliens, it seems. (The ray of hope here is the massive training on human-generated data, but again, I’d want to see this more carefully argued here, otherwise it seems like just wishful thinking.)
ETA: Yes, I for one would be quite interested to read a post by you about why biological neurons and artificial NNs should be expected to converge in their inductive biases, with discussion of their implications for alignment.
There are differences between ANNs and BNNs but they don’t matter that much—LLMs converge to learn the same internal representations as linguistic cortex anyway.
When humans make plans, the distribution they sample from has all sorts of unique and interesting properties that arise from various features of human biology and culture and the interaction between them. Big artificial neural nets will lack these features, so the distribution they draw from will be significantly different
LLMs and human brains learn from basically the same data with similar training objectives powered by universal approximations of bayesian inference and thus learn very similar internal functions/models.
Moravec was absolutely correct to use the term ‘mind children’ and all that implies. I outlined the case for why the human brain and DL systems are essentially the same way back in 2015, and every year since we have accumulated further confirming evidence. The closely related scaling hypothesis—predicted in that post—was extensively tested by OpenAI and worked at least as well as I predicted/expected, taking us to the brink of AGI.
LLMs:
learn very much like the cortex, converging to the same internal representations
acquire the same human cognitive biases and limitations
predictably develop human-like cognitive abilities with scale
are extremely human, not alien at all
That doesn’t make them automatically safe, but they are not potentially unsafe because they are alien.
LLMs and human brains learn from basically the same data with similar training objectives powered by universal approximations of bayesian inference and thus learn very similar internal functions/models.
This argument proves too much. A Solomonoff inductor (AIXI) running on a hypercomputer would also “learn from basically the same data” (sensory data produced by the physical universe) with “similar training objectives” (predict the next bit of sensory information) using “universal approximations of Bayesian inference” (a perfect approximation, in this case), and yet it would not be the case that you could then conclude that AIXI “learns very similar internal functions/models”. (In fact, the given example of AIXI is much closer to Rob’s initial description of “sampling from the space of possible plans, weighted by length”!)
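(For reference, the length-weighting that both Rob’s “random plan” framing and the AIXI example appeal to is the standard Solomonoff prior; the following is just the textbook definition, with U a universal monotone machine and \ell(p) the length of program p.)

```latex
% Solomonoff prior / universal semimeasure: a string x gets the total weight
% of all (minimal) programs p whose output begins with x, each program
% weighted by 2^(-length). Shorter programs dominate the mixture.
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}
```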
In order to properly argue this, you need to talk about more than just training objectives and approximations to Bayes; you need to first investigate the actual internal representations of the systems in question, and verify that they are isomorphic to the ones humans use. Currently, I’m not aware of any investigations into this that I’d consider satisfactory.
(Note here that I’ve skimmed the papers you cite in your linked posts, and for most of them it seems to me either (a) they don’t make the kinds of claims you’d need to establish a strong conclusion of “therefore, AI systems think like humans”, or (b) they do make such claims, but then the described investigation doesn’t justify those claims.)
Full Solomonoff induction on a hypercomputer absolutely does not just “learn very similar internal functions/models”; it effectively recreates actual human brains.
Full SI on a hypercomputer is equivalent to instantiating a computational multiverse and allowing us to access it. Reading out data samples corresponding to text from that is equivalent to reading out samples of actual text produced by actual human brains in other universes close to ours.
you need to first investigate the actual internal representations of the systems in question, and verify that they are isomorphic to the ones humans use.
This has been ongoing for over a decade or more (dating at least back to Sparse Coding as an explanation for V1).
But I will agree the bigger LLMs are now in a somewhat different territory—more like human cortices trained for millennia, perhaps ten millennia for GPT4.
Full Solomonoff induction on a hypercomputer absolutely does not just “learn very similar internal functions/models”; it effectively recreates actual human brains.
Full SI on a hypercomputer is equivalent to instantiating a computational multiverse and allowing us to access it. Reading out data samples corresponding to text from that is equivalent to reading out samples of actual text produced by actual human brains in other universes close to ours.
...yes? And this is obviously very, very different from how humans represent things internally?
I mean, for one thing, humans don’t recreate exact simulations of other humans in our brains (even though “predicting other humans” is arguably the high-level cognitive task we are most specced for doing). But even setting that aside, the Solomonoff inductor’s hypothesis also contains a bunch of stuff other than human brains, modeled in full detail—which again is not anything close to how humans model the world around us.
I admit to having some trouble following your (implicit) argument here. Is it that, because a Solomonoff inductor is capable of simulating humans, that makes it “human-like” in some sense relevant to alignment? (Specifically, that doing the plan-sampling thing Rob mentioned in the OP with a Solomonoff inductor will get you a safe result, because it’ll be “humans in other universes” writing the plans? If so, I don’t see how that follows at all; I’m pretty sure having humans somewhere inside of your model doesn’t mean that that part of your model is what ends up generating the high-level plans being sampled by the outer system.)
It really seems to me that if I accept what looks to me like your argument, I’m basically forced to conclude that anything with a simplicity prior (trained on human data) will be aligned, meaning (in turn) the orthogonality thesis is completely false. But… well, I obviously don’t buy that, so I’m puzzled that you seem to be stressing this point (in both this comment and other comments, e.g. this reply to me elsethread):
Note I didn’t actually reply to that quote. Sure, that’s an explicit simplicity prior. However, there’s a large difference under the hood between using an explicit simplicity prior on plan length vs an implicit simplicity prior on the world and action models which generate plans. The latter is what is more relevant for intrinsic similarity to human thought processes (or not).
(to be clear, my response to this is basically everything I wrote above; this is not meant as its own separate quote-reply block)
you need to first investigate the actual internal representations of the systems in question, and verify that they are isomorphic to the ones humans use.
This has been ongoing for over a decade or more (dating at least back to Sparse Coding as an explanation for V1).
That’s not what I mean by “internal representations”. I’m referring to the concepts learned by the model, and whether analogues for those concepts exist in human thought-space (and if so, how closely they match each other). It’s not at all clear to me that this occurs by default, and I don’t think the fact that there are some statistical similarities between the high-level encoding approaches being used means that similar concepts end up being converged to. (Which is what is relevant, on my model, when it comes to questions like “if you sample plans from this system, what kinds of plans does it end up outputting, and do they end up being unusually dangerous relative to the kinds of plans humans tend to sample?”)
I agree that sparse coding as an approach seems to have been anticipated by evolution, but your raising this point (and others like it), seemingly as an argument that this makes systems more likely to be aligned by default, feels thematically similar to some of my previous objections—which (roughly) is that you seem to be taking a fairly weak premise (statistical learning models likely have some kind of simplicity prior built in to their representation schema) and running with that premise wayyy further than I think is licensed—running, so far as I can tell, directly to the absolute edge of plausibility, with a conclusion something like “And therefore, these systems will be aligned.” I don’t think the logical leap here has been justified!
I think we are starting to talk past each other, so let me just summarize my position (and what I’m not arguing):
1.) ANNs and BNNs converge in their internal representations, in part because of how physics only permits a narrow Pareto-efficient solution set, but also because ANNs are literally trained as distillations of BNNs. (More well known/accepted now, but I argued/predicted this well in advance (at least as early as 2015)).
2.) Because of 1.), there is no problem with ‘alien thoughts’ based on mindspace geometry. That was just never going to be a problem.
3.) Neither 1 nor 2 is sufficient for alignment by default—both points apply rather obviously to humans, who are clearly not aligned by default with other humans or humanity in general.
Earlier you said:
A Solomonoff inductor (AIXI) running on a hypercomputer would also “learn from basically the same data” (sensory data produced by the physical universe) with “similar training objectives” (predict the next bit of sensory information) using “universal approximations of Bayesian inference” (a perfect approximation, in this case), and yet it would not be the case that you could then conclude that AIXI “learns very similar internal functions/models”.
I then pointed out that full SI on a hypercomputer would result in recreating entire worlds with human minds, but that was a bit of a tangent. The more relevant point is more nuanced: AIXI is SI plus some reward function. So all different possible AIXI agents share the exact same world model, yet they have different reward functions and thus would generate different plans and may well end up killing each other or something.
So having exactly the same world model is not sufficient for alignment—I’m not and would never argue that.
But if you train an LLM to distill human thought sequences, those thought sequences can implicitly contain plans, value judgements, or the equivalents. Thus LLMs can naturally align to human values to varying degrees, merely through their training as distillations of human thought. This of course by itself doesn’t guarantee alignment, but it is a much more hopeful situation to be in, because you can exert a great deal of control through control of the training data.
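As a minimal sketch of the sense in which next-token training acts as a distillation of human-generated text (hypothetical tiny PyTorch model, random tokens as a stand-in for real data, purely illustrative): the training loss is the cross-entropy between the model’s next-token distribution and the tokens humans actually wrote, so minimising it over real text pushes the model’s predictive distribution toward the empirical distribution of human thought-sequences.

```python
import torch
import torch.nn.functional as F

vocab, dim = 1000, 64
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab, dim),  # token ids -> vectors
    torch.nn.Linear(dim, vocab),     # vectors -> next-token logits
)

# Stand-in for human-written text; in real training these would be tokens
# of actual human-generated documents.
tokens = torch.randint(0, vocab, (32, 16))

logits = model(tokens[:, :-1])  # predict the next token at every position
# Cross-entropy against the tokens humans actually produced: minimising this
# minimises the KL divergence from the empirical text distribution to the
# model's predictive distribution, context by context.
loss = F.cross_entropy(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))
loss.backward()
```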
It’s all relative. “Are extremely human, not alien at all” --> Are you seriously saying that e.g. if and when we one day encounter aliens on another planet, the kind of aliens smart enough to build an industrial civilization, they’ll be more alien than LLMs? (Well, obviously they won’t have been trained on the human Internet. So let’s imagine we took a whole bunch of them as children and imported them to Earth and raised them in some crazy orphanage where they were forced to watch TV and read the internet and play various video games all day.)
Because I instead say that all your arguments about similar learning algorithms, similar cognitive biases, etc. will apply even more strongly (in expectation) to these hypothetical aliens capable of building industrial civilization. So the basic relationship of humans<aliens<LLMs will still hold; LLMs will still be more alien than aliens.
Are you seriously saying that e.g. if and when we one day encounter aliens on another planet, the kind of aliens smart enough to build an industrial civilization, they’ll be more alien than LLMs?
Yes! Obviously more alien than our LLMs. LLMs are distillations of aggregated human linguistic cortices. Anytime you train one network on the output of others, you clone/distill the original(s)! The algorithmic content of NNs is determined by the training data, and the data here in question is human thought.
This was always the way it was going to be, this was all predicted long in advance by the systems/cybernetics futurists like Moravec—AI was/will be our mind children.
EY misled many people here with the bad “human mindspace is narrow” meme. I mostly agree with Quintin’s recent takedown, but I of course also objected way back when.
Nice to see us getting down to cruxes.
I really don’t buy this. To be clear: Your answer is Yes, including in the variant case I proposed in parentheses, where the aliens were taken as children and raised in a crazy Earth orphanage?
I didn’t notice the part in parentheses at all until just now—added in edit? The edit really doesn’t agree with the original question to me.
If you took alien children and raised them as earthlings you’d get mostly earthlings in alien bodies, given some assumptions: that they had roughly similar-sized brains and reasonably parallel evolution. Something like this has happened historically—when uncontacted tribal children are raised in a distant advanced civ, for example. Western culture—WEIRD—has so pervasively colonized and conquered much of the memetic landscape that we have forgotten how diverse human mindspace can be (in some sense it could be WEIRD that was the alien invasion).
Also, more locally on earth: Japanese culture is somewhat alien compared to Western English/American culture. I expect actual alien culture to be more alien.
I’m pretty sure I didn’t edit it, I think that was there from the beginning.
OK, cool. So then you agree that LLMs will be more alien than aliens-who-were-raised-on-Earth-in-crazy-internet-text-pretraining-orphanage?
I don’t necessarily agree—as I don’t consider either to be very alien. Minds are software memetic constructs, so you are just comparing human software running on GPUs vs human software running on alien brains. How different that is, and which is more different from human software running on ape brains, now depends on many cumbersome details.
How do we know that the human brain and LLMs converge to the same internal representations—is that addressed in your earlier write-up?
Yes—It was already known for vision back in that 2015 post, and in my later posts I revisit the issue here and later here.