Full Solomonoff Induction on a hypercomputer absolutely does not just “learn very similar internal functions/models”; it effectively recreates actual human brains.
Full SI on a hypercomputer is equivalent to instantiating a computational multiverse and allowing us to access it. Reading out data samples corresponding to text from that is equivalent to reading out samples of actual text produced by actual human brains in other universes close to ours.
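For concreteness, this is just the standard Solomonoff mixture (stated in the usual textbook form; nothing here is specific to this thread):

$$
M(x) \;=\; \sum_{p\,:\,U(p)\,=\,x*} 2^{-\ell(p)}
$$

where $U$ is a universal monotone Turing machine, the sum runs over programs $p$ whose output begins with $x$, and $\ell(p)$ is the length of $p$ in bits. Programs that compute entire lawful universes containing people who write text are included in that sum, which is the sense in which sampling continuations from $M$ draws on “human brains in other universes.”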
...yes? And this is obviously very, very different from how humans represent things internally?
I mean, for one thing, humans don’t recreate exact simulations of other humans in our brains (even though “predicting other humans” is arguably the high-level cognitive task we are most specced for doing). But even setting that aside, the Solomonoff inductor’s hypothesis also contains a bunch of stuff other than human brains, modeled in full detail—which again is not anything close to how humans model the world around us.
I admit to having some trouble following your (implicit) argument here. Is it that, because a Solomonoff inductor is capable of simulating humans, that makes it “human-like” in some sense relevant to alignment? (Specifically, that doing the plan-sampling thing Rob mentioned in the OP with a Solomonoff inductor will get you a safe result, because it’ll be “humans in other universes” writing the plans? If so, I don’t see how that follows at all; I’m pretty sure having humans somewhere inside of your model doesn’t mean that that part of your model is what ends up generating the high-level plans being sampled by the outer system.)
It really seems to me that if I accept what looks to me like your argument, I’m basically forced to conclude that anything with a simplicity prior (trained on human data) will be aligned, meaning (in turn) the orthogonality thesis is completely false. But… well, I obviously don’t buy that, so I’m puzzled that you seem to be stressing this point (in both this comment and other comments, e.g. this reply to me elsethread):
Note I didn’t actually reply to that quote. Sure, that’s an explicit simplicity prior. However, there’s a large difference under the hood between using an explicit simplicity prior on plan length vs. an implicit simplicity prior on the world and action models which generate plans. The latter is what is more relevant for intrinsic similarity to human thought processes (or not).
(to be clear, my response to this is basically everything I wrote above; this is not meant as its own separate quote-reply block)
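One rough way to formalize the distinction drawn in the quoted passage (the notation here is my own sketch, not anything from the thread): an explicit simplicity prior penalizes the plan’s description length directly, while an implicit one lives in the posterior over world/action models, with plans inheriting simplicity only indirectly:

$$
\pi^{*} \;=\; \arg\max_{\pi}\,\big[\,U(\pi) - \lambda\,\ell(\pi)\,\big]
\qquad\text{vs.}\qquad
P(M \mid D) \,\propto\, P(D \mid M)\,2^{-\ell(M)},\quad \pi \sim P(\pi \mid M, D)
$$

Here $\ell(\cdot)$ is description length, $U(\pi)$ is some plan-scoring function, and $M$ ranges over world/action models; only the second scheme bears on whether the model generating the plans resembles the one humans use.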
you need to first investigate the actual internal representations of the systems in question, and verify that they are isomorphic to the ones humans use.
This has been ongoing for a decade or more (dating at least back to Sparse Coding as an explanation for V1).
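For readers unfamiliar with the reference: the classic sparse coding account of V1 (Olshausen & Field) fits an overcomplete dictionary to natural image patches under a sparsity penalty, and the learned basis functions resemble V1 simple-cell receptive fields. In the standard modern (L1) formulation:

$$
\min_{D,\;\{\alpha_i\}} \;\sum_i \Big( \tfrac{1}{2}\,\lVert x_i - D\alpha_i \rVert_2^2 \;+\; \lambda\,\lVert \alpha_i \rVert_1 \Big)
$$

where the $x_i$ are image patches, $D$ is the (overcomplete) dictionary, the $\alpha_i$ are sparse codes, and $\lambda$ sets the sparsity pressure.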
That’s not what I mean by “internal representations”. I’m referring to the concepts learned by the model, and whether analogues for those concepts exist in human thought-space (and if so, how closely they match each other). It’s not at all clear to me that this occurs by default, and I don’t think the fact that there are some statistical similarities between the high-level encoding approaches being used means that similar concepts end up being converged to. (Which is what is relevant, on my model, when it comes to questions like “if you sample plans from this system, what kinds of plans does it end up outputting, and do they end up being unusually dangerous relative to the kinds of plans humans tend to sample?”)
I agree that sparse coding as an approach seems to have been anticipated by evolution, but your raising this point (and others like it), seemingly as an argument that this makes systems more likely to be aligned by default, feels thematically similar to some of my previous objections—which (roughly) is that you seem to be taking a fairly weak premise (statistical learning models likely have some kind of simplicity prior built in to their representation schema) and running with that premise wayyy further than I think is licensed—running, so far as I can tell, directly to the absolute edge of plausibility, with a conclusion something like “And therefore, these systems will be aligned.” I don’t think the logical leap here has been justified!
I think we are starting to talk past each other, so let me just summarize my position (and what I’m not arguing):
1.) ANNs and BNNs converge in their internal representations, in part because physics only permits a narrow Pareto-efficient solution set, but also because ANNs are literally trained as distillations of BNNs. (More well known/accepted now, but I argued/predicted this well in advance, at least as early as 2015.) A standard way such convergence claims are quantified is sketched just after this list.
2.) Because of 1.), there is no problem with ‘alien thoughts’ based on mindspace geometry. That was just never going to be a problem.
3.) Neither 1 nor 2 is sufficient for alignment by default; both points apply rather obviously to humans, who are clearly not aligned by default with other humans or humanity in general.
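As a concrete illustration of how representational-convergence claims like (1) are typically tested (my own illustration; the thread doesn’t name a method), linear CKA compares the activation geometry of two systems on the same stimuli, with values near 1 indicating closely matched representations:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two activation matrices.

    X: (n_stimuli, n_units_A), Y: (n_stimuli, n_units_B).
    Returns a similarity score in [0, 1]."""
    X = X - X.mean(axis=0)          # column-center each representation
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

# Hypothetical usage: an ANN layer's responses vs. neural recordings to the
# same 200 stimuli (all numbers below are placeholders, not real data).
rng = np.random.default_rng(0)
ann_acts = rng.normal(size=(200, 512))
bnn_acts = ann_acts @ rng.normal(size=(512, 300)) + 0.1 * rng.normal(size=(200, 300))
print(linear_cka(ann_acts, bnn_acts))   # near 1 for near-linearly-related codes
```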
Earlier you said:
A Solomonoff inductor (AIXI) running on a hypercomputer would also “learn from basically the same data” (sensory data produced by the physical universe) with “similar training objectives” (predict the next bit of sensory information) using “universal approximations of Bayesian inference” (a perfect approximation, in this case), and yet it would not be the case that you could then conclude that AIXI “learns very similar internal functions/models”.
I then pointed out that full SI on a hypercomputer would result in recreating entire worlds with human minds, but that was a bit of a tangent. The more relevant point is subtler: AIXI is SI plus some reward function. So all the different possible AIXI agents share exactly the same world model, yet they have different reward functions and thus would generate different plans, and may well end up killing each other or something.
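Spelling that out with the standard Hutter definition (again, textbook material rather than anything specific to this thread): AIXI picks actions by expectimax over future rewards, weighting every candidate environment by the same universal $2^{-\ell(q)}$ prior that Solomonoff induction uses,

$$
a_k \;=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots\, \max_{a_m} \sum_{o_m r_m} \big( r_k + \cdots + r_m \big) \sum_{q\,:\,U(q,\,a_{1:m})\,=\,o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}.
$$

The inner sum over programs $q$ is the shared Solomonoff world model; what differs between the “different possible AIXI agents” in question is which signal gets plugged in as the reward term being maximized.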
So having exactly the same world model is not sufficient for alignment; I’m not arguing that and never would.
But if you train an LLM to distill human thought sequences, those thought sequences can implicitly contain plans, value judgements, or the equivalent. Thus LLMs can naturally align to human values to varying degrees, merely through their training as distillations of human thought. This by itself doesn’t guarantee alignment, of course, but it is a much more hopeful situation to be in, because you can exert a great deal of control through control of the training data.
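A minimal sketch of what “training as a distillation of human thought” cashes out to mechanically (the toy model and sizes below are hypothetical stand-ins, not any particular LLM setup): the only training signal is next-token cross-entropy against human-written sequences, so whatever plans and value judgements are implicit in the curated corpus are exactly what the model is fit to reproduce.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64          # toy sizes, purely illustrative
model = nn.Sequential(                  # stand-in for a real transformer LM
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(tokens):
    """One next-token-prediction step on a batch of human-written token ids."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                                   # (batch, seq-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size),   # imitate the corpus,
                           targets.reshape(-1))              # nothing else
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Placeholder batch; in practice every batch is drawn from the curated corpus,
# which is where the "control through training data" lever lives.
batch = torch.randint(0, vocab_size, (8, 32))
print(train_step(batch))
```

The lever the comment points at is the corpus itself: swap in a differently curated corpus and the very same objective distills a different distribution of implicit plans and judgements.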