I think we are starting to talk past each other, so let me just summarize my position (and what I’m not arguing):
1.) ANNs and BNNs converge in their internal representations, in part because physics permits only a narrow Pareto-efficient solution set, but also because ANNs are literally trained as distillations of BNNs. (This is more widely known/accepted now, but I argued/predicted it well in advance, at least as early as 2015.)
2.) Because of 1.), there is no problem with ‘alien thoughts’ based on mindspace geometry. That was just never going to be a problem.
3.) Neither 1 nor 2 is sufficient for alignment by default—both points apply rather obviously to humans, who are clearly not aligned by default with other humans or with humanity in general.
Earlier you said:
A Solomonoff inductor (AIXI) running on a hypercomputer would also “learn from basically the same data” (sensory data produced by the physical universe) with “similar training objectives” (predict the next bit of sensory information) using “universal approximations of Bayesian inference” (a perfect approximation, in this case), and yet it would not be the case that you could then conclude that AIXI “learns very similar internal functions/models”.
I then pointed out that full SI on a hypercomputer would result in recreating entire worlds containing human minds, but that was a bit of a tangent. The more relevant point is subtler: AIXI is SI plus some reward function. So all possible AIXI agents share exactly the same world model, yet they have different reward functions, and thus would generate different plans and may well end up killing each other.
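To make that concrete, here is a toy sketch of my own (the actions, outcomes, and probabilities are made up for illustration and are not part of AIXI's actual definition): two planners query the identical world model but maximize different reward functions, and so choose different actions.

```python
# Toy illustration: two agents share one world model but have different
# reward functions, so they plan differently. All numbers are invented.

ACTIONS = ["cooperate", "defect"]

def shared_world_model(action):
    """The shared predictive model: P(outcome | action)."""
    if action == "cooperate":
        return {"peace": 0.9, "conflict": 0.1}
    return {"peace": 0.2, "conflict": 0.8}

def plan(reward_fn):
    """Pick the action with the highest expected reward under the shared model."""
    def expected_reward(action):
        return sum(p * reward_fn(outcome)
                   for outcome, p in shared_world_model(action).items())
    return max(ACTIONS, key=expected_reward)

reward_a = lambda outcome: 1.0 if outcome == "peace" else 0.0     # agent A values peace
reward_b = lambda outcome: 1.0 if outcome == "conflict" else 0.0  # agent B values conflict

print(plan(reward_a))  # -> "cooperate"
print(plan(reward_b))  # -> "defect"
```

Same model, different rewards, opposite plans.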
So having exactly the same world model is not sufficient for alignment; I'm not arguing that, and never would.
But if you train an LLM to distill human thought sequences, those thought sequences can implicitly contain plans, value judgements, or their equivalents. Thus LLMs can naturally align to human values to varying degrees, merely through their training as distillations of human thought. This by itself doesn't guarantee alignment, of course, but it is a much more hopeful situation to be in, because you can exert a great deal of control through control of the training data.
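To spell out the mechanism: the distillation here is just the standard autoregressive next-token objective (the notation $D_{\text{human}}$ for the human-generated training distribution is mine):

$$\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{x \sim D_{\text{human}}}\left[\sum_{t} \log p_\theta\!\left(x_t \mid x_{<t}\right)\right]$$

The only free lever in that objective is $D_{\text{human}}$ itself: every plan or value judgement encoded in the sequences sampled from it shapes $p_\theta$, which is why curating the training data is such a strong lever.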