I haven’t seen an answer by Eliezer. But I can go through the first post and highlight what I think is wrong. (And I would be unsurprised if Eliezer agreed with much of it.)
AIs are white boxes
We can see literally every neuron, but we have little clue what any of them are doing.
Black box methods are sufficient for human alignment
Humans are aligned to human values because humans have human genes. Also, individual humans can’t replicate themselves, which makes taking over the world much harder.
most people do assimilate the values of their culture pretty well, and most people are reasonably pro-social.
Humans have specific genes for absorbing cultural values, at least within a range of human cultures. There are various alien values that humans won’t absorb.
Gradient descent is very powerful because, unlike a black box method, it’s almost impossible to trick.
Hmm. I don’t think the case for that is convincing.
If the AI is secretly planning to kill you, gradient descent will notice this and make it less likely to do that in the future, because the neural circuitry needed to make the secret murder plot can be dismantled and reconfigured into circuits that directly improve performance.
Current AI techniques involve giving the AI loads of neurons, so having a few neurons that aren’t being used isn’t a problem.
Also, it’s possible that the same neurons that sometimes plot to kill you are also sometimes used to predict plots in murder mystery books.
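To see why spare capacity matters here, a toy sketch (assuming a made-up two-weight model, not a claim about real training runs): gradient descent puts exactly zero pressure on any parameter that doesn’t influence the loss, so idle circuitry isn’t automatically dismantled.

```python
# Toy sketch: gradient descent exerts no pressure on parameters that
# don't influence the loss. Hypothetical example, not a model of real LLM training.
import torch

x = torch.randn(8, 4)  # fake input batch
y = torch.randn(8, 1)  # fake targets

w_used = torch.randn(4, 1, requires_grad=True)    # contributes to the prediction
w_unused = torch.randn(4, 1, requires_grad=True)  # "idle circuitry": never touches the loss

pred = x @ w_used                # w_unused plays no role here
loss = ((pred - y) ** 2).mean()
loss.backward()

print(w_used.grad)    # non-zero: this weight gets reshaped by training
print(w_unused.grad)  # None: gradient descent leaves it exactly alone
```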
In general, gradient descent has a strong tendency to favor the simplest solution which performs well, and secret murder plots aren’t actively useful for improving performance on the tasks humans will actually optimize AIs to perform.
If you give the AI lots of tasks, it’s possible that the simplest solution is some kind of internal general optimizer.
Either you have an AI that is smart and general and can try new things that are substantially different from anything it has done before (in which case the new things can include murder plots), or you have an AI that’s dumb and is only repeating small variations on its training data.
We can run large numbers of experiments to find the most effective interventions
Current techniques are based on experiments/gradient descent. This works so long as the AIs can’t break out of the sandbox, or realize they are being experimented on and plot to trick the experimenters. You can’t keep an ASI in a little sandbox and run gradient descent on it.
Our reward circuitry reliably imprints a set of motivational invariants into the psychology of every human: we have empathy for friends and acquaintances, we have parental instincts, we want revenge when others harm us, etc.
Sure. And we use contraception, which shows that evolution’s alignment of humans failed somewhere, at least a bit.
Also, evolution had a long time to test and refine on humans who didn’t have the tools to mess with evolution, or even understand it.
Even in the pessimistic scenario where AIs stop obeying our every command, they will still protect us and improve our welfare, because they will have learned an ethical code very early in training.
No one is claiming the ASI won’t understand human values; they are saying it won’t care.
Is that evidence that LLMs actually care about morality? Not really. It’s evidence that they are good at predicting humans. Get them predicting an ethics professor and they will answer morality questions. Get them predicting Hitler and they will say less moral things.
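A quick way to see this persona-conditioning, as a sketch (assuming the Hugging Face transformers library and the small gpt2 model purely for illustration, not a rigorous experiment): the same weights produce differently flavoured “moral” outputs depending only on who the prompt asks them to predict.

```python
# Sketch: one model, two personas. The moral tone of the output tracks
# whoever the prompt sets the model up to predict, not a stable set of values.
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")
set_seed(0)

prompts = [
    "An ethics professor is asked whether lying is acceptable. She replies:",
    "A ruthless con artist is asked whether lying is acceptable. He replies:",
]

for prompt in prompts:
    out = generator(prompt, max_new_tokens=40, do_sample=True)[0]["generated_text"]
    print(out)
    print("---")
```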
And of course, there is a big difference between an AI that says “be nice to people” and an AI that is nice to people. The former can be trivially achieved by hard-coding a list of platitudes for the AI to parrot back. The latter requires the AI to make decisions like “are unborn babies people?”.
Imagine some robot running around. You have an LLM that says nice-sounding things when posed ethical dilemmas. You still need some system that turns the raw camera input into a text description, and the nice-sounding output into actual actions.
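A minimal sketch of that gap, using entirely hypothetical placeholder functions (describe_scene, query_llm, choose_action are not any real API): the nice-sounding text lives in the middle step, and whether the robot actually behaves nicely depends on the perception and action steps wrapped around it.

```python
# Hypothetical pipeline sketch: the LLM only ever sees and emits text.
# Whether anything "nice" happens in the world depends on the two
# translation steps, which do most of the real work.

def describe_scene(camera_frame) -> str:
    """Placeholder: turn raw pixels into a text description."""
    raise NotImplementedError

def query_llm(description: str) -> str:
    """Placeholder: ask the LLM what the ethical thing to do is."""
    raise NotImplementedError

def choose_action(advice: str):
    """Placeholder: turn nice-sounding text into motor commands."""
    raise NotImplementedError

def control_loop(camera_frame):
    description = describe_scene(camera_frame)  # vision -> text
    advice = query_llm(description)             # text -> nice-sounding text
    return choose_action(advice)                # text -> actual behaviour
```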