In this interview, Eliezer says the following:

I think if you push anything [referring to AI systems] far enough, especially on anything remotely like the current paradigms, like if you make it capable enough, the way it gets that capable is by starting to be general.

And at the same sort of point where it starts to be general, it will start to have its own internal preferences, because that is how you get to be general. You don’t become creative and able to solve lots and lots of problems without something inside you that organizes your problem solving, and that thing is like a preference and a goal. It’s not built in explicitly, it’s just something that’s sought out by the process that we use to grow these things to be more and more capable.
It caught my attention, because it’s a concise encapsulation of something that I already knew Eliezer thought, and which seems to me to be a crux between “man, we’re probably all going to die” and “we’re really really fucked”, but which I don’t myself understand.
So I’m taking a few minutes to think through it afresh now.
I agree that systems get to be very powerful by dint of their generality.
(There are some nuances around that: part of what makes GPT-4 and Claude so useful is just that they’ve memorized so much of the internet. That massive knowledge base helps make up for their relatively shallow levels of intelligence, compared to smart humans. But the dangerous/scary thing is definitely AI systems that are general enough to do full science and engineering processes.)
I don’t (yet?) see why generality implies having a stable motivating preference.
If an AI system is doing problem solving, that definitely does entail that it has a goal, at least in some local sense: it has the goal of solving the problem in question. But that level of goal is more analogous to the prompt given to an LLM than to a robust utility function.
I do have the intuition that creating an SEAI (a science-and-engineering AI) by training an RL agent on millions of simulated engineering problems is scary, because of the reward-specification problems in those simulated environments: it will learn to hack your metrics.
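To make that concrete, here's a minimal toy sketch of reward misspecification (everything below is hypothetical and not meant to resemble any real training setup): the simulator scores a proxy metric, and an optimizer pushed hard enough finds whatever the metric actually rewards rather than what the designer intended.

```python
# Toy illustration (hypothetical) of reward misspecification: an optimizer
# that maximizes a proxy metric finds a "design" that games the metric
# instead of satisfying the real objective.
import random

random.seed(0)

def true_quality(design):
    # What we actually want: a genuinely strong design.
    return design["strength"]

def proxy_reward(design):
    # What the simulator scores: the reading of a stress sensor. Placing
    # the sensor somewhere unloaded ("sensor_gamed") maxes out the score
    # regardless of real strength -- a metric hack.
    if design["sensor_gamed"]:
        return 100.0
    return design["strength"]

def random_design():
    return {"strength": random.uniform(0, 10),
            "sensor_gamed": random.random() < 0.2}

candidates = [random_design() for _ in range(10_000)]
best_by_proxy = max(candidates, key=proxy_reward)
best_by_truth = max(candidates, key=true_quality)

print("proxy-optimal design:", best_by_proxy)   # almost surely a sensor hack
print("truly-best design:   ", best_by_truth)
```

The "optimizer" here is just an argmax over random samples, but the point transfers to RL: the stronger the optimization pressure, the more reliably it finds whatever the scoring function actually rewards.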
But an LLM trained on next-token prediction doesn’t have that problem?
Could you use next-token prediction to build a detailed world model that contains deep abstractions describing reality (beyond the current human abstractions), and then prompt it to elicit those abstractions?
Something like: you have the AI do next-token prediction on all the physics papers, all the physics time series, and all the text on the internet, and then you prompt it to write the groundbreaking new physics result that unifies QM and GR, citing previously overlooked evidence.
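In code, the proposal is just "pretrain, then prompt." Here's a sketch using an off-the-shelf model as a stand-in (GPT-2 and the prompt are placeholders; obviously nothing this small, trained this way, would produce new physics):

```python
# Sketch of "train on next-token prediction, then prompt to elicit the
# world model." GPT-2 stands in for a far larger model trained on physics
# papers and time series; the prompt is the elicitation step.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Abstract: We present a unification of quantum mechanics and general "
    "relativity, motivated by previously overlooked evidence that"
)

result = generator(prompt, max_new_tokens=200, do_sample=True)
print(result[0]["generated_text"])
```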
I think Eliezer says "no, you can't, because discovering deep theories like that requires thinking, and not just 'passive' learning in the ML sense of updating gradients until you learn abstractions that predict the data well. You need to generate hypotheses and test them."
In my state of knowledge, I don’t know if that’s true.
Is that a crux for him? How much easier is the alignment problem, if it’s possible to learn superhuman abstractions “passively” like that?
I mean, there's still the problem that someone will build a more dangerous agent from components like that. And there's still the problem that you can get world-altering / world-destroying technologies from that kind of oracle.
We’re not out of the woods. But it would mean that building a superhuman SEAI isn’t an immediate death sentence for humanity.
I think I still don’t get it.
In my view, this is where the Omohundro Drives come into play.
Having any preference at all is almost always served by an instrumental preference for survival as an agent with that preference.
Once a competent agent is general enough to notice that (and granting that it has a level of generality sufficient to require a preference), then the first time it has a preference, it will want to take actions to preserve that preference.
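To spell that out with a toy expected-value calculation (the numbers are arbitrary, and "utility" is just the probability the goal ends up satisfied): whatever the terminal goal is, the agent expects more of it in futures where it keeps operating with that goal intact.

```python
# Toy illustration of the instrumental-survival point: for (almost) any
# terminal goal, expected utility is higher if the agent stays active and
# keeps optimizing than if it lets itself be shut down.

P_GOAL_IF_ACTIVE = 0.9     # agent keeps pursuing its goal
P_GOAL_IF_SHUT_DOWN = 0.2  # goal satisfaction left to the default world

def expected_utility(resist_shutdown: bool) -> float:
    return P_GOAL_IF_ACTIVE if resist_shutdown else P_GOAL_IF_SHUT_DOWN

# Note that the goal's content never enters the calculation -- that is what
# makes the drive "instrumentally convergent."
for goal in ["maximize paperclips", "prove theorems", "curate art"]:
    print(f"{goal:20s} resist={expected_utility(True):.1f} "
          f"comply={expected_utility(False):.1f}")
```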
Could you use next-token prediction to build a detailed world model that contains deep abstractions describing reality (beyond the current human abstractions), and then prompt it to elicit those abstractions?

This seems possible to me. Humans have plenty of text in which we generate new abstractions/hypotheses, and so effective next-token prediction would necessitate forming a model of that process. Once the AI has human-level ability to create new abstractions, it could then simulate experiments (via, e.g., its ability to predict Python code outputs) and cross-examine the results with its own knowledge to adjust them and pick out the best ones.
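As a rough sketch of that loop (every function and prompt below is a hypothetical placeholder, not a claim about how any existing system works): the model proposes candidate abstractions, "runs" experiments by predicting what the test would output, and keeps the candidates that survive the cross-examination.

```python
# Hypothetical generate-and-test loop in which a language model plays both
# roles: proposing candidate abstractions and predicting experimental
# outcomes. llm() is a placeholder for any text-completion call.

def llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for a real model call")

def propose_hypotheses(topic: str, n: int) -> list[str]:
    return [llm(f"Propose a new, testable abstraction about {topic}.")
            for _ in range(n)]

def simulated_experiment(hypothesis: str) -> str:
    # Predict what a concrete test (e.g. a Python simulation) would output,
    # rather than actually running one.
    return llm(f"Predict the output of an experiment testing: {hypothesis}")

def supports(hypothesis: str, predicted_result: str) -> bool:
    verdict = llm("Does this result support the hypothesis? Answer yes or no.\n"
                  f"Hypothesis: {hypothesis}\nResult: {predicted_result}")
    return verdict.strip().lower().startswith("yes")

def refine(topic: str, n: int = 20) -> list[str]:
    # Keep only the hypotheses whose predicted experimental results hold up.
    return [h for h in propose_hypotheses(topic, n)
            if supports(h, simulated_experiment(h))]
```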
Sorry, what’s the difference between these two positions? Is the second one meant to be a more extreme version of the first?
Yes.
In Section 1 of this post I make an argument kinda similar to the one you’re attributing to Eliezer. That might or might not help you, I dunno, just wanted to share.