Often, disagreements boil down to a set of open questions to answer; here’s my best guess at how to decompose your disagreements.
I think you get different answers depending on which hypothesis you hold about how LLMs will generalise to AGI:
Hypothesis 1: LLMs are enough evidence that AIs will generally be able to follow what humans care about and will not naturally become power-seeking.
Hypothesis 2: AGI will have a sufficiently different architecture from LLMs, or will change so much on the way there, that current-day LLMs don't give much evidence about AGI.
Depending on your beliefs about these two hypotheses, you will have different opinions on this question.
The scenario outlined by Bostrom seems clearly different from the scenario with LLMs, which are actual general systems that do what we want and ~nothing more, rather than doing what we want as part of a strategy to seek power instrumentally. What am I missing here?
Let's take hypothesis 1 as the base case: what are some reasons it might fail, i.e., why LLMs might not give evidence about AGI?
1. Intelligence forces reflective coherence. The more capable a system becomes, the more it notices its own internal inconsistencies and resolves them by turning into a more coherent maximiser, which need not preserve human values.
2. Agentic AI acting in the real world is different from LLMs. Seen as part of an action-perception loop, an LLM gets essentially no feedback about how it changes the world; it behaves more like a pure predictor, modelling what the world will look like next, than an actor in it. It may be that power-seeking only arises in systems that can observe the consequences of their own actions and how those actions affect the world (a minimal sketch of this structural difference follows the list).
3. LLMs are optimised against an RLHF reward signal that can be Goodharted: they learn behaviour that looks good to the reward model without a deeper understanding of what we actually value. If human values are fragile, it will be difficult to hit that narrow target once these systems act in real-world situations and have to cope with the complexity of the future (a toy sketch of this over-optimisation dynamic also follows the list).
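To make point 2 a bit more concrete, here is a minimal sketch of the structural difference I read into it. Everything in it (the World class, the step functions, the toy policies) is invented purely for illustration; it is not anyone's proposed model, just the shape of the two loops.

```python
from dataclasses import dataclass

@dataclass
class World:
    state: float = 0.0

# (a) Pure predictor: it reads the world and outputs a guess about the next
#     state, but the world never sees that guess; there is no action-perception loop.
def predictor_step(world: World, model) -> float:
    return model(world.state)              # prediction only; world.state is untouched

# (b) Agent: its output is applied to the world, and the changed world is what
#     it observes next, so its behaviour can be selected on consequences.
def agent_step(world: World, policy) -> float:
    action = policy(world.state)
    world.state += action                  # the action feeds back into the environment
    return world.state                     # the next observation already reflects it

world_a, world_b = World(), World()
for _ in range(3):
    predictor_step(world_a, model=lambda s: s + 1.0)   # world_a.state stays 0.0
    agent_step(world_b, policy=lambda s: s + 1.0)      # world_b.state compounds each step
print(world_a.state, world_b.state)        # 0.0 vs. a world the system has changed
```

The claim in point 2 is that whatever pressure produces power-seeking would only bite in the second kind of loop, where the system's outputs shape what it later sees.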
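And to make point 3 concrete, here is a toy sketch of the over-optimisation dynamic, again with invented reward functions rather than anything from a real RLHF pipeline: the harder a proxy reward is optimised, the further the chosen behaviour drifts from the true objective.

```python
import numpy as np

# Toy sketch of Goodhart-style over-optimisation. Both reward functions are
# made up for illustration and do not describe any real RLHF setup.

rng = np.random.default_rng(0)

def true_reward(x):
    # What we actually care about: best around x = 1, harmful beyond x ~ 2.
    return x - 0.5 * x**2

def proxy_reward(x):
    # Imperfect learned proxy: monotone in x plus noise, so it never
    # represents the downside of pushing x too far.
    return x + 0.1 * rng.standard_normal()

# More candidates sampled per pick means more optimisation pressure on the proxy.
for pressure in [1, 4, 16, 64, 256]:
    picks = []
    for _ in range(200):                         # average over trials to smooth the noise
        xs = rng.uniform(0.0, 3.0, size=pressure)
        picks.append(max(xs, key=proxy_reward))  # the behaviour the proxy prefers
    avg_true = float(np.mean([true_reward(x) for x in picks]))
    print(f"optimisation pressure {pressure:3d}: mean true reward {avg_true:6.2f}")
# Trend: as optimisation pressure grows, the proxy-optimal behaviour drifts
# toward the region where the true reward turns negative.
```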
Personal belief: These are all open questions in my opinion, but I do see how LLMs give evidence about some of these points. I, for example, believe that language is a surprisingly compressed channel for alignment-relevant information, and I don't really believe that human values are as fragile as we think.
I'm more scared of 1 and 2 than of 3, but I would still love for us to have ten more years to figure this out, as it seems very non-obvious what the answers are.