Your assessment here seems to (mostly) line up with what I was trying to communicate in the post.
This is a simple consequence of the fact that you have to look at observations to figure out what to do; this is no different from the fact that a DQN playing Pong will look at where the ball is in order to figure out what action to take.
This is something I hoped to communicate in the “Mesa-Learning Everywhere?” section, especially point #3.
If you mean the chance that the weights of a neural network are going to encode some search algorithm, then this paper should be ~zero evidence in favor of it.
This is a point I hoped to convey in the Search vs Control section.
If you mean the chance that a policy trained by RL will “learn” without gradient descent, I can’t imagine a way that could fail to be true for an intelligent system trained by deep RL.
Ah, here is where the disagreement seems to lie. In another comment, you write:
Here on LW / AF, “mesa optimization” seems to only apply if there’s some sort of “general” learning algorithm, especially one that is “using search”, for reasons that have always been unclear to me.
I currently think this:
There is a spectrum between “just learning the task” vs “learning to learn”, which has to do with how “general” the learning is. DQN looking at the ball is very far on the “just learning the task” side.
This spectrum is very fuzzy. There is no clear distinction.
This spectrum is very relevant to inner alignment questions. If a system like GPT-3 is merely “locating the task”, then its behavior is highly constrained by the training set. On the other hand, if GPT-3 is “learning on the fly”, then its behavior is much less constrained by the training set, and it has correspondingly more potential for misaligned behavior (behavior which is capably achieving a different goal than the intended one). This is justified by an interpolation-vs-extrapolation type intuition.
The paper provides a small amount of evidence that things higher on the spectrum are likely to happen. (I’m going to revise the post to indicate that the paper only provides a small amount of evidence—I admit I didn’t read the paper to see exactly what they did, and should have anticipated that it would be something relatively unimpressive like multi-armed-bandit.)
Thinking about the spectrum, I see no reason not to expect things to continue climbing that spectrum. This updates me significantly toward expecting inner alignment problems to be probable, compared with the previous way I was thinking about it.
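To make the “learning to learn” end of the spectrum concrete, here is a toy sketch (my own construction, not what the paper did): a bandit “policy” whose parameters are frozen, but whose internal state adapts within an episode, so it figures out which arm pays off at runtime with no gradient descent involved.

```python
import random

# Toy sketch: a bandit "policy" with frozen parameters whose internal state
# adapts within an episode. It "learns" which arm pays off at runtime,
# with no gradient descent involved.
def run_episode(arm_probs, steps=1000, eps=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0, 0]        # hidden state: pulls per arm
    totals = [0.0, 0.0]    # hidden state: total reward per arm
    reward = 0.0
    for _ in range(steps):
        if rng.random() < eps or 0 in counts:
            arm = rng.randrange(2)         # explore
        else:
            means = [totals[i] / counts[i] for i in range(2)]
            arm = means.index(max(means))  # exploit current estimates
        r = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        totals[arm] += r
        reward += r
    return reward / steps

# The same frozen "weights" adapt to whichever arm happens to be better,
# ending up well above the 0.5 a non-adaptive random policy would get.
print(run_episode([0.2, 0.8]))
print(run_episode([0.8, 0.2]))
```

A learned RNN policy meta-trained across bandit episodes would implement the same runtime adaptation implicitly in its recurrent state, which is why the bandit result is (weak) evidence for things higher on the spectrum.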
All of that seems reasonable. (I indeed misunderstood your claim, mostly because you cited the spontaneous emergence post.)
Thinking about the spectrum, I see no reason not to expect things to continue climbing that spectrum. This updates me significantly toward expecting inner alignment problems to be probable, compared with the previous way I was thinking about it.
Fair enough; I guess I’m unclear on how you can think about it other than this way.
Yeahhh, idk. All I can currently articulate is that, previously, I thought of it as a black swan event.
Random question: does this also update you towards “alignment problems will manifest in real systems well before they are powerful enough to take over the world”?
Context: I see this as a key claim for the (relative to MIRI) alignment-by-default perspective, and I expect many people at MIRI disagree with this claim (though I don’t know why they disagree).
I’m very curious to know whether people at MIRI in fact disagree with this claim.
I would expect that they don’t… e.g. Eliezer seems to think we’ll see them and patch them unsuccessfully: https://www.facebook.com/jefftk/posts/886930452142?comment_id=886983450932
Yeah it’s plausible that the actual claims MIRI would disagree with are more like:
Problems manifest ⇒ high likelihood we understand the underlying cause
We understand the underlying cause ⇒ high likelihood we fix it (or don’t build powerful AI) rather than applying “surface patches”
Yep. I’d love to see more discussion around these cruxes (e.g. I’d be up for a public or private discussion sometime, or moderating one with someone from MIRI). I’d guess some of the main underlying cruxes are:
How hard are these problems to fix?
How motivated will the research community be to fix them?
How likely will developers be to use the fixes?
How reliably will developers need to use the fixes? (e.g. how much x-risk would result from a small company *not* using them?)
Personally, OTTMH (numbers pulled out of my ass), my views on these cruxes are:
It’s hard to say, but I’d say there’s a ~85% chance they are extremely difficult (effectively intractable on short-to-medium (~40yrs) timelines).
A small minority (~1-20%) of researchers will be highly motivated to fix them, once they are apparent/prominent. More researchers (~10-80%) will focus on patches.
Conditioned on fixes being easy and cheap to apply, large orgs will be very likely to use them (~90%); small orgs less so (~50%). Fixes are likely to be easy to apply if they are cheap enough to be deemed “practical” (we’ll build good tools), but they are very unlikely (~10%) to be cheap enough.
It will probably need to be highly reliable; “the necessary intelligence/resources needed to destroy the world goes down every year” (unless we make a lot of progress on governance, which seems fairly unlikely (~15%)).
Sure, also making up numbers, everything conditional on the neural net paradigm, and only talking about failures of single-single intent alignment:
~90% that there aren’t problems or we “could” fix them on 40-year timelines
I’m not sure exactly what is meant by motivation so will not predict, but there will be many people working on fixing the problems
“Are fixes used” is not a question in my ontology; something counts as a “fix” only if it’s cheap enough to be used. If you instead ask “did the team fail to use an existing fix that counterfactually would have made the difference between existential catastrophe and not” (possibly because they didn’t know of its existence), then < 10%, and I don’t have enough information to distinguish within 0-10%.
I’ll answer “how much x-risk would result from a small company *not* using them”: if it’s a single small company, then < 10%; I don’t have enough information to distinguish within 0-10%, and I expect on reflection I’d say < 1%.
I guess most of my cruxes are re: your 2nd “⇒”, and can almost be viewed as breaking down this question into sub-questions. It might be worth sketching out a quantitative model here.
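Taking that last suggestion literally, here is what a first pass at such a quantitative model might look like (every number below is a placeholder I made up for illustration, not anyone’s actual estimate): multiply out the chain of cruxes.

```python
# Toy quantitative model (all numbers are made-up placeholders, not
# anyone's stated estimates): chain the cruxes above into one probability.
p_manifest   = 0.9   # problems manifest before takeover-capable systems
p_understand = 0.5   # manifest -> we understand the underlying cause
p_real_fix   = 0.5   # understood -> real fix rather than a surface patch
p_deployed   = 0.8   # fix exists -> the relevant developers actually use it

p_averted = p_manifest * p_understand * p_real_fix * p_deployed
print(f"P(failure averted via this chain) = {p_averted:.2f}")  # 0.18
```

Even this crude version makes the disagreement easier to localize: each factor corresponds to one of the cruxes listed above, so differing bottom lines can be traced to a specific conditional probability.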