All of that seems reasonable. (I indeed misunderstood your claim, mostly because you cited the spontaneous emergence post.)
Thinking about the spectrum, I see no reason not to expect things to continue climbing it. This updates me significantly toward expecting inner alignment problems to be probable, compared with the way I was previously thinking about it.
Fair enough; I guess I’m unclear on how you can think about it other than this way.
Yeahhh, idk. All I can currently articulate is that, previously, I thought of it as a black swan event.
Random question: does this also update you towards “alignment problems will manifest in real systems well before they are powerful enough to take over the world”?
Context: I see this as a key claim for the (relative to MIRI) alignment-by-default perspective, and I expect many people at MIRI disagree with this claim (though I don’t know why they disagree).
I’m very curious to know whether people at MIRI in fact disagree with this claim.
I would expect that they don’t… e.g. Eliezer seems to think we’ll see them and patch them unsuccessfully: https://www.facebook.com/jefftk/posts/886930452142?comment_id=886983450932&comment_tracking=%7B%22tn%22%3A%22R%22%7D
Yeah it’s plausible that the actual claims MIRI would disagree with are more like:
Problems manifest ⇒ high likelihood we understand the underlying cause
We understand the underlying cause ⇒ high likelihood we fix it (or don’t build powerful AI) rather than applying “surface patches”
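These two implications chain into a product of conditional probabilities; a toy sketch of that decomposition (all numbers are illustrative placeholders of mine, not anyone's stated estimates):

```python
# Toy decomposition of "problems get properly fixed" into the two
# implications above, as a product of conditional probabilities.
# All numbers are illustrative placeholders, not anyone's estimates.
p_manifest = 0.9     # problems manifest before takeover-capable AI
p_understand = 0.7   # P(we understand the underlying cause | they manifest)
p_fix = 0.6          # P(we fix it rather than surface-patch | we understand)

p_properly_fixed = p_manifest * p_understand * p_fix  # 0.9 * 0.7 * 0.6 = 0.378
print(f"P(properly fixed) ≈ {p_properly_fixed:.2f}")  # prints 0.38
```

MIRI-style disagreement would then show up as much lower values for the second and third factors.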
Yep. I’d love to see more discussion around these cruxes (e.g. I’d be up for a public or private discussion sometime, or moderating one with someone from MIRI). I’d guess some of the main underlying cruxes are:
How hard are these problems to fix?
How motivated will the research community be to fix them?
How likely will developers be to use the fixes?
How reliably will developers need to use the fixes? (e.g. how much x-risk would result from a small company *not* using them?)
Personally, OTTMH (off the top of my head; numbers pulled out of my ass), my views on these cruxes are:
It’s hard to say, but I’d say there’s a ~85% chance they are extremely difficult (effectively intractable on short-to-medium (~40yrs) timelines).
A small minority (~1-20%) of researchers will be highly motivated to fix them, once they are apparent/prominent. More researchers (~10-80%) will focus on patches.
Conditioned on fixes being easy and cheap to apply, large orgs will be very likely to use them (~90%); small orgs less so (~50%). Fixes are likely to be easy to apply (we’ll build good tools) if they are cheap enough to be deemed “practical”, but it’s very unlikely (~10%) that they will be cheap enough.
It will probably need to be highly reliable; “the necessary intelligence/resources to destroy the world go down every year” (unless we make a lot of progress on governance, which seems fairly unlikely (~15%)).
Sure, also making up numbers, everything conditional on the neural net paradigm, and only talking about failures of single-single intent alignment:
~90% that there aren’t problems, or that we “could” fix them, on 40-year timelines
I’m not sure exactly what is meant by “motivation”, so I won’t predict that, but there will be many people working on fixing the problems
“Are fixes used” is not a question in my ontology; something counts as a “fix” only if it’s cheap enough to be used. You could instead ask “did the team fail to use an existing fix that counterfactually would have made the difference between existential catastrophe and not” (possibly because they didn’t know of its existence); to that I’d say < 10%, and I don’t have enough information to distinguish within 0-10%.
I’ll answer “how much x-risk would result from a small company *not* using them”: if it’s a single small company, then < 10%; I don’t have enough information to distinguish within 0-10%, and I expect on reflection I’d say < 1%.
I guess most of my cruxes are re: your 2nd “⇒”, and can almost be viewed as breaking down that question into sub-questions. It might be worth sketching out a quantitative model here.
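As a starting point, here is a minimal sketch of such a quantitative model, chaining the cruxes above into one rough estimate. The structure and every number here are assumptions of mine for illustration, not either participant’s actual views:

```python
# Toy model: chain the cruxes into a rough probability that alignment
# problems arise and then go unfixed. All numbers are illustrative only.
p_problems_arise = 0.8   # inner alignment problems manifest at all
p_tractable = 0.15       # fixes are tractable on ~40-year timelines
p_cheap_enough = 0.10    # tractable fixes are cheap enough to be "practical"
p_applied = 0.90         # a large developer applies a practical fix

p_fixed_in_time = p_tractable * p_cheap_enough * p_applied   # = 0.0135
p_unfixed_path = p_problems_arise * (1 - p_fixed_in_time)
print(f"P(problems arise and go unfixed) ≈ {p_unfixed_path:.3f}")  # prints 0.789
```

Even this crude chain makes the disagreement legible: the reply’s ~90% “no problems, or fixable” figure would cap the analogous quantity at roughly 10%, nearly an order of magnitude lower.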