With what little I know now, I think 2 would be clearest to people. However, I appreciate that it might contribute to capabilities, so maybe it’s an exfohazard.
4 is definitely interesting, and I think there are actually a few significant papers about instrumental convergence. More of those would be good, but I don’t think that gets to the heart of the matter w.r.t. a simple model to aid communication.
5. I would love some more information-theory stuff, drilling into how much information is communicated to, e.g., a model relative to how much is contained in the world. This could at the very least put some bounds on orthogonality (if ‘alignment’ is seen in terms of ‘preserving information’). I feel like this could be a productive avenue, but personally worry it’s above my pay grade (I did an MSc in Experimental Physics, but it’s getting rustier by the day).
Now that I think about it, maybe 1 and 3 would also contribute to a ‘package’ if this were seen as nothing but an attempt at didactics. But maybe including every step of the way complicates things too much; ideally there would be a core idea that could get most of the message across on its own. I think Orthogonality does this for a lot of people on LW, and maybe just a straightforward explainer of that with some information-theory sugar would be enough.
I was thinking more that the question here was also about more rigorous and less qualitative papers supporting the thesis, rather than just explanations for laypeople. One of the most common arguments against AI safety is that it’s unscientific because it lacks rigorous theoretical support. I’m not entirely satisfied with that criticism (I feel the general outlines are clear enough, and I don’t think you can really construct a quantitative framework to predict, e.g., what fraction of goals in the total possible goal-space benefit from power-seeking and self-preservation, so in the end you still have to go with the qualitative argument and your sense of how much it applies to reality), but if it has to be allayed, it should be by something that targets specific links in the causal chain of Doom. An important side bonus: formalizing and investigating these problems might actually reveal interesting potential alignment ideas.
I’ll have to read the papers you linked, but to me in general it feels like the topic most amenable to this sort of treatment is indeed Instrumental Convergence. The Orthogonality Thesis feels to me more like a philosophical statement, and indeed we had someone arguing for moral realism here just days ago. I don’t think you can really prove or disprove it from where we are. But I think if you phrased it as “being smart does not make you automatically good” you’d find that most people agree with you—especially people of the persuasion that currently regards AI safety, and “TESCREAL people” as they’ve dubbed us, with the most suspicion. Orthogonality is essentially moral relativism!
Now if we’re talking about a more outreach-oriented discussion, then I think all these concepts can be explained pretty clearly. I’d also recommend using analogies to, e.g., invasive species in new habitats, or the evils of colonialism, to stress why and how it’s both dangerous and unethical to unleash on the world things that are more capable than us and driven by too simple and greedy a goal; insist on the fact that what makes us special is the richness and complexity of our values, and that our highest values are the ones that most prevent us from simply going on a power-seeking rampage. That makes the notion of the first AGI being dangerous pretty clear: if you focus only on making them smart but slack off on making them good, the latter part will be pretty rudimentary, and so you’re creating something like a colony of intelligent bacteria.