Your model assumes a lot about the nature of AGI. Sure, if you jump directly to “we’ve created a coherent, agential, strategic strong AGI; what happens now?” you end up with a lot of default failure modes. The cruxes of disagreement are: what does AGI actually look like in practice, and what are the circumstances around its creation?
Is it agential? Does it have strategic planning capabilities that it tries to act on in the real world? Current systems don’t look like this.
Is it coherent? Even if it has the capability to plan strategically, can it coherently pursue those goals over time? Current systems don’t even have a concept of time, and there is some reason to believe that coherence and intelligence may be inversely correlated.
Do we get successive chances to work on aligning a system? If “AGI” is derived from scaling LLMs and adding cognitive scaffolding, doesn’t it seem highly likely that it will be both interpretable and steerable, given its use of natural language and our ability to iterate on failures?
Is “kindness” truly completely orthogonal to intelligence? If there is even a slight positive correlation, the future could look very different. Paul Christiano made an argument to this effect in a recent thread.
I think part of the challenge is that AGI is a very nebulous term, and presupposing an agential, strategic, coherent AGI involves assuming a lot of steps in between. Many of the disagreements turn on what the properties of the AGI are, rather than on specific claims about the likelihood of successful alignment. And there is a lot of uncertainty about how this technology actually ends up developing that isn’t accounted for in many of the standard AI x-risk models.
One of the take-home lessons from ChaosGPT and AutoGPT is that there will likely end up being agential AIs, even if the original AI wasn’t particularly agentic.
AutoGPT is an excellent demonstration of the point. Ask someone on this forum five years ago whether AGI might be a series of next-token predictors strung together, with modular cognition occurring in English, and they would have called you insane.
Yet if that is how we get something close to AGI, it seems like a best-case scenario, since interpretability is solved by default and you can measure alignment progress very easily.
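To make the “interpretable by default” point concrete, here is a toy sketch of the AutoGPT-style scaffold being described: a next-token predictor (stubbed out below, since this is purely illustrative) is called in a loop, and every intermediate cognitive step is a plain-English string that an overseer can read and veto before it is acted on. The function names (`stub_llm`, `run_agent`, `monitor`) are hypothetical, not from any real framework.

```python
def stub_llm(prompt: str) -> str:
    # Stand-in for a real next-token predictor: returns a canned
    # "thought -> action" step so the sketch runs without a model.
    return f"THOUGHT: break down the goal -> ACTION: search('{prompt[:20]}')"

def run_agent(goal: str, steps: int = 3, monitor=None) -> list:
    """Loop a language model, keeping all cognition as readable English."""
    transcript = []          # every cognitive step, in natural language
    context = goal
    for _ in range(steps):
        step = stub_llm(context)
        # Interpretability "for free": an overseer can inspect the
        # English trace and halt the agent before the step executes.
        if monitor is not None and not monitor(step):
            break
        transcript.append(step)
        context = step       # feed the model's own output back in
    return transcript

trace = run_agent("write a summary of alignment research")
```

The point of the sketch is the `transcript`: unlike opaque activations inside a single forward pass, the scaffold’s “cognition” lives in the prompt chain, so measuring alignment reduces to reading (or automatically filtering) that English log.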
Reality is weird in very unexpected ways.