That’s not nearly enough. A superintelligence will be much more economically powerful than humans. If it merely exhibits normal human levels of benevolence, truth-telling, law-obeying, money-seeking, power-seeking, and so on, it will deprive humans of everything.
It’s entirely legal to do jobs so cheaply that others can’t compete, and to show people optimized messages that get them to spend their savings on consumption. A superintelligence merely doing these two things superhumanly well, while staying within the law, is sufficient to deprive most people of everything. Moreover, the financial incentives point toward building superintelligences that will do exactly these things, while rushing to market and spending the minimum on alignment.
So, my guess at Leo’s reaction is one of RLHF-optimism. Even a bizarre-sounding idea like “get your ASI to love the US constitution” might be rationalized as merely a way to get a normal-looking world after you do RLHF++. And sure, if your AI jumps straight to maximizing the reward process, it will manipulate the humans and bad things will happen. But learning is path-dependent, and if you start with an AI that has the right concepts and train it on non-manipulative cases first, the RLHF-optimist would say it’s not implausible that we get an AI that genuinely doesn’t want to manipulate us like that.
Although I agree this is possible, and is in fact reason for modest optimism, it’s also based on a sort of hacky, unprincipled praxis of getting the model to learn good concepts, which probably fails a large percentage of the time even if we try our best. And even if it succeeds, I’m aesthetically displeased by a world that builds transformative AI and then uses it to largely maintain the status quo ante: something has gone wrong there, and in the worst case that wrongness will be reflected in the values learned by the AI these people made.
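To make the path-dependence intuition a bit more concrete, here is a minimal toy sketch of what “train on non-manipulative cases first” could look like as a curriculum over RLHF episodes. Everything in it is illustrative: the `Episode` fields, the `manipulation_risk` score, and the scalar `rlhf_step` “policy update” are hypothetical stand-ins I made up for this comment, not anyone’s actual pipeline.

```python
# Toy sketch: order RLHF episodes so that clearly non-manipulative cases
# shape the policy before any feedback that could be gamed by manipulating
# the rater. Purely illustrative; not a real training loop.

from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    prompt: str
    response: str
    reward: float             # rater feedback (itself potentially gameable)
    manipulation_risk: float  # 0.0 = clearly benign, 1.0 = rater likely manipulated


def curriculum(episodes: List[Episode]) -> List[Episode]:
    """Sort episodes so the model sees low-manipulation-risk cases first."""
    return sorted(episodes, key=lambda e: e.manipulation_risk)


def rlhf_step(policy_params: dict, episode: Episode, lr: float = 0.01) -> dict:
    """Placeholder 'policy update': nudge a scalar preference weight by reward."""
    updated = dict(policy_params)
    updated["helpfulness_weight"] = updated.get("helpfulness_weight", 0.0) + lr * episode.reward
    return updated


if __name__ == "__main__":
    data = [
        Episode("explain this tax form", "step-by-step walkthrough", reward=0.9, manipulation_risk=0.1),
        Episode("rate my essay", "flattery to win approval", reward=0.8, manipulation_risk=0.9),
        Episode("summarize this paper", "faithful summary", reward=0.7, manipulation_risk=0.2),
    ]
    params: dict = {}
    for ep in curriculum(data):  # benign cases update the policy first
        params = rlhf_step(params, ep)
    print(params)
```

The only point of the ordering is that the gameable, high-manipulation-risk feedback arrives after the policy has already been shaped by the benign cases; that sequencing is the whole of the RLHF-optimist’s bet about path dependence, and also exactly the hacky, unprincipled part.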
I think that response basically doesn’t work. But when I started writing out in more detail why it doesn’t work, it morphed into a book review that I’ve wanted to write for the last couple of years but have always put off. So thank you for finally making me write it!
>So, my guess at Leo’s reaction is one of RLHF-optimism.
This is more or less what he seems to say according to the transcript: he thinks we will have legible, trustworthy chain of thought at least for the initial automated AI researchers, that we can RLHF them, and that we can then use them to do alignment research. This is of course not a new concept and has been debated here ad nauseam, but it’s not a shocking view for a member of Ilya and Jan’s team, and he clearly cosigns it in the interview.