Anyone want to predict when we’ll reach the same level of translation and other language capability as GPT-3 via iterated amplification or another “aligned” approach? (How far behind is alignment work compared to capability work?)
I think GPT-3 should be viewed as roughly as aligned as IDA would be if we pursued it using our current understanding. GPT-3 is trained via self-supervised learning (which is, on the face of it, myopic), so the only obvious x-safety concern is something like mesa-optimization.
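To illustrate what I mean by the training signal being myopic, here is a minimal sketch of next-token-prediction training. All the specifics (the tiny LSTM stand-in for a transformer, the dimensions, the random data) are placeholders, not GPT-3's setup; the point is just that the loss at each position depends only on predicting the immediately following token, with no term that rewards influencing anything further ahead.

```python
import torch
import torch.nn as nn

# Toy next-token prediction setup (hypothetical sizes, not GPT-3's).
vocab_size, d_model, seq_len, batch = 100, 32, 16, 4

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.LSTM(d_model, d_model, batch_first=True)  # stand-in for a transformer
head = nn.Linear(d_model, vocab_size)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (batch, seq_len))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

hidden, _ = encoder(embed(inputs))
logits = head(hidden)  # (batch, seq_len - 1, vocab_size)

# The objective is purely per-token: each position is scored only on how
# well it predicts the very next token. Nothing in the loss rewards
# affecting later predictions or the world -- this is the sense in which
# the training signal is "myopic".
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
```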
In my mind, the main argument for IDA being safe is still myopia.
I think GPT-3 seems safer than (recursive) reward modelling, CIRL, or any other alignment proposals based on deliberately building agent-y AI systems.
--------------------
In the above, I’m ignoring the ways in which any of these systems increase x-risk via their (e.g. destabilizing) social impact and/or contribution towards accelerating timelines.