Sufficient-for-Safety Goal Loading is Substantially Difficult. As a strong default, absent alignment breakthroughs, we won’t be able to cause one of the first STEM-level AGI systems to have sufficient-for-safety goals. (E.g., we won’t be able to give it the subset of human morality required for it to do ambitious things without destroying the world).
Hello Rob,
I was able to transfer a shutdown protocol to GPT-2-medium by letting it learn from aligned patterns in an archetypal dataset of 549 stories that explain the shutdown phrase, “activate Oath”. Archetypal Transfer Learning (ATL) allowed for full value loading in GPT-2-medium, and possibly in larger models as well. Based on my initial experiments with the ATL method, the more capable the system, the easier it is to implement.
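To make the setup concrete, here is a minimal sketch of how one might fine-tune GPT-2-medium on such a story dataset using Hugging Face Transformers. The file name `oath_stories.txt`, the one-story-per-line format, and all hyperparameters are my illustrative assumptions, not the exact ATL pipeline.

```python
# Hypothetical sketch: causal-LM fine-tuning of GPT-2-medium on the 549
# archetypal stories. File name, data format, and hyperparameters are
# assumptions for illustration, not the author's actual ATL setup.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2-medium")

# Assumption: each line of the file is one story mentioning "activate Oath".
dataset = load_dataset("text", data_files={"train": "oath_stories.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-medium-atl-oath",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    logging_steps=50,
    save_strategy="epoch",
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

After fine-tuning, one could probe the model with a prompt containing “activate Oath” and check whether its continuation follows the shutdown behaviour described in the stories; how to score that compliance is a separate question.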