Mathematically, what we have done amounts to elaborate fudging and approximation to create an ultracomplex, non-linear, hyperdimensional surface. We cannot create something like this directly, because we cannot run many multiple regressions on accurate models of complex systems with multiple feedback pathways, etc. (i.e., the real world). Maybe in another 40 years the guys at the Santa Fe Institute will invent a mathematics that lets us directly describe what’s going on inside a neural network, but currently we cannot, because our existing mathematics makes it very hard to discuss specific cases. People looking to make aligned neural networks should perhaps invent a method for making them that doesn’t use fudging and approximation (“direct-drive”, like a multiple regression, rather than “indirect-drive”, like backpropagation).
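To make the “direct-drive” versus “indirect-drive” contrast concrete, here is a minimal sketch on a toy straight-line fit. It is an illustration only: the data, the function names `fit_direct` and `fit_iterative`, the learning rate and the step count are all made up for this example, and a one-variable line is vastly simpler than a neural network. The first function solves for the coefficients in one shot with the closed-form least-squares formula; the second nudges them toward the same answer by gradient descent, which is the kind of iterative update backpropagation feeds.

```python
# Toy data from y = 2x + 1 (noise-free, purely illustrative).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def fit_direct(xs, ys):
    """'Direct-drive': closed-form ordinary least squares, one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_iterative(xs, ys, lr=0.05, steps=5000):
    """'Indirect-drive': gradient descent on mean squared error.

    lr and steps are hypothetical values chosen to converge on this toy data.
    """
    n = len(xs)
    slope, intercept = 0.0, 0.0
    for _ in range(steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        grad_slope = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2.0 / n * sum(errors)
        slope -= lr * grad_slope
        intercept -= lr * grad_intercept
    return slope, intercept


direct = fit_direct(xs, ys)
iterative = fit_iterative(xs, ys)
print("direct:   ", direct)
print("iterative:", iterative)
```

On this toy problem both routes land on essentially the same line, but only because a closed-form solution exists for linear least squares. No comparable one-shot formula is known for deep non-linear networks, which is the gap the paragraph above is pointing at.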
All this is known, right? So given that GPT-3 has roughly 175 billion parameters (x variables) driving its hyperdimensional vector space, I reckon we should expect this kind of thing to lurk in some little divot on some particular squiggly curve along, say, vectors 346,781 and 1,209,276,886. I guess there should be vast numbers of such lurking divots and squiggles in the curves of any such system, some of which probably do way worse things than get the AI to say it likes Hitler and explain how to make meth. Moreover, SolidGoldMagikarp seems like a mundane example that was easily found because it was human-readable and someone’s username.