Grad student in NLP
Darcey
South Bend, Indiana, USA – ACX Meetups Everywhere Fall 2023
I’m a little bit skeptical of the argument in “Transformers are not special”. It seems like, if there were other architectures that had slightly greater capabilities than the Transformer and that were relatively low-hanging fruit, we would have found them already.
I’m in academia, so I can’t say for sure what is going on at big companies like Google. But I assume that, following the 2017 release of the Transformer, they allocated different research teams to pursuing different directions: some research teams for scaling, and others for the development of new architectures. It seems like, at least in NLP, all of the big flashy new models have come about via scaling. This suggests to me that, within companies like Google, the research teams assigned to scaling are experiencing success, while the research teams assigned to new architectures are not.
It genuinely surprises me that the Transformer has not been replaced as the dominant architecture since its release in 2017. It does not surprise me that sufficiently large or fancy RNNs can achieve similar performance to the Transformer. The lack of Transformer replacements makes me wonder whether we have hit the limit on the effectiveness of autoregressive language models, though I also wouldn’t be surprised if someone came up with a better autoregressive architecture soon.
Thanks for this post! More than anything I’ve read before, it captures the visceral horror I feel when contemplating AGI, including some of the supposed FAIs I’ve seen described (though I’m not well-read on the subject).
One thought though: the distinction between wrapper-minds and non-wrapper-minds does not feel completely clear-cut to me. For instance, consider a wrapper-mind whose goal is to maximize the number of paperclips, but rather than being given a hard-coded definition of “paperclip”, it is instructed to go out into the world, interact with humans, and learn about paperclips that way. In doing so, it learns (perhaps) that a paperclip is not merely a piece of metal bent into a particular shape, but is something that humans use to attach pieces of paper to one another. And so, in order to maximize the number of paperclips, it needs to make sure that both humans and paper continue to exist. And if, for instance, people started wanting to clip together more sheets of paper at a time, the AI might be able to notice this and start making bigger paperclips, because to it, the bigger ones would now be “more paperclip-y”, since they are better able to achieve the desired function of a paperclip.
I’m not saying this AI is a good idea. In fact, it seems like it would be a terrible idea, because it’s so easily gameable; all people need to do is start referring to something else as “paperclips” and the AI will start maximizing that thing instead.
My point is more just to wonder: does this hypothetical concept-learning paperclip maximizer count as a “wrapper-mind”?
Aha, thanks for clarifying this; I was going to ask this too. :)
I’d be interested to see links to those papers!