Htarlov comments on My AI Model Delta Compared To Yudkowsky

Htarlov 17 Jun 2024 9:24 UTC
9 points
3
I also think that the natural abstraction hypothesis holds with current AI. The architecture of LLMs is based on the capability of modeling ontology in terms of vectors in space of thousands of dimensions and there are experiments that show it generalizes and has somewhat interpretable meanings to the directions in that space. (even if not easy to interpret to the scale above toy models). Like in that toy example when you take the embedding vector of the word “king”, subtract the vector of “man”, add the vector of “woman” and you land near the position of “queen” in the space. LLM is based on those embedding spaces but also makes operations that direct focus and modify positions in that space (hence “transformer”) by meaning and information taken from other symbols in context (simplifying here). There are basically neural network layers that tell which word should have an impact on the meaning of other words in the text (weight) and layers that apply that change with some modifications. This being learned on the human texts internalizes our symbols, relations, and whole ontology (in broad terms of our species ontology—parts common to us all and different possibilities that happen in reality and in fiction).
Even if NAH doesn’t need to hold in general, I think in the case of LLM it holds.
Nevertheless, I see there is a different problem with LLM. That is, those models seem to me basically goalless but easily directed towards any goal. Meaning they are not based on the ontology of a single human mind and don’t internalize only a single certain morality and set of goals. They generalize over the whole human species ontology and a whole space of possible thoughts that can be made based on that. They also generalize over hypothetical and fictitious space, not only real humans in particular.
Human minds are all from some narrow area of space of possible minds and through our evolution and how we are raised usually we have certain stable and similar models of ethics, morality, and somewhat similar goals in life (except in some rare extreme cases). We sometimes entertain different possibilities, and we create movies and books with villains, but in circumstances of real decisions and not thought experiments or entertainment—we are similar. What LLM does is it generalizes into a much broader space and does not have any hard base there. So even if the ontology matches and even if LLM is barely capable of creating new concepts not fitting human ontology at the current level, the model is much broader in terms of goals and ways of processing over that general ontology. It basically has not one certain ontology, but a whole spectrum like a good actor that can take any role, but to the extreme. In other terms, it can “think*” similar thoughts, that are understandable by a human in the end, even if not that quickly, internally also have vectors that correspond to our ontology, but also can easily produce thoughts that no real sane human would have. Also, it has hardly any internal goals and none are stable. We have certain beliefs that are not based on objective facts and ontology, but we still believe them because we are who we are. We are not goal-less agents and it is hard to change our terminal goals in a meaningful way. For LLM goals are modeled by training into “default state” (being helpful etc.) and are stated as “system prompts” that are stated/repeated for it to base upon are part of the context that anchors it into some part of that very vast space of minds it can “emulate”. So LLM might be helpful and friendly by default, but If you tell it to simulate being a murderbot, it will. Additional training phases might make it harder to start it in that direction, but won’t totally remove that from the space of ways it can operate. This removes only some paths in that multidimensional space. Jailbreaking of GPTs shows that it’s possible to find other more complex paths around.
What is even more dangerous for me is that LLMs are already above human levels in some aspects—it just does not show yet because it is learned to emulate our ways of thinking and similar (the big area around it in the space of possible ways of thinking and possible goals, but not too alien, still can be grasped by humans).
We are capable of processing about 7 “symbols” at once in working memory (a few more in some cases). It might be a few dozen more if we take into account long-term memory and how we take context from it. This first number is taken from neurobiology literature (aka “The Magical Number Seven, Plus or Minus Two”), and the second one is an educated guess. This is the context window on which we work. That is nothing in comparison to LLM which can have a whole small book as its current working context, which means in principle it can process and create much much more complex thoughts. It does not do that because our text never does that and it learns to generalize over our capabilities. Nevertheless, in principle, it could and we might see it in action if we start of process of learning LLM on top of the output of LLM in closed loops. It might easily go beyond the space of our capabilities and complexity that is easily understandable to us (I don’t say it won’t be understandable, but we might need time to understand and might never grasp it as a whole without dividing it into less complex parts—like we can take compiled assembler code and organize it into meaningful functions with few levels of abstractions that we are able to understand).
* “think” in analogy as the process of thinking is different, but also has some similarities