Software engineering, parenting, cognition, meditation, other
Linkedin, Facebook, Admonymous (anonymous feedback)
Gunnar_Zarncke
I guess you got downvoted because it sounded like an ad.
But I think Lean Prover is a programming language with a lot of potential for AI alignment and is mentioned in Provably safe systems: the only path to controllable AGI. It would be good to have more knowledge about it on LessWrong.
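For readers who haven't seen Lean, here is a minimal illustrative Lean 4 snippet (my own example, not from the linked post), just a trivially machine-checked theorem, to give a flavor of the kind of formal guarantees the provably-safe-systems agenda would build on:

```lean
-- Minimal illustrative example: Lean checks this proof mechanically,
-- so the statement is guaranteed to hold.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```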
agents that have preferences about the state of the world in the distant future
What are these preferences? For biological agents, these preferences are grounded in some mechanism—what you call the Steering System—that evaluates “desirable states” of the world in some more or less directly measurable way (grounded in perception via the senses) and derives a signal of how desirable the state is, which the brain is optimizing for. For ML models, the mechanism is somewhat different, but there is also an input to the training algorithm that determines how “good” the output is. This signal is called reward, and it drives the system toward outputs that lead to states of high reward. But the path there depends on the specific optimization method, and the algorithm has to navigate such a complex loss landscape that it can get stuck for a very long time, if not forever, in areas of the search space that correspond to imperfect models. These imperfect models can be off in significant ways, and that’s why it may be useful to say that Reward is not the optimization target.
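As a toy illustration of that last point (my own sketch, not from the linked post): a simple hill-climbing learner driven purely by a reward signal can settle permanently on a local optimum, so what it ends up doing encodes an imperfect proxy rather than “the thing that maximizes reward”.

```python
import math
import random

def reward(theta):
    # Toy reward landscape: a broad local optimum near theta = 1 (reward ~1)
    # and a narrow global optimum near theta = 4 (reward ~2).
    return math.exp(-(theta - 1) ** 2) + 2 * math.exp(-10 * (theta - 4) ** 2)

theta = 0.0
for _ in range(10_000):
    candidate = theta + random.gauss(0, 0.05)  # small local exploration step
    if reward(candidate) > reward(theta):      # the reward signal drives every update...
        theta = candidate

# ...yet the learner settles near theta = 1, not the reward-maximizing theta = 4:
# the learned parameters are an imperfect model of "what is rewarded".
print(round(theta, 2), round(reward(theta), 2))
```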
The connection to Intuitive Self-Models is that even though the internal models of an LLM may be very different from human self-models, I think it is still quite plausible that LLMs and other models form models of the self. Such models are instrumentally convergent. Humans talk about the self, and the LLM does things that match these patterns. Maybe the underlying process in humans that gives rise to this is different, but humans learning about this can’t know the actual process either. And in the same way, the approximate model the LLM forms does not maximize the reward signal but can be quite far from it, as long as it is useful (in the sense of having higher reward than other such models/parameter combinations).
I think of my toenail as “part of myself”, but I’m happy to clip it.
Sure, the (body of the) self can include parts that can be cut or destroyed without “causing harm”; the removal can even have an overall positive effect. An AI in a compute center would, by analogy, also consider decommissioning failed hardware. And when defining humanity, we do have to be careful about what we mean when these “parts” could be humans.
About conjoined twins and the self:
Krista and Tatiana Hogan (Wikipedia) are healthy, functional craniopagus conjoined twins who are joined at the head and share parts of the brain—their thalami are connected via a thalamic bridge. They can report on each other’s perceptions and share affects.
I couldn’t find scientific papers that studied their brain function rigorously, but the paper A Case of Shared Consciousness looks at evidence from documentaries and discusses it. Here are some observational details:
Each is capable of reporting on inputs presented to the other twin’s body. For example, while her own eyes are covered, Tatiana is able to report on visual inputs to both of Krista’s eyes. Meanwhile, Krista can report on inputs to one of Tatiana’s eyes. Krista is able to report and experience distaste towards food that Tatiana is eating (the reverse has not been reported, but may also be true). An often repeated anecdote is that while Tatiana enjoys ketchup on her food, Krista will try to prevent her eating it. Both twins can also detect when and where the other twin’s body is being touched, and their mother reports that they find this easier than visual stimuli.
fMRI imaging revealed that Tatiana’s brain ‘processes signals’ from her own right leg, both her arms, and Krista’s right arm (the arm on the side where they connect). Meanwhile Krista’s brain processes signals from her own left arm, both her own legs and Tatiana’s left leg (again on the side where they connect). Each twin is able to voluntarily move each of the limbs corresponding to these signals.
The twins are also capable of voluntary bodily control for all the limbs within their ordinary body plans. As their mother Felicia puts it, “they can choose when they want to do it, and when they don’t want to do it.”
The twins also demonstrate a common receptivity to pain. When one twin’s body is harmed, both twins cry.
The twins report that they talk to each other in their heads. This had previously been suspected by family members due to signs of apparent collusion without verbalisation.

The popular article How Conjoined Twins Are Making Scientists Question the Concept of Self contains many additional interesting bits:
when a pacifier was placed in one infant’s mouth, the other would stop crying.
About the self:
Perhaps the experience of being a person locked inside a bag of skin and bone—with that single, definable self looking out through your eyes—is not natural or given, but merely the result of a changeable, mechanical arrangement in the brain. Perhaps the barriers of selfhood are arbitrary, bendable. This is what the Hogan twins’ experience suggests. Their conjoined lives hint at myriad permeations of the bodily self.
About qualia:
Tatiana senses the greeniness of Krista’s experience all the time. “I hate it!” she cries out, when Krista tastes some spinach dip.
(found via FB comment)
A much smaller subset was also published here, but it does include documents:
https://www.techemails.com/p/elon-musk-and-openai?r=1jki4r
Instrumental power-seeking might be less dangerous if the self-model of the agent is large and includes individual humans, groups, or even all of humanity, and if we can reliably shape it that way.
It is natural for humans to form a self-model that is bounded by the body, though it is also common to identify the self with only the brain or the mind, and there are other self-models. See also Intuitive Self-Models.
It is not clear what the self-model of an LLM agent would be. It could be
the temporary state of the execution of the model (or models),
the persistently running model and its memory state,
the compute resources (CPU/GPU/RAM) allocated to run the model and its collection of support programs,
the physical compute resources in some compute center(s),
the compute center as an organizational structure that includes the staff to maintain and operate not only the machines but also the formal organization (after all, without that, the machines will eventually fail), or
ditto, but including all the utilities and suppliers needed to continue operating it.
There is not as clear a physical boundary as in the human case. But even in the human case, babies especially depend on caregivers to a large degree.
There are indications that we can shape the self-model of LLMs: Self-Other Overlap: A Neglected Approach to AI Alignment
This sounds related to my complaint about the YUDKOWSKY + WOLFRAM ON AI RISK debate:
I wish there had been some effort to quantify @stephen_wolfram’s “pockets of irreducibility” (section 1.2 & 4.2) because if we can prove that there aren’t many or they are hard to find & exploit by ASI, then the risk might be lower.
I got this tweet wrong. I meant: if pockets of irreducibility are common and non-pockets are rare and hard to find, then the risk from superhuman AI might be lower. I think Stephen Wolfram’s intuition has merit but needs more analysis to be convincing.
There are two parts to the packaging that you have mentioned:
optimizing transport (not breaking the TV) is practical and involves everything but the receiver
enhancing reception (nice present wrapping) is cultural and involves the receiver(s)
Law of equal (or not so equal) opposite advice: There are some—probably few—flaws that you can keep, either because they are small and not worth the effort to fix, or because they make you more lovable and unique.
Example:
I’m a very picky eater. No sauces, no creams, no spicy foods. Lots of things excluded. It limits what I can eat, and I always have to explain.
But don’t presume any flaw you are attached to falls into this category. I’m also not strongly convinced of this.
A lot of the current human race spends a lot of time worrying—which, I think, probably has the same brainstorming dynamic and shares mechanisms with positively oriented brainstorming. I don’t know how to explain this; I think the avoidance of bad outcomes being a good outcome could do this work, but that’s not how worrying feels—it feels like my thoughts are drawn toward potential bad outcomes even when I have no idea how to avoid them yet.
If we were not able to think well about potentially bad outcomes, that would be a problem, as thinking clearly about them is, hopefully, what avoids them. But the question is a good one. My first intuition was that maybe the importance of an outcome—in both directions, good and bad—is relevant.
I like the examples from 8.4.2:
Note the difference between saying (A) “the idea of going to the zoo is positive-valence, a.k.a. motivating”, versus (B) “I want to go to the zoo”. [...]
Note the difference between saying (A) “the idea of closing the window popped into awareness”, versus (B) “I had the idea to close the window”. Since (B) involves the homunculus as a cause of new thoughts, it’s forbidden in my framework.
I think it could be an interesting mental practice to rephrase inner speech involving “I” in this way. I have been doing this for a while now. It started toward the end of my last meditation retreat when I switched to a non-CISM (or should I say “there was a switch in the thoughts about self-representation”?). Using “I” in mental verbalization felt like a syntax error, and other phrasings, like the ones you are suggesting here, felt more natural. Interestingly, it still makes sense to use “I” in conversations to refer to me (the speaker). I think that is part of why the CISM is so natural: it uses the same element in internal and external verbalizations[1].
Pondering your examples, I think I would render them differently. Instead of: “I want to go to the zoo,” it could be: “there is a desire to go to the zoo.” Though I guess if “desire to” stands for “positive-valence thought about”, it is very close to your “the idea of going to the zoo is positive-valence.”
In practice, the thoughts would be smaller, more like “there is [a sound][2],” “there is a memory of [an animal],” “there is a memory of [an episode from a zoo visit],” “there is a desire to [experience zoo impressions],” “there is a thought of [planning].” The latter gets complicated. The thought of planning could be positive valence (because plans often lead to desirable outcomes) or the planning is instrumentally useful to get the zoo impressions (which themselves may be associated with desirable sights and smells), or the planning can be aversive (because effortful), but still not strong enough to displace the desirable zoo visit.
For an experienced meditator, the fragments that can be noticed can be even smaller—or maybe more precursor-like. This distinction is easier to see with a quiet mind, where, before a thought fully occupies attention, glimmers of thoughts may bubble up[3]. This is related to noticing that attention is shifting. The everyday version of that happens when you notice that you got distracted by something. The subtler form is noticing small shifts during your regular thinking (e.g., I just noticed my attention shifting to some itch, without that really interrupting my writing flow). But I’m not sure how much of that is really a sense of attention vs. a retroactive interpretation of the thoughts. Maybe a more competent meditator can comment.
[1] And now I wonder whether the phonological loop, or whatever is responsible for language-like thoughts (maybe subvocalizations), is what makes the CISM the default model.
[2] Brackets indicate concepts that are described by words, not the words themselves.
[3] The question is, though, what part notices the noticing. Some thought of [noticing something] must be sufficiently stable and active to do so.
I think your explanation in section 8.5.2 resolves our disagreement nicely. You refer to S(X) thoughts that “spawn up” successive thoughts that eventually lead to X (I’d say X’) actions shortly after (or much later), while I was referring to S(X) that cannot give rise to X immediately. I think the difference is that you are more lenient about what X can be, such that S(X) can be about an X that happens much later, which wouldn’t work in my model of thoughts.
Explicit (self-reflective) desire
Statement: “I want to be inside.”
Intuitive model underlying that statement: There’s a frame (§2.2.3) “X wants Y” (§3.3.4). This frame is being invoked, with X as the homunculus, and Y as the concept of “inside” as a location / environment.
How I describe what’s happening using my framework: There’s a systematic pattern (in this particular context), call it P, where self-reflective thoughts concerning the inside, like “myself being inside” or “myself going inside”, tend to trigger positive valence. That positive valence is why such thoughts arise in the first place, and it’s also why those thoughts tend to lead to actual going-inside behavior.
In my framework, that’s really the whole story. There’s this pattern P. And we can talk about the upstream causes of P—something involving innate drives and learned heuristics in the brain. And we can likewise talk about the downstream effects of P—P tends to spawn behaviors like going inside, brainstorming how to get inside, etc. But “what’s really going on” (in the “territory” of my brain algorithm) is a story about the pattern P, not about the homunculus. The homunculus only arises secondarily, as the way that I perceive the pattern P (in the “map” of my intuitive self-model).
As I commented on Are big brains for processing sensory input?, I predict that the brain regions of a whale or orca responsible for spatiotemporal learning and memory are a big part of their encephalization.
[Linkpost] Building Altruistic and Moral AI Agent with Brain-inspired Affective Empathy Mechanisms
I’m not disagreeing with this assessment. The author has an agenda, but I don’t think it’s hidden in any way. It is mostly word thinking and social association. But that’s how the opposition works!
I believe this has been done in Google’s Multilingual Neural Machine Translation (GNMT) system that enables zero-shot translations (translating between language pairs without direct training examples). This system leverages shared representations across languages, allowing the model to infer translations for unseen language pairs.
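To make the mechanism concrete, here is a rough sketch of the idea (my own illustration; the token names and sentences are just for exposition): the multilingual model is trained on many language pairs at once, with an artificial token marking the desired target language, and the shared representation then lets it handle pairs it never saw together in training.

```python
# Illustrative sketch only, not Google's actual code or data format.
# One shared model is trained on many language pairs; an artificial token
# (here "<2es>", "<2en>") tells it which target language to produce.
train_pairs = [
    ("<2es> Hello, how are you?", "Hola, ¿cómo estás?"),  # English -> Spanish
    ("<2en> Wie geht es dir?",    "How are you?"),         # German  -> English
    # note: no German -> Spanish pairs in the training data
]

# At inference time the same model can still be asked for German -> Spanish:
zero_shot_input = "<2es> Wie geht es dir?"
# The shared encoder maps the German sentence into the common representation
# space, and the target-language token steers decoding into Spanish.
```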
The link posted above is a lengthy and relatively well-sourced, if biased, post about Scott Alexander’s writing related to human biodiversity (HBD). The author is very clearly opposed to HBD. I think it is a decent read if you want to understand that position.
Thanks. I already got in touch with Masaharu Mizumoto.
Congrats again for the sequence! It all fits together nicely.
While it makes sense to exclude hallucinogenic drugs and seizures, at least hallucinogenic drugs seem to fit into the pattern if I understand the effect correctly.
Auditory hallucinations, top-down processing and language perception—this paper says that imbalances in top-down cortical regulation are responsible for auditory hallucinations:
Participants who reported AH in the week preceding the test had a higher false alarm rate in their auditory perception compared with those without such (recent) experiences.
And this page Models of psychedelic drug action: modulation of cortical-subcortical circuits says that hallucinogenic drugs lead to such imbalances. So it is plausibly the same mechanism.
Scott Alexander for psychiatry and drugs and many other topics
Paul Graham for startups specifically, but his essays cover a much wider space
Scott Adams for persuasion, humor, and recently a lot of political commentary—not neutral; he has his own agendas
Robin Hanson—Economics, esp. long-term, very much out-of-distribution thinking
Zvi Mowshowitz for AI news (and some other research-heavy topics; previously COVID-19)
I second Patrick McKenzie.
Maybe create a Quotes Thread post with the rule that quotes have to be upvoted, and if you like them, you can add a react.