I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, LinkedIn, and more at my website.
Steven Byrnes
I have a simple model with toy examples of where non-additivity in personality and other domains comes from, see §4.3.3 here.
You can provoke intense FEAR responses in animals by artificially stimulating the PAG, anterior-medial hypothalamus, or anterior-medial amygdala. Low-intensity stimulation causes freezing; high-intensity stimulation causes fleeing. We know that stimulating animals in the PAG is causing them subjective distress and not just motor responses because they will strenuously avoid returning to places where they were once PAG-stimulated.
I think this is a bit misleading. Maybe this wasn’t understood when Panksepp was writing in 2004, but I think it’s clear today that we should think of PAG as kind of a “switchboard”—lots of little side-by-side cell groups, each of which triggers a different innate motor and/or autonomic behavior. If you stimulate one little PAG cell group, the animal will laugh; move a microscopic amount over to the next little PAG cell group, and the animal will flinch; move another microscopic amount over and the animal will flee, etc. [those are slightly-made-up examples, not literal, because I’m too lazy to look it up right now].
So then these old experiments where someone “stimulates PAG” would amount to basically mashing all the buttons on the switchboard at the same time. It creates some complicated mix of simultaneous reactions, depending on mutual inhibition etc. The net result for PAG turns out to be basically a fear reaction, I guess. But that’s not particularly indicative of how we should think about PAG.
Ditto for much or all of the hypothalamus, amygdala, septum, VTA, and more. Each of them has lots of little nearby cell groups that are quite different. Gross stimulation studies (as opposed to narrowly-targeted optogenetic studies) are interesting data-points, but they shouldn’t be interpreted as indicating broadly what that part of the brain does.
Complex social emotions like “shame” or “xenophobia” may indeed be human-specific; we certainly don’t have good ways to elicit them in animals, and (at the time of the book but probably also today) we don’t have well-validated neural correlates of them in humans either. In fact, we probably shouldn’t even expect that there are special-purpose brain systems for cognitively complex social emotions. Neurotransmitters and macroscopic brain structures evolve over hundreds of millions of years; we should only expect to see “special-purpose hardware” for capacities that humans share with other mammals or vertebrates.
I more-or-less agree; shameless plug for my related post Lisa Feldman Barrett versus Paul Ekman on facial expressions & basic emotions. In particular, here’s a summary of what I believe (from §2.3):
…We have a repertoire of probably hundreds of human-universal “innate behaviors”. Examples of innate behaviors presumably include things like vomiting, “disgust reaction(s)”, “laughing”, “Duchenne smile”, and so on.
Different innate behaviors are associated with specific, human-universal, genetically-specified groups of neurons in the hypothalamus and brainstem.
These innate behaviors can involve the execution of specific and human-universal facial expressions, body postures, and/or other physiological changes like cortisol release.
These innate behaviors have human-universal triggers (and suppressors). But it’s hard to describe exactly what those triggers are, because the triggers are internal signals like “Signal XYZ going into the hypothalamus and brainstem”, not external situations like “getting caught breaking the rules”.
…
Separately, we also have a bunch of concepts relating to emotion—things like “guilt” and “schadenfreude”, as those words are used in everyday life. These concepts are encoded as patterns of connections in the cortex (including hippocampus) and thalamus, and they form part of our conscious awareness.
Emotion concepts typically relate to both innate behaviors and the situation in which those behaviors are occurring. For example, Ekman says “surprise” versus “fear” typically involve awfully similar facial expressions, which raises the possibility that the very same innate behaviors tend to activate in both “surprise” and “fear”, and that we instead distinguish “surprise” from “fear” based purely on context.
…the relation between emotion concepts and concurrent innate behaviors is definitely not 1-to-1! But it’s definitely not “perfectly uncorrelated” either!
Emotion concepts, like all other concepts, are at least somewhat culturally-dependent, and are learned within a lifetime, and play a central role in how we understand and remember what’s going on.
Shame and xenophobia are obviously emotion concepts. But do they have any straightforward correspondence to innate behaviors? For xenophobia, I’d guess “mostly no”—I think there are innate behaviors related to responding to enemies in general, and when those behaviors get triggered in a certain context we call it xenophobia. For shame, I dunno.
Also, shameless plug for Neuroscience of human social instincts: a sketch, in which I attempt to reverse-engineer the set of “innate behaviors” related to human compassion, spite, and status-seeking.
I think parts of the brain are non-pretrained learning algorithms, and parts of the brain are not learning algorithms at all, but rather innate reflexes and such. See my post Learning from scratch in the brain for justification.
I tend to associate “feeling the AGI” with being able to make inferences about the consequences of AGI that are not completely idiotic.
Are you imagining that AGI means that Claude is better and that some call center employees will lose their jobs? Then you’re not feeling the AGI.
Are you imagining billions and then trillions of autonomous brilliant entrepreneurial agents, plastering the Earth with robot factories and chip factories and solar cells? Then you’re feeling the AGI.
Are you imagining a future world where the idea of a human starting a company or making an important government decision is as laughably absurd as the idea of a tantrum-prone kindergartener starting a company or making an important government decision? Then you’re feeling the AGI.
The economists who forecast that AGI will cause GDP growth to increase by less than 50 percentage points are definitely not feeling the AGI. Timothy B. Lee definitely does not feel the AGI. I do think there are lots of people who “feel the AGI” in the sense of saying things about the consequences of AGI that are not completely, transparently idiotic, but who are still wrong about the consequences of AGI. Feeling the AGI is a low bar! Actually getting it right is much harder! …At least, that’s how I interpret the term “feel the AGI”.
FWIW I’m also bearish on LLMs but for reasons that are maybe subtly different from OP’s. I tend to frame the issue in terms of “inability to deal with a lot of interconnected layered complexity in the context window”, which comes up when there are a lot of idiosyncratic, interconnected ideas in one’s situation or knowledge that do not exist on the internet.
This issue incidentally comes up in “long-horizon agency”, because if e.g. you want to build some new system or company or whatever, you usually wind up with a ton of interconnected idiosyncratic “cached” ideas about what you’re doing and how, and who’s who, and what’s what, and what the idiosyncratic constraints and properties and dependencies are in your specific software architecture, etc. The more such interconnected bits of knowledge you need for what you’re doing—knowledge which is by definition not on the internet, and thus must be in the context window instead—the more I expect foundation models to struggle on those tasks, now and forever.
But that problem is not exactly the same as a problem with long-horizon agency per se. I would not be too surprised or updated by seeing “long-horizon agency” in situations where, every step along the way, pretty much everything you need to know to proceed, is on the internet.
More concretely, suppose there’s a “long-horizon task” that’s very tag-team-able—i.e., Alice has been doing the task, but you could at any moment fire Alice, take a new smart generalist human off the street, Bob, and then Bob would be able to continue the task smoothly after very little time talking to Alice or scrutinizing Alice’s notes and work products. I do think there are probably tag-team-able “long-horizon tasks” like that, and I expect future foundation models to probably be able to do those tasks. (But I don’t think tag-team-able long-horizon tasks are sufficient for TAI.)
This is also incidentally how I reconcile “foundation models do really well on self-contained benchmark problems, like CodeForces” with “foundation models are not proportionally performant on complex existing idiosyncratic codebases”. If a problem is self-contained, that puts a ceiling on the amount of idiosyncratic layered complexity that needs to be piled into the context window.
Humans, by comparison, are worse than foundation models at incorporating idiosyncratic complexity in the very short term (seconds and minutes), but the sky’s the limit if you let the human gain familiarity with an idiosyncratic system or situation over the course of days, weeks, months.
Good questions, thanks!
In 2.4.2, you say that things can only get stored in episodic memory if they were in conscious awareness. People can sometimes remember events from their dreams. Does that mean that people have conscious awareness during (at least some of) their dreams?
My answer is basically “yes”, although different people might have different definitions of the concept “conscious awareness”. In other words, in terms of map-territory correspondence, I claim there’s a phenomenon P in the territory (some cortex neurons / concepts / representations are active at any given time, and others are not, as described in the post), and this phenomenon P gets incorporated into everyone’s map, and that’s what I’m talking about in this post. And this phenomenon P is part of the territory during dreaming too.
But it’s not necessarily the case that everyone will define the specific English-language phrase “conscious awareness” to indicate the part of their map whose boundaries are drawn exactly around that phenomenon P. Instead, for example, some people might feel like the proper definition of “conscious awareness” is something closer to “the phenomenon P in the case when I’m awake, and not drugged, etc.”, which is really P along with various additional details and connotations and associations, such as the links to voluntary control and memory. Those people would still be able to conceptualize the phenomenon P, of course, and it would still be a big part of their mental worlds, but to point to it you would need a whole sentence, not just the two words “conscious awareness”.
Is there anything you can say about what unconsciousness is? i.e. Why is there nothing in conscious awareness during this state? - Is the cortex not thinking any (coherent?) thoughts? (I have not studied unconsciousness.)
I think sometimes the cortex isn’t doing much of anything, or at least, not running close-enough-to-normal that neurons representing thoughts can be active.
Alternatively, maybe the cortex is doing its usual thing of activating groups of neurons that represent thoughts and concepts—but it’s neither forming memories (that last beyond a few seconds), nor taking immediate actions. Then you “look unconscious” from the outside, and you also “look unconscious” from the perspective of your future self. There’s no trace of what the cortex was doing, even if it was doing something. Maybe brain scans can distinguish that possibility though.
About the predictive learning algorithm in the human brain…
I think I declare that out-of-scope for this series, from some combination of “I don’t know the complete answer” and “it might be a dangerous capabilities question”. Those are related, of course—when I come upon things that might be dangerous capabilities questions, I often don’t bother trying to answer them :-P
should I think of it as: two parameters having similarly high Bayesian posterior probability, but the brain not explicitly representing this posterior, instead using something like local hill climbing to find a local MAP solution—bistable perception corresponding to the two different solutions this process converges to?
Yup, sounds right.
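As a toy illustration of that picture (a made-up bimodal posterior and a generic hill climber, not a claim about the actual cortical algorithm):

```python
# Toy sketch of "local hill climbing finds one of two local MAP solutions".
# The bimodal (log-)posterior below is made up for illustration.
import numpy as np

def log_posterior(x):
    # Two roughly equally probable interpretations: modes near x = -2 and x = +2
    return np.logaddexp(-0.5 * (x + 2) ** 2, -0.5 * (x - 2) ** 2)

def hill_climb(x, steps=200, lr=0.1, eps=1e-4):
    for _ in range(steps):
        grad = (log_posterior(x + eps) - log_posterior(x - eps)) / (2 * eps)
        x += lr * grad
    return x

# Different starting points (different momentary noise / context) converge to
# different local MAP solutions, the analogue of the two stable percepts.
for x0 in [-0.5, 0.5]:
    print(f"start {x0:+.1f} -> settles at {hill_climb(x0):+.2f}")
```

The process never represents “both interpretations at once”; it just lands in one basin or the other depending on where it started.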
to what extent should I interpret the brain as finding a single solution (MLE/MAP) versus representing a superposition or distribution over multiple solutions (fully Bayesian)?
I think it can represent multiple possibilities to a nonzero but quite limited extent; I think the superposition can only be kinda local, confined to a particular subregion of the cortex and lasting only a fraction of a second. I talk about that a bit in §2.3.
in which context should I interpret the phrase “the brain settling on two different generative models”
I wrote “your brain can wind up settling on either of [the two generative models]”, not both at once.
…Not sure if I answered your question.
[CLAIMED!]
ODD JOB OFFER: I think I want to cross-post Intro to Brain-Like-AGI Safety as a giant 200-page PDF on arxiv (it’s about 80,000 words), mostly to make it easier to cite (as is happening sporadically, e.g. here, here). I am willing to pay fair market price for whatever reformatting work is necessary to make that happen, which I don’t really know (make me an offer). I guess I’m imagining that the easiest plan would be to copy everything into Word (or LibreOffice Writer), clean up whatever formatting weirdness comes from that, and convert to PDF. LaTeX conversion is also acceptable, but I imagine that would be much more work for no benefit.

I think inline clickable links are fine on arxiv (e.g. page 2 here), but I do have a few references to actual papers, and I assume those should probably be turned into a proper reference section at the end of each post / “chapter”. Within-series links (e.g. a link from Post 6 to a certain section of Post 4) should probably be converted to internal links within the PDF, rather than going out to the lesswrong / alignmentforum version. There are lots of textbooks / lecture notes on arxiv which can serve as models; I don’t really know the details myself.

The original images are all in Powerpoint, if that’s relevant. The end product should ideally be easy for me to edit if I find things I want to update. (Arxiv makes updates very easy; one of my old arxiv papers is up to version 5.)
…Or maybe this whole thing is stupid. If I want it to be easier to cite, I could just add in a “citation information” note, like they did here? I dunno.
Davidad responds with a brief argument for 1000 FLOP-equivalent per synapse-second (3 OOM more than my guess) on X as follows:
Ok, so assuming we agree on 1e14 synapses and 3e8 seconds, then where we disagree is on average FLOP(-equivalent) per synapse-second: you think it’s about 1, I think it’s about 1000. This is similar to the disagreement you flagged with Joe Carlsmith.
Note: at some point Joe interviewed me about this so there might be some double-counting of “independent” estimates here, but iirc he also interviewed many other neuroscientists.
My estimate would be a lot lower if we were just talking about “inference” rather than learning and memory. STDP seems to have complex temporal dynamics at the 10ms scale.
There also seem to be complex intracellular dynamics at play, possibly including regulatory networks, obviously regarding synaptic weight but also other tunable properties of individual compartments.
The standard arguments for the causal irrelevance of these to cognition (they’re too slow to affect the “forward pass”) don’t apply to learning. I’m estimating there’s like a 10-dimensional dynamical system in each compartment evolving at ~100Hz in importantly nonlinear ways.
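(Spelling out the arithmetic of that disagreement, just multiplying the numbers we both accept:)

```python
# Multiplying out the two estimates quoted above; nothing new here, just arithmetic.
synapses = 1e14       # synapse count we both accept
seconds = 3e8         # number of seconds we both accept (roughly ten years)

for label, flop_per_synapse_second in [("my guess", 1), ("Davidad's guess", 1000)]:
    total = synapses * seconds * flop_per_synapse_second
    print(f"{label}: {total:.0e} FLOP-equivalent")  # 3e+22 vs 3e+25
```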
I think OP is using “sequential” in an expansive sense that also includes e.g. “First I learned addition, then I learned multiplication (which relies on already understanding addition), then I learned the distributive law (which relies on already understanding both addition and multiplication), then I learned the concept of modular arithmetic (which relies on …) etc. etc.” (part of what OP calls “C”). I personally wouldn’t use the word ‘sequential’ for that—I prefer a more vertical metaphor like ‘things building upon other things’—but that’s a matter of taste I guess. Anyway, whatever we want to call it, humans can reliably do a great many steps, although that process unfolds over a long period of time.
…And not just smart humans. Just getting around in the world, using tools, etc., requires giant towers of concepts relying on other previously-learned concepts.
Obviously LLMs can deal with addition and multiplication and modular arithmetic etc. But I would argue that this tower of concepts building on other concepts was built by humans, and then handed to the LLM on a silver platter. I join OP in being skeptical that LLMs (including o3 etc.) could have built that tower themselves from scratch, the way humans did historically. And I for one don’t expect them to be able to do that thing until an AI paradigm shift happens.
Self-dialogue: Do behaviorist rewards make scheming AGIs?
In case anyone missed it, I stand by my reply from before— Applying traditional economic thinking to AGI: a trilemma
If you offer a salary below 100 watts equivalent, humans won’t accept, because accepting it would mean dying of starvation. (Unless the humans have another source of wealth, in which case this whole discussion is moot.) This is not literally a minimum wage, in the conventional sense of a legally-mandated wage floor; but it has the same effect as a minimum wage, and thus we can expect it to have the same consequences as a minimum wage.
This is obviously (from my perspective) the point that Grant Slatton was trying to make. I don’t know whether Ben Golub misunderstood that point, or was just being annoyingly pedantic. Probably the former—otherwise he could have just spelled out the details himself, instead of complaining, I figure.
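(For concreteness, here is the back-of-envelope arithmetic behind treating “100 watts” as the floor in the quote above; it is roughly the metabolic power a human body needs just to stay alive:)

```python
# Why "100 watts": roughly the rate at which a human body burns food energy.
watts = 100
joules_per_day = watts * 86_400        # seconds per day
kcal_per_day = joules_per_day / 4184   # joules per kilocalorie
print(f"{kcal_per_day:.0f} kcal/day")  # ~2065 kcal/day, i.e. roughly a subsistence diet
```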
It was Grant Slatton but Yudkowsky retweeted it
I like reading the Sentinel email newsletter once a week for time-sensitive general world news, and https://en.wikipedia.org/wiki/2024 (or https://en.wikipedia.org/wiki/2025 etc.) once every 3-4 months for non-time-sensitive general world news. That adds up to very little time—maybe ≈1 minute per day on average—and I think there are more than enough diffuse benefits to justify that tiny amount of time.
I feel like I’ve really struggled to identify any controllable patterns in when I’m “good at thinky stuff”. Gross patterns are obvious—I’m reliably great in the morning, then my brain kinda peters out in the early afternoon, then pretty good again at night—but I can’t figure out how to intervene on that, except scheduling around it.
I’m extremely sensitive to caffeine, and have a complicated routine (1 coffee every morning, plus in the afternoon I ramp up from zero each weekend to a full-size afternoon tea each Friday), but I’m pretty uncertain whether I’m actually getting anything out of that besides a mild headache every Saturday.
I wonder whether it would be worth investing the time and energy into being more systematic to suss out patterns. But I think my patterns would be pretty subtle, whereas yours sound very obvious and immediate. Hmm, is there an easy and fast way to quantify “CQ”? (This pops into my head but seems time-consuming, and like it’s testing the wrong thing.) …I’m not really sure where to start tbh.
…I feel like what I want to measure is a 1-dimensional parameter extremely correlated with “ability to do things despite ugh fields”—presumably what I’ve called “innate drive to minimize voluntary attention control” being low a.k.a. “mental energy” being high. Ugh fields are where the parameter is most obvious to me but it also extends into thinking well about other topics that are not particularly aversive, at least for me, I think.
Sorry if I missed it, but you don’t seem to address the standard concern that mildly-optimizing agents tend to self-modify into (or create) strongly-optimizing agents.
For example (copying from my comment here), let’s say we make an AI that really wants there to be exactly 100 paperclips in the bin. There’s nothing else it wants or desires. It doesn’t care a whit about following human norms, etc.
But, there’s one exception: this AI is also “lazy”—every thought it thinks, and every action it takes, is mildly aversive. So it’s not inclined to, say, build an impenetrable fortress around the bin just for an infinitesimal probability increment. “Seems like a lot of work! It’s fine as is,” says the AI to itself.
But hey, here’s something it can do: rent some server time on AWS, and make a copy of its own source code and trained model, but comment out the “laziness” code block. That’s not too hard; even a “lazy” AI would presumably be capable of doing that. And the result will be a non-lazy AI that works tirelessly and uncompromisingly towards incrementing the probability of there being exactly 100 paperclips—first 99.99%, then 99.9999%, etc. That’s nice! (from the original AI’s perspective). Or more specifically, it offers a small benefit for zero cost (from the original AI’s perspective).
It’s not wildly different from a person saying “I want to get out of debt, but I can’t concentrate well enough to hold down a desk job, so I’m going to take Adderall”. It’s an obvious solution to a problem.
…OK, in this post, you don’t really talk about “AI laziness” per se, I think, instead you talk about “AI getting distracted by other things that now seem to be a better use of its time”, i.e. other objectives. But I don’t think that changes anything. The AI doesn’t have to choose between building an impenetrable fortress around the bin of paperclips versus eating lunch. “Why not both?”, it says. So the AI eats lunch while its strongly-optimizing subagent simultaneously builds the impenetrable fortress. Right?
I’m still curious about how you’d answer my question above. Right now, we don’t have ASI. Sometime in the future, we will. So there has to be some improvement to AI technology that will happen between now and then. My opinion is that this improvement will involve AI becoming (what you describe as) “better at extrapolating”.
If that’s true, then however we feel about getting AIs that are “better at extrapolating”—its costs and its benefits—it doesn’t much matter, because we’re bound to get those costs and benefits sooner or later on the road to ASI. So we might as well sit tight and find other useful things to do, until such time as the AI capabilities researchers figure it out.
…Furthermore, I don’t think the number of months or years between “AIs that are ‘better at extrapolating’” and ASI is appreciably larger if the “AIs that are ‘better at extrapolating’” arrive tomorrow, versus if they arrive in 20 years. In order to believe that, I think you would need to expect some second bottleneck standing between “AIs that are ‘better at extrapolating’”, and ASI, such that that second bottleneck is present today, but will not be present (as much) in 20 years, and such that the second bottleneck is not related to “extrapolation”.
I suppose that one could argue that availability of compute will be that second bottleneck. But I happen to disagree. IMO we already have an absurdly large amount of compute overhang with respect to ASI, and adding even more compute overhang in the coming decades won’t much change the overall picture. Certainly plenty of people would disagree with me here. …Although those same people would probably say that “just add more compute” is actually the only way to make AIs that are “better at extrapolation”, in which case my point would still stand.
I don’t see any other plausible candidates for the second bottleneck. Do you? Or do you disagree with some other part of that? Like, do you think it’s possible to get all the way to ASI without ever making AIs “better at extrapolating”? IMO it would hardly be worthy of the name “ASI” if it were “bad at extrapolating” :)
You’re not the first to complain about my terminology here, but nobody can tell me what terminology is right. So, my opinion is: “No, it’s the genetics experts who are wrong” :)
If you take some stupid outcome like “a person’s fleep is their grip strength raised to the power of their alcohol tolerance”, and measure fleep across a population, you will obviously find that there’s a strong non-additive genetic contribution to that outcome. A.k.a. epistasis. If you want to say “no, that’s not really non-additive, and it’s not really epistasis, it’s just that ‘fleep’ is a damn stupid outcome to analyze”, then fine, but then the experts really need to settle on a standardized technical term for “damn stupid outcomes to analyze”, and then need to consider the possibility that pretty much every personality trait and mental health diagnosis (among other things) is a “damn stupid outcome” in much the same way.
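Here’s a toy simulation of that point, in case it helps. “Grip” and “tolerance” are each constructed to be perfectly additive in the genome, and all the numbers are made up:

```python
# "Fleep" = grip strength raised to the power of alcohol tolerance.
import numpy as np

rng = np.random.default_rng(0)
n_people, n_loci = 20_000, 200
half = n_loci // 2

# 0/1/2 genotypes at independent loci; half the loci affect grip, half tolerance
geno = rng.binomial(2, 0.5, size=(n_people, n_loci)).astype(float)
w_grip = rng.normal(0, 0.10, half)
w_tol = rng.normal(0, 0.05, half)

grip = 8.0 + geno[:, :half] @ w_grip        # purely additive in the genome
tolerance = 2.0 + geno[:, half:] @ w_tol    # purely additive in the genome
fleep = grip ** tolerance                   # the "damn stupid outcome to analyze"

def additive_r2(y):
    """Variance explained by the best additive (linear-in-genotype) model."""
    X = np.column_stack([np.ones(n_people), geno])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1.0 - np.var(y - X @ beta) / np.var(y)

for name, y in [("grip", grip), ("tolerance", tolerance), ("fleep", fleep)]:
    print(f"{name:9s} additive R^2 = {additive_r2(y):.3f}")

# grip and tolerance come out at essentially 1.0 (fully additive), while fleep
# comes out noticeably lower, even though fleep is a deterministic function of
# the genome. The shortfall is "non-additive genetic variance", i.e. epistasis,
# manufactured purely by the choice of outcome variable.
```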
I do hope to unravel the deep structure of personality variation someday! In particular, what exactly are the linear “traits” that correspond directly to brain algorithm settings and hyperparameters? (See A Theory of Laughter and Neuroscience of human social instincts: a sketch for the very early stages of that. Warning: long.)
I guess a generic answer would be: the path FROM brain algorithm settings and hyperparameters TO decisions and preferences passes through a set of large-scale randomly-initialized learning algorithms churning away for a billion seconds. (And personality traits are basically decisions and preferences—see examples here.) That’s just a massive source of complexity, obfuscating the relationship between inputs and outputs.
A kind of analogy is: if you train a set of RL agents, each with slightly different reward functions, their eventual behavior will not vary smoothly with those reward functions. Instead there will be lots of “phase shifts” and such.
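A cartoon version of that analogy (a two-armed bandit and a tabular Q-learner; the environment and all numbers are made up):

```python
# Sweep a reward parameter w smoothly, train a fresh tiny Q-learning agent at
# each value, and look at the trained behavior: it changes discontinuously.
import numpy as np

def train_agent(w, episodes=3000, lr=0.1, eps=0.1, seed=0):
    # Two-armed bandit: arm 0 pays 1.0, arm 1 pays w, both plus a little noise
    rng = np.random.default_rng(seed)
    q = np.zeros(2)
    for _ in range(episodes):
        a = rng.integers(2) if rng.random() < eps else int(np.argmax(q))
        r = (1.0 if a == 0 else w) + rng.normal(0, 0.1)
        q[a] += lr * (r - q[a])
    return int(np.argmax(q))  # the trained agent's greedy behavior

for w in np.linspace(0.8, 1.2, 9):
    print(f"reward parameter w = {w:.2f} -> trained agent prefers arm {train_agent(w)}")

# The preferred arm jumps from 0 to 1 somewhere near w = 1.0, even though the
# reward function was varied smoothly.
```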
So again, we have “a set of large-scale randomly-initialized learning algorithms that run for a billion seconds” on the pathway from the genome to (preferences and decisions). And there’s nothing like that on the pathway from the genome to more “physical” traits like blood pressure. Trained models are much more free to vary across an extremely wide, open-ended, high-dimensional space, compared to biochemical developmental pathways.
See also Heritability, Behaviorism, and Within-Lifetime RL: “As adults in society, people gradually learn patterns of thought & behavior that best tickle their innate internal reward function as adults in society.”
Different people have different innate reward functions, and so different people reliably settle into different patterns of thought & behavior, by the time they reach adulthood. But given a reward function (and learning rate and height and so on), the eventual patterns of thought & behavior that they’ll settle into are reliably predictable, at least within a given country / culture.