I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, Twitter, Mastodon, Threads, Bluesky, GitHub, Wikipedia, Physics-StackExchange, LinkedIn
Steven Byrnes
My view is that the “constrict blood vessels and tense muscles” action (or whatever it is) is less like moving your finger, and more like speeding up your heart rate: sorta consciously controllable but not by a simple and direct act of willful control. I personally was talking to my hands rather than talking to my subconscious, but whatever, either way, I see it as a handy trick to send out the right nerve signals. Again like how if you want to release adrenaline, you think of something scary, you don’t think “Adrenal gland, Activate!” (Unless you’ve specifically practiced.)
I guess where I differ in emphasis from you is that I like to talk about how an important part of the action is really happening at the location of the pain, even if the cause is in the brain. I find that people talking about “psychosomatic” tend to be cutting physiology out of the loop altogether, though you didn’t quite say that yourself. The other different emphasis is whether there’s any sense whatsoever in which some part of the person wants the pain to happen because of some ulterior motive. I mean, that kind of story very much did not resonate with my experience. My RSI flare-ups were always pretty closely associated with using my hands. I guess I shouldn’t over-generalize from my own experience. Shrug.
“Constricting blood vessels” seems like a broad enough mechanism to be potentially applicable to back spasms, RSI, IBS, ulcers, and all the other superficially different indications we’ve all heard of. But I don’t know much about physiology or vasculature, and I don’t put too much stock in that exact description. Could also be something about nerves I guess?
This paper replaces a normal feedforward image classifier with a mesa-optimizing one (it builds generative models of different possibilities and picks the one that best matches the data). The result was better and far more human-like than a traditional image classifier; e.g., the same examples that are ambiguous to humans are ambiguous to the model, and vice versa. I also understand that the human brain is very big into generative modeling of everything. So I expect that ML systems of the future will approach 100% mesa-optimizers, while non-optimizing feedforward NNs will become rare. This post is a good framework and I’m looking forward to follow-ups!
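To make the “build generative models and pick the best match” idea concrete, here is a toy analysis-by-synthesis classifier. This is a hypothetical minimal sketch of the general idea (one simple diagonal-Gaussian generative model per class, scored by log-likelihood), not the paper’s actual architecture:

```python
import math

def fit_gaussian(samples):
    # Fit a diagonal Gaussian (per-dimension mean and variance) to a
    # list of equal-length feature vectors; this is our stand-in for a
    # "generative model" of one class.
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(dim)]
    var = [sum((s[i] - mean[i]) ** 2 for s in samples) / n + 1e-6
           for i in range(dim)]
    return mean, var

def log_likelihood(x, mean, var):
    # How well does this class's generative model explain input x?
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def classify(x, models):
    # "Generate each possibility and pick the one that best matches
    # the data" -- rather than a single feedforward pass.
    return max(models, key=lambda c: log_likelihood(x, *models[c]))

# Made-up 2-D training data for two classes:
models = {
    "cat": fit_gaussian([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0]]),
    "dog": fit_gaussian([[1.0, 1.1], [1.1, 1.0], [1.0, 1.0]]),
}
print(classify([0.05, 0.05], models))  # prints "cat"
```

One nice property of this framing: when an input sits between two class models, their likelihoods are close, so the model’s “ambiguity” falls out naturally, which is roughly the human-like behavior described above.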
I haven’t seen anyone argue for CRT the way you describe it. I always thought the argument was that we are concerned about “rational AIs” (I would say more specifically, “AIs that run searches through a space of possible actions, in pursuit of a real-world goal”), because (1) We humans have real-world goals (“cure Alzheimer’s” etc.) and the best way to accomplish a real-world goal is generally to build an agent optimizing for that goal (well, that’s true right up until the agent becomes too powerful to control, and then it becomes catastrophically false), (2) We can try to build AIs that are not in this category, but screw up*, (3) Even if we here all agree to not build this type of agent, it’s hard to coordinate everyone on earth to never do it forever. (See also: Rohin’s two posts on goal-directedness.)
In particular, when Eliezer argued a couple years ago that we should be mainly thinking about AGIs that have real-world-anchored utility functions (e.g. here or here) I’ve always fleshed out that argument as: ”...This type of AGI is the most effective and powerful type of AGI, and we should assume that society will keep making our AIs more and more effective and powerful until we reach that category.”
*(Remember, any AI is running searches through some space in pursuit of something, otherwise you would never call it “intelligence”. So one can imagine that the intelligent search may accidentally get aimed at the wrong target.)
Can you please give an update on footnotes? I notice people have written posts with footnotes, so it must be possible, but I can’t figure out how. Thanks in advance!
1hr talk: Intro to AGI safety
Hi John, Are you saying that there should be more small teams in AGI safety rather than increasing the size of the “big” teams like OpenAI safety group and MIRI? Or are you saying that AGI safety doesn’t need more people period?
Looks like MIRI is primarily 12 people. Does that count as “large”? My impression is that they’re not all working together on exactly the same narrow project. So do they count as 1 “team” or more than one?
The FLI AI grants go to a diverse set of little university research groups. Is that the kind of thing you’re advocating here?
ETA: The link you posted says that a sufficiently small team is: “I’d suspect this to be less than 15 people, but would not be very surprised if this number was around 100 after all.” If we believe that (and I wouldn’t know either way), then there are no AGI safety teams on Earth that are “too large” right now.
I think I agree with this. The system is dangerous if its real-world output (pixels lit up on a display, etc.) is optimized to achieve a future-world-state. I guess that’s what I meant. If there are layers of processing that sit between the optimization process output and the real-world output, that seems like very much a step in the right direction. I dunno the details, it merits further thought.
I am imagining a flat plain of possible normative systems (goals / preferences / inclinations / whatever), with red zones sprinkled around marking those normative systems which are dangerous. CRT (as I understand it) says that there is a basin with consequentialism at its bottom, such that there is a systematic force pushing systems towards that. I’m imagining that there’s no systematic force.
So in my view (flat plain), a good AI system is one that starts in a safe place on this plain, and then doesn’t move at all … because if you move in any direction, you could randomly step into a red area. This is why I don’t like misaligned subsystems—it’s a step in some direction, any direction, away from the top-level normative system. Then “Inner optimizers / daemons” is a special case of “misaligned subsystem”, in which the random step happened to be into a red zone. Again, CRT says (as I understand it) that a misaligned subsystem is more likely than chance to be an inner optimizer, whereas I think a misaligned subsystem can be an inner optimizer but I don’t specify the probability of that happening.
Leaving aside what other people have said, it’s an interesting question: are there relations between the normative system at the top-level and the normative system of its subsystems? There’s obviously good reason to expect that consequentialist systems will tend to create consequentialist subsystems, and that deontological systems will tend to create deontological subsystems, etc. I can kinda imagine cases where a top-level consequentialist would sometimes create a deontological subsystem, because it’s (I imagine) computationally simpler to execute behaviors than to seek goals, and sub-sub-...-subsystems need to be very simple. The reverse seems less likely to me. Why would a top-level deontologist spawn a consequentialist subsystem? Probably there are reasons...? Well, I’m struggling a bit to concretely imagine a deontological advanced AI...
We can ask similar questions at the top-level. I think about normative system drift (with goal drift being a special case), buffeted by a system learning new things and/or reprogramming itself and/or getting bit-flips from cosmic rays etc. Is there any reason to expect the drift to systematically move in a certain direction? I don’t see any reason, other than entropy considerations (e.g. preferring systems that can be implemented in many different ways). Paul Christiano talks about a “broad basin of attraction” towards corrigibility but I don’t understand the argument, or else I don’t believe it. I feel like, once you get to a meta-enough level, there stops being any meta-normative system pushing the normative system in any particular direction.
So maybe the stronger version of not-CRT is: “there is no systematic force of any kind whatsoever on an AI’s top-level normative system, with the exceptions of (1) entropic forces, and (2) programmers literally shutting down the AI, editing the raw source code, and trying again”. I (currently) would endorse this statement. (This is also a stronger form of orthogonality, I guess.)
I wrote this 10,000-word blog post arguing that cold fusion is not real after all, on the basis of the experimental evidence. (The rest of the blog, 30 posts or so, spells out the argument that cold fusion is not real, on the basis of our knowledge of theoretical physics.) Obviously the conclusion is no surprise to most people here … but I still think the nitty-gritty details of these arguments are interesting and are somewhat hard to find elsewhere on the internet.
What are “decision theory self-modification arguments”? Can you explain or link?
Thanks, that was really helpful!! OK, so going back to my claim above: “there is no systematic force of any kind whatsoever on an AI’s top-level normative system”. So far I have six exceptions to this:
(0) If an agent has a “real-world goal” (utility function on future-world-states), we should expect increasingly rational goal-seeking behavior, including discovering and erasing hardcoded irrational behavior (with respect to that goal), as described by dxu. But I’m not counting this as an exception to my claim because the goal is staying the same.
(1) If an agent has a set of mutually-inconsistent goals / preferences / inclinations, it may move around within the convex hull (so to speak) of these goals / preferences / inclinations, as they compete against each other. (This happens in humans.) And then, if there is at least one preference in that set which is a “real-world goal”, it’s possible (though not guaranteed) that that preference will come out on top, leading to (0) above. And maybe there’s a “systematic force” pushing in some direction within this convex hull—i.e., it’s possible that, when incompatible preferences are competing against each other, some types are inherently likelier to win the competition than other types. I don’t know which ones those would be.
(2) In the (presumably unusual) case that an agent has a “self-defeating preference” (i.e. a preference which is likelier to be satisfied by the agent not having that preference, as in dxu’s awesome SHA example), we should expect the agent to erase that preference.
(3) As capybaralet notes, if there is evolution among self-reproducing AIs (god help us all), we can expect the population average to move towards goals promoting evolutionary fitness.
(4) Insofar as there is randomness in how agents change over time, we should expect a systematic force pushing towards “high-entropy” goals / preferences / inclinations (i.e., ones that can be implemented in lots of different ways).
(5) Insofar as the AI is programming its successors, we should expect a systematic force pushing towards goals / preferences / inclinations that are easy to program & debug & reason about.
(6) The human programmers can shut down the AI and edit the raw source code.
Agree or disagree? Did I miss any?
[Question] Is AlphaZero any good without the tree search?
One way to think about gradient descent is that if there’s an N-dimensional parameter space on which you build a grid of M^N points and do a grid search for the minimum, well, you can accomplish the same selection task with O(M) steps of gradient descent (at least if there’s no local minimum). So should we say that M steps of gradient descent give you O(N log M) bits of optimization? Or something like that? I’m not sure.
In a continuous scenario, AI remains at the same level of capability long enough for us to gain experience with deployed systems of that level, witness small accidents, and fix any misalignment. The slower the scenario, the easier it is to do this. In a moderately discontinuous scenario, there could be accidents that kill thousands of people. But it seems to me that a very strong discontinuity would be needed to get a single moment in which the AI causes an existential catastrophe.
I agree that slower makes the problem easier, but disagree about how slow is slow enough. I have pretty high confidence that a 200-year takeoff is slow enough; faster than that, I become increasingly unsure.
For example: one scenario would be that there are years, even decades, in which worse and worse AGI accidents occur, but the alignment problem is very hard and no one can get it right (or: aligned AGIs are much less powerful and people can’t resist tinkering with the more powerful unsafe designs). As each accident occurs, there’s bitter disagreement around the world about what to do about this problem and how to do it, and everything becomes politicized. Maybe AGI research will be banned in some countries, but maybe it will be accelerated in other countries, on the theory that (for example) smarter systems and better understanding will help with alignment. And thus there would be more accidents and bigger accidents, until sooner or later there’s an existential catastrophe.
I haven’t thought about the issue super-carefully … just a thought …
Jeff Hawkins on neuromorphic AGI within 20 years
According to A Recipe for Training NNs, model ensembles stop being helpful at ~5 models. But that’s when they all have the same inputs and outputs. The more brain-like thing is to have lots of models whose inputs comprise various different subsets of both the inputs and the other models’ outputs.
...But then, you don’t really call it an “ensemble”, you call it a “bigger more complicated neural architecture”, right? I mean, I can take a deep NN and call it “six different models, where the output of model #1 is the input of model #2 etc.”, but no one in ML would say that, they would call it a single six-layer model...
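The “lots of models seeing different subsets of inputs and of each other’s outputs” wiring can be sketched as a little directed acyclic graph. Everything here (the stand-in sub-models, weights, and feature names) is hypothetical, just to illustrate the topology:

```python
# Each entry maps a model name to (raw feature keys it sees,
# upstream models whose outputs it sees, the model itself).
# Sub-models are stand-ins: a weighted sum of their inputs.

def make_model(weight):
    return lambda inputs: weight * sum(inputs)

models = {
    "a": (["x0", "x1"], [], make_model(0.5)),
    "b": (["x2"], ["a"], make_model(1.0)),
    "c": (["x0"], ["a", "b"], make_model(0.25)),
}

def run(raw, models):
    # Evaluate models in topological order (here, dict insertion
    # order); each one sees its raw-feature subset plus the outputs
    # of its upstream models.
    outputs = {}
    for name, (feat_keys, upstream, f) in models.items():
        inputs = [raw[k] for k in feat_keys] + [outputs[m] for m in upstream]
        outputs[name] = f(inputs)
    return outputs

print(run({"x0": 1.0, "x1": 2.0, "x2": 3.0}, models))
```

And indeed, once you write it down, it looks less like an averaged ensemble and more like one bigger architecture with named sub-modules, which is the point above.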
Hmm, it’s true that a traditional RNN can’t imitate the detailed mechanism, but I think it can imitate the overall functionality. (But probably in a computationally inefficient way—multiple time-steps and multiple nodes.) I’m not 100% sure.
Fixed, thanks!
Ten years ago I had a whole miserable year of Repetitive Strain Injury (and other things too) and then had a miraculous one-day recovery after reading Sarno’s book. But I didn’t (and still don’t) agree with the way Sarno (and you) describe what the mechanism is and what the emotions are doing, at least for my own experience and the couple other people I know personally who had similar RSI experiences.
I think it’s possible to use muscles in a way that’s painful: something like having the muscles be generally tense and their blood supply constricted. I think state-of-mind can cause muscles to operate in this painful blood-supply-constricted mode, and that one such state of mind is the feeling “These muscles here are super-injured and when I use them it’s probably making things worse.”
If you think about it, it’s easy to think of lots of things that could predispose people to get stuck in this particular vicious cycle, including personality type, general stress level, factual beliefs about the causes and consequences of chronic pain, and so on. But in the end, I think it’s this pretty specific thing, not a catch-all generic mechanism of subconscious expression or whatever like you seem to be thinking.