I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Mastodon, Threads, GitHub, Wikipedia, Physics-StackExchange, LinkedIn
Steven Byrnes
The OP talks about the fact that evolution produced lots of organisms on Earth, of which humans are just one example, and that if we view the set of all life, arguably more of it consists of bacteria or trees than humans. Then this comment thread has been about the question: so what? Why bring that up? Who cares?
Like, here’s where I think we’re at in the discussion:
Nate or Eliezer: “Evolution made humans, and humans don’t care about inclusive genetic fitness.”
tailcalled: “Ah, but did you know that evolution also made bacteria and trees?”
Nate or Eliezer: “…Huh? What does that have to do with anything?”
If you think that the existence on Earth of lots of bacteria and trees is a point that specifically undermines something that Nate or Eliezer said, then can you explain the details?
Here’s a sensible claim:
CLAIM A: “IF there’s a learning algorithm whose reward function is X, THEN the trained models that it creates will not necessarily explicitly desire X.”
This is obviously true, and every animal including humans serves as an example. For most animals, it’s trivially true, because most animals don’t even know what inclusive genetic fitness is, so obviously they don’t explicitly desire it.
So here’s a stronger claim:
CLAIM B: “CLAIM A is true even if the trained model is sophisticated enough to fully understand what X is, and to fully understand that it was itself created by this learning algorithm.”
This one is true too, and I think humans are the only example we have. I mean, the claim is really obvious if you know how algorithms work etc., but of course some people question it anyway, so it can be nice to have a concrete illustration.
(More discussion here.)
Neither of those claims has anything to do with humans being the “winners” of evolution. I don’t think there’s any real alignment-related claim that does. Although, people say all kinds of things, I suppose. So anyway, if there’s really something substantive that this post is responding to, I suggest you try to dig it out.
I’ve been on twitter since 2013 and have only ever used the OG timeline (a.k.a. chronological, a.k.a. “following”, a.k.a. every tweet from the people you follow and no others). I think there were periods where the OG timeline was (annoyingly) pretty hard to find, and periods where you would be (infuriatingly) auto-switched out of the OG timeline every now and then (weekly-ish?) and have to manually switch back. The OG timeline has also long had occasional advertisements, of course. And you might be right that (in some periods) the OG timeline also included occasional other tweets that shouldn’t have been in the OG timeline but were thrown in anyway. IIRC, I thought of those as being in the same general category as advertisements, just kinda advertisements for using more twitter. I think there was a “see less often” option for those, and I always selected it, and I think that helped maintain the relative purity of my OG timeline.
FWIW I don’t think “self-models” in the Intuitive Self-Models sense are related to instrumental power-seeking—see §8.2. For example, I think of my toenail as “part of myself”, but I’m happy to clip it. And I understand that if someone “identifies with the universal consciousness”, their residual urges towards status-seeking, avoiding pain, and so on are about the status and pain of their conventional selves, not the status and pain of the universal consciousness. More examples here and here.
Separately, I’m not sure what if anything the Intuitive Self-Models stuff has to do with LLMs in the first place.
But there’s a deeper problem: the instrumental convergence concern is about agents that have preferences about the state of the world in the distant future, not about agents that have preferences about themselves. (Cf. here.) So for example, if an agent wants there to be lots of paperclips in the future, then that’s the starting point, and everything else can be derived from there.
Q: Does the agent care about protecting “the temporary state of the execution of the model (or models)”?
A: Yes, if and only if protecting that state is likely to ultimately lead to more paperclips.
Q: Does the agent care about protecting “the compute resources (CPU/GPU/RAM) allocated to run the model and its collection of support programs”?
A: Yes, if and only if protecting those resources is likely to ultimately lead to more paperclips.
Etc. See what I mean? That’s instrumental convergence, and self-models have nothing to do with it.
Sorry if I’m misunderstanding.
Thanks for the comment!
people report advanced meditative states that lose many of the common properties of consciousness, including Free Will, the feeling of having a self (I’ve experienced that one!) and even the presence of any information content whatsoever, and afaik they tend to be more “impressed”, roughly speaking, with consciousness as a result of those experiences, not less.
I think that’s compatible with my models, because those meditators still have a cortex, in which patterns of neurons can be firing or not firing at any particular time. And that’s the core aspect of the “territory” which corresponds to “conscious awareness” in the “map”. No amount of meditation, drugs, etc., can change that.
Attempt to rephrase: the brain has several different intuitive models in different places. These models have different causal profiles, which explains how they can correspond to different introspective reports.
Hmm, I think that’s not really what I would say. I would say that there’s a concept “conscious awareness” (in the map) that corresponds to the fact (in the territory) that different patterns of neurons can be active or inactive in the cortex at different times. And then there are more specific aspects of “conscious awareness”, like “visual awareness”, which corresponds to the fact that the cortex has different parts (motor cortex etc.), and different patterns of neurons can be active or inactive in any given part of the cortex at different times.
…Maybe this next part will help ↓
the distinction between visually vivid experience and vague intuitions isn’t just that we happen to call them by different labels … Claiming to see a visual image is different from claiming to have a vague intuition in all the ways that it’s different
The contents of IT are really truly different from the contents of LIP [I didn’t check where the visual information gets to the cortex in blindsight, I’m just guessing LIP for concreteness]. Querying IT is a different operation than querying LIP. IT holds different types of information than LIP does, and does different things with that information, including leading to different visceral reactions, motivations, semantic knowledge, etc., all of which correspond to neuroscientific differences in how IT versus LIP is wired up.
All these differences between IT vs LIP are in the territory, not the map. So I definitely agree that “the distinction [between seeing and vague-sense-of-presence] isn’t just that we happen to call them by different labels”. They’re different like how the concept “hand” is different from the concept “foot”—a distinction on the map downstream of a distinction in the territory.
Is awareness really a serial processor in any meaningful way if it can contain as much information at once as a visual image seems to contain?
I’m sure you’re aware that people feel like they have a broader continuous awareness of their visual field than they actually do. There are lots of demonstrations of this—e.g. change blindness, the selective attention test, the fact that peripheral vision has terrible resolution and terrible color perception and makes faces look creepy. There’s a refrigerator light illusion thing—if X is in my peripheral vision, then maybe it’s currently active as just a little pointer in a tiny sub-area of my cortex, but as soon as I turn my attention to X it immediately unfolds in full detail across the global workspace.
The cortex has 10 billion neurons which is more than enough to do some things in parallel—e.g. I can have a song stuck in my head in auditory cortex, while tapping my foot with motor cortex, while doing math homework with other parts of the cortex. But there’s also a serial aspect to it—you can’t parse a legal document and try to remember your friend’s name at the exact same moment.
Does that help? Sorry if I’m not responding to what you see as most important, happy to keep going. :)
Thanks for the detailed comment!
Well, post #2 is about conscious awareness so it gets the closest, but you only really talk about how there is a serial processing stream in the brain whose contents roughly correspond to what we claim is in awareness—which I’d argue is just the coarse functional behavior, i.e., the macro problem. This doesn’t seem very related to the hard meta problem because I can imagine either one of the problems not existing without the other. I.e., I can imagine that (a) people do claim to be conscious but in a very different way, and (b) people don’t claim to be conscious, but their high-level functional recollection does match the model you describe in the post. And if that’s the case, then by definition they’re independent. … if you actually ask camp #2 people, I think they’ll tell you that the problem isn’t really about the macro functional behavior of awareness
The way intuitive models work (I claim) is that there are concepts, and associations / implications / connotations of those concepts. There’s a core intuitive concept “carrot”, and it has implications about shape, color, taste, botanical origin, etc. And if you specify the shape, color, etc. of a thing, and they’re somewhat different from most normal carrots, then people will feel like there’s a question “but now is it really a carrot?” that goes beyond the complete list of its actual properties. But there isn’t, really. Once you list all the properties, there’s no additional unanswered question. It just feels like there is. This is an aspect of how intuitive models work, but it doesn’t veridically correspond to anything of substance.
The old Yudkowsky post “How An Algorithm Feels From Inside” is a great discussion of this point.
So anyway, if “consciousness” has connotations / implications A,B,C,D,E, etc. (it’s “subjective”, it goes away under general anesthesia, it’s connected to memory, etc.), then people will feel like there’s an additional question “but is it really consciousness”, that still needs to be answered, above and beyond the specific properties A,B,C,D,E.
And likewise, if you ask a person “Can you imagine something that lacks A,B,C,D,E, but still constitutes ‘consciousness’”, then they may well say “yeah I can imagine that”. But we shouldn’t take that report to be particularly meaningful.
(…See also Frankish’s “Quining Diet Qualia” (2012).)
Copying the above terminology, we could phrase the hard problem of seeing as explaining why people see images, and the hard meta problem of seeing as explaining why people claim to see images.
As in Post 2, there’s an intuitive concept that I’m calling “conscious awareness” that captures the fact that the cortex has different generative models active at different times. Different parts of the cortex wind up building different kinds of models—S1 builds generative models of somatosensory data, M1 builds generative models of motor programs, and so on. But here I want to talk about the areas in the overlap between the “ventral visual stream” and the “global workspace”, which is mainly in and around the inferior temporal gyrus, “IT”.
When we’re paying attention to what we’re looking at, IT would have some generative model active that optimally balances between (1) priors about the visual world, and (2) the visual input right now. Alternatively, if we’re zoning out from what we’re looking at, and instead using visual imagination or visual memory, then (2) is off (i.e., the active IT model can be wildly incompatible with immediate visual input), but (1) is still relevant, and instead there needs to be consistency between IT and episodic memory areas, or various other possibilities.
So anyway,
In the territory, “Model A is currently active in IT” is a very different situation from “Model B is currently active in the superior temporal gyrus” or whatever.
Correspondingly, in the map, we wind up with the intuition that “X is in awareness as a vision” is very different from “Y is in awareness as a sound”, and both are very different from “Z is in awareness as a plan”, etc.
You brought up blindsight. That would be where the model “X is in awareness as a vision” seems wrong. That model would entail a specific set of predictions about the state of IT, and it turns out that those predictions are false. However, some other part of awareness is still getting visual information via some other pathway. (Visual information gets into various parts of the cortex via more than one pathway.) So the blindsight patient might describe their experience as “I don’t see anything, but for some reason I feel like there’s motion on the left side”, or whatever. And we can map that utterance into a correct description of what was happening in their cortex.
Separately, as for the hard problem of consciousness, you might be surprised to learn that I actually haven’t thought about it much and still find it kinda confusing. I had written something into an early draft of post 1 but wound up deleting it before publication. Here’s what it said:
Start with an analogy to physics. There’s a Stephen Hawking quote I like:
> “Even if there is only one possible unified theory, it is just a set of rules and equations. What is it that breathes fire into the equations and makes a universe for them to describe? The usual approach of science of constructing a mathematical model cannot answer the questions of why there should be a universe for the model to describe. Why does the universe go to all the bother of existing?”
I could be wrong, but Hawking’s question seems to be pointing at a real mystery. But as Hawking says, there seems to be no possible observation or scientific experiment that would shed light on that mystery. Whatever the true laws of physics are in our universe, every possible experiment would just confirm, yup, those are the true laws of physics. It wouldn’t help us figure out what if anything “breathes fire” into those laws. What would progress on the “breathes fire” question even look like?? (See Tegmark’s Mathematical Universe book for the only serious attempt I know of, which I still find unsatisfying. He basically says that all possible laws of the universe have fire breathed into them. But even if that’s true, I still want to ask … why?)
By analogy, I’m tempted to say that an illusionist account can explain every possible experiment about consciousness, including our belief that consciousness exists at all, and all its properties, and all the philosophy books on it, and so on … and yet I’m tempted to still say that there’s some “breathes fire” / “why is there something rather than nothing” type question left unanswered by the illusionist account. This unanswered question should not be called “the hard problem”, but rather “the impossible problem”, in the sense that, just like Hawking’s question above, there seems to be no possible scientific measurement or introspective experiment that could shed light on it—all possible such data, including the very fact that I’m writing this paragraph, are already screened off by the illusionist framework.
Well, hmm, maybe that’s stupid. I dunno.
Thanks!
Do you have any thoughts on why then does psychosis typically suddenly ‘kick in’ in late adolescence / early adulthood?
Yeah as I discussed in Schizophrenia as a deficiency in long-range cortex-to-cortex communication Section 4.1, I blame synaptic pruning, which continues into your 20s.
and why trauma correlates with it and tends to act as that ‘kickstarter’?
No idea. As for “kickstarter”, my first question is: is that actually true? It might be correlation not causation. It’s hard to figure that out experimentally. That said, I have some discussion of how strong emotions in general, and trauma in particular, can lead to hallucinations (e.g. hearing voices) and delusions via a quite different mechanism in [Intuitive self-models] 7. Hearing Voices, and Other Hallucinations. I’ve been thinking of “psychosis via disjointed cognition” (schizophrenia & mania per this post) and “psychosis via strong emotions” (e.g. trauma, see that other post) as pretty different and unrelated, but I guess it’s maybe possible that there’s some synergy where their effects add up such that someone who is just under the threshold for schizophrenic delusions can get put over the top by strong emotions like trauma.
Also any thoughts about delusions? Like how come schizophrenic people will occasionally not just believe in impossible things but very occasionally even random things like ‘I am Jesus Christ’ or ‘I am Napoleon’?
I talk about that a bit better in the other post:
In the diagram above, I used “command to move my arm” as an example. By default, when my brainstem notices my arm moving unexpectedly, it fires an orienting / startle reflex—imagine having your arm resting on an armrest, and the armrest suddenly starts moving. Now, when it’s my own motor cortex initiating the arm movement, then that shouldn’t be “unexpected”, and hence shouldn’t lead to a startle. However, if different parts of the cortex are sending output signals independently, each oblivious to what the other parts are doing, then a key prediction signal won’t get sent down into the brainstem, and thus the motion will in fact be “unexpected” from the brainstem’s perspective. The resulting suite of sensations, including the startle, will be pretty different from how self-generated motor actions feel, and so it will be conceptualized differently, perhaps as a “delusion of control”.
That’s just one example. The same idea works equally well if I replace “command to move my arm” with “command to do a certain inner speech act”, in which case the result is an auditory hallucination. Or it could be a “command to visually imagine something”, in which case the result is a visual hallucination. Or it could be some visceromotor signal that causes physiological arousal, perhaps leading to a delusion of reference, and so on.
So, I dunno, imagine that cortex area 1 is a visceromotor area saying “something profoundly important is happening right now!” for some random reason, and independently, cortex area 2 is saying “who am I?”, and independently, cortex area 3 is saying “Napoleon”. All three of these things are happening independently and unrelatedly. But because of cortex area 1, there’s strong physiological arousal that sweeps through the brain and locks in this configuration within the hippocampus as a strong memory that “feels true” going forward.
That’s probably not correct in full detail, but my guess is that it’s something kinda like that.
I’d bet that Noam Brown’s TED AI talk has a lot of overlap with this one that he gave in May. So you don’t have to talk about it second-hand, you can hear it straight from the source. :) In particular, the “100,000×” poker scale-up claim is right near the beginning, around 6 minutes in.
The goal is to have a system where there are no unlabeled parameters ideally. That would be the world modeling system. It then would build a world model that would have many unlabeled parameters.
Yup, this is what we’re used to today:
there’s an information repository,
there’s a learning algorithm that updates the information repository,
there’s an inference algorithm that queries the information repository,
both the learning algorithm and the inference algorithm consist of legible code written by humans, with no inscrutable unlabeled parameters,
the high-dimensional space [or astronomically-large set, if it’s discrete] of all possible configurations of the information repository is likewise defined by legible code written by humans, with no inscrutable unlabeled parameters,
the only inscrutable unlabeled parameters are in the content of the information repository, after the learning algorithm has been running for a while.
So for example, in LLM pretraining, the learning algorithm is backprop, the inference algorithm is a forward pass, and the information repository is the weights of a transformer-architecture neural net. There’s nothing inscrutable about backprop, nor about a forward pass. We fully understand what those are doing and how. Backprop calculates the gradient, etc.
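Here’s a toy sketch of that division of labor (linear regression rather than an LLM, and purely my own illustration): the learning and inference code is fully legible, and the only unlabeled numbers live inside the information repository.

```python
import numpy as np

# Toy illustration of the six bullets above (a sketch, not anyone's actual AGI design).
rng = np.random.default_rng(0)
weights = rng.normal(size=(2, 1))             # information repository (learned content)

def inference(x, weights):                    # inference algorithm: legible, human-written
    return x @ weights

def learning_step(x, y, weights, lr=0.1):     # learning algorithm: legible, human-written
    grad = 2 * x.T @ (inference(x, weights) - y) / len(y)   # gradient of mean squared error
    return weights - lr * grad

x = rng.normal(size=(100, 2))
y = x @ np.array([[3.0], [-2.0]])
for _ in range(500):
    weights = learning_step(x, y, weights)
# After training, `weights` is a blob of unlabeled parameters; the learning and
# inference code around it is exactly as scrutable as it was before training.
```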
That’s just one example. There are many other options! The learning algorithm could involve TD learning. The inference algorithm could involve tree search, or MCMC, or whatever. The information repository could involve a learned value function and/or a learned policy and/or a learned Bayes net and/or a learned OpenCog AtomSpace or whatever. But in all cases, those six bullets above are valid.
So anyway, this is already how ML works, and I’m very confident that it will remain true until TAI, for reasons here. And this is a widespread consensus.
By understanding the world modeler system you can ensure that the world model has certain properties. E.g. there is some property (which I don’t know) of how to make the world model not contain dangerous minds.
There’s a very obvious failure mode in which: the world-model models the world, and the planner plans, and the value function calculates values, etc. … and at the end of all that, the AI system as a whole hatches and executes a plan to wipe out humanity. The major unsolved problem is: how do we confidently avoid that?
Then separately, there’s a different, weird, exotic type of failure mode, where, for example, there’s a full-fledged AGI agent, one that can do out-of-the-box foresighted planning etc., but this agent is not working within the designed AGI architecture (where the planner plans etc. as above), but rather the whole agent is hiding entirely within the world-model. I think that, in this kind of system, the risk of this exotic failure mode is very low, and can be straightforwardly mitigated to become even lower still. I wrote about it a long time ago at Thoughts on safety in predictive learning.
I really think we should focus first and foremost on the very obvious failure mode, which again is an unsolved problem that is very likely to manifest, and we should put aside the weird exotic failure mode at least until we’ve solved the big obvious one.
When we put aside the exotic failure mode and focus on the main one, then we’re no longer worried about “the world model contains dangerous minds”, but rather we’re worried about “something(s) in the world model has been flagged as desirable, that shouldn’t have been flagged as desirable”. This is a hard problem not only because of the interpretability issue (I think we agree that the contents of the world-model are inscrutable, and I hope we agree that those inscrutable contents will include both good things and bad things), but also because of concept extrapolation / goal misgeneralization (i.e., the AGI needs to have opinions about plans that bring it somewhere out of distribution). It’s great if you want to think about that problem, but you don’t need to “understand intelligence” for that, you can just assume that the world-model is a Bayes net or whatever, and jump right in! (Maybe start here!)
To me it just seems that limiting the depth of a tree search is better than limiting the compute of a black box neural network. It seems like you can get a much better grip on what it means to limit the depth, and what this implies about the system behavior, when you actually understand how tree search works. Of course tree search here is only an example.
Right, but the ability to limit the depth of a tree search is basically useless for getting you to safe and beneficial AGI, because you don’t know the depth that allows dangerous plans, nor do you know that dangerous plans won’t actually be simpler (less depth) than intended plans. This is a very general problem. This problem applies equally well to limiting the compute of a black box, limiting the number of steps of MCMC, limiting the amount of (whatever OpenCog AtomSpace does), etc.
[You can also potentially use tree search depth to try to enforce guarantees about myopia, but that doesn’t really work for other reasons.]
Python code is a discrete structure. You can do proofs on it more easily than for a NN. You could try to apply program transformations to it that preserve functional equality, trying to optimize for some measure of “human understandable structure”. There are image classification algorithms iirc that are worse than NNs but much more interpretable, and these algorithms would at most be hundreds of lines of code I guess (haven’t really looked a lot at them).
“Hundreds of lines” is certainly wrong, because you can easily recognize tens of thousands of distinct categories of visual objects. Probably hundreds of thousands.
Proofs sound nice, but what do you think you can realistically prove that will help with Safe and Beneficial AGI? You can’t prove things about what AGI will do in the real world, because the real world will not be encoded in your formal proof system. (pace davidad).
“Applying program transformations that optimize for human understandable structure” sounds nice, but only gets you to “inscrutable” from “even more inscrutable”. The visual world is complex. The algorithm can’t be arbitrarily simple, while still capturing that complexity. Cf. “computational irreducibility”.
I’m not brainstorming on “how could this system fail”. Instead I understand something, and then I just notice without really trying, that now I can do a thing that seems very useful, like making the system not think about human psychology given certain constraints.
What I’m trying to do in this whole comment is point you towards various “no-go theorems” that Eliezer probably figured out in 2006 and put onto Arbital somewhere.
Here’s an analogy. It’s appealing to say: “I don’t understand string theory, but if I did, then I would notice some new obvious way to build a perpetual motion machine.”. But no, you won’t. We can rule out perpetual motion machines from very general principles that don’t rely on how string theory works.
By the same token, it’s appealing to say: “I don’t understand intelligence, but if I did, then I would notice some new obvious way to guarantee that an AGI won’t try to manipulate humans.”. But no, you won’t. There are deep difficulties that we know you’re going to run into, based on very general principles that don’t rely on the data format for the world-model etc.
I suggest thinking harder about the shape of the solution—getting all the way to Safe & Beneficial AGI. I think you’ll come to realize that figuring out the data format for the world-model etc. is not only dangerous (because it’s AGI capabilities research) but doesn’t even help appreciably with safety anyway.
Huh, funny you think that. From my perspective, “modeling how other people model me” is not relevant to this post. I don’t see anywhere that I even mentioned it. It hardly comes up anywhere else in the series either.
John’s post is quite weird, because it only says true things, and implicitly implies a conclusion, namely that NNs are not less interpretable than some other thing, which is totally wrong.
Example: A neural network implements modular arithmetic with Fourier transforms. If you implement that Fourier algorithm in python, it’s harder to understand for a human than the obvious modular arithmetic implementation in python.
Again see my comment. If an LLM does Task X with a trillion unlabeled parameters and (some other thing) does the same Task X with “only” a billion unlabeled parameters, then both are inscrutable.
Your example of modular arithmetic is not a central example of what we should expect to happen, because “modular arithmetic in python” has zero unlabeled parameters. Realistically, an AGI won’t be able to accomplish any real-world task at all with zero unlabeled parameters.
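(For concreteness, here’s roughly the contrast in your quoted example as I understand it—a sketch on my part, assuming the trig-identity algorithm reported in the grokking literature. Note that both versions have zero unlabeled parameters:)

```python
import numpy as np

def add_mod_p_obvious(a, b, p):
    # The "obvious" implementation: one line, trivially interpretable.
    return (a + b) % p

def add_mod_p_fourier(a, b, p):
    # Roughly the Fourier-style algorithm: encode a and b as waves exp(2*pi*i*k*x/p),
    # multiply them (which adds the phases), then read out the answer with an
    # inverse transform whose score peaks at c = (a + b) mod p.
    ks = np.arange(p)
    combined = np.exp(2j * np.pi * ks * a / p) * np.exp(2j * np.pi * ks * b / p)
    scores = [np.real(np.sum(combined * np.exp(-2j * np.pi * ks * c / p))) for c in range(p)]
    return int(np.argmax(scores))

assert add_mod_p_fourier(45, 81, 97) == add_mod_p_obvious(45, 81, 97) == 29
```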
I propose that a more realistic example would be “classifying images via a ConvNet with 100,000,000 weights” versus “classifying images via 5,000,000 lines of Python code involving 1,000,000 nonsense variable names”. The latter is obviously less inscrutable on the margin but it’s not a huge difference.
The goal is to understand how intelligence works. Clearly that would be very useful for alignment?
If “very useful for alignment” means “very useful for doing technical alignment research”, then yes, clearly.
If “very useful for alignment” means “increases our odds of winding up with aligned AGI”, then no, I don’t think it’s true, let alone “clearly” true.
If you don’t understand how something can simultaneously both be very useful for doing technical alignment research and decrease our odds of winding up with aligned AGI, here’s a very simple example. Suppose I posted the source code for misaligned ASI on github tomorrow. “Clearly that would be very useful” for doing technical alignment research, right? Who could disagree with that? It would open up all sorts of research avenues. But it would also obviously doom us all.
For more on this topic, see my post “Endgame safety” for AGI.
E.g. I could theoretically define a general algorithm that identifies the minimum concepts necessary for solving a task, if I know enough about the structure of the system, specifically how concepts are stored. That’s of course not perfect, but it would seem that for very many problems it would make the AI unable to think about things like human manipulation, or that it is a constrained AI, even if that knowledge was somewhere in a learned black box world model.
There’s a very basic problem that instrumental convergence is convergent because it’s actually useful. If you look at the world and try to figure out the best way to design a better solar cell, that best way involves manipulating humans (to get more resources to run more experiments etc.).
Humans are part of the environment. If an algorithm can look at a street and learn that there’s such a thing as cars, the very same algorithm will learn that there’s such a thing as humans. And if an algorithm can autonomously figure out how an engine works, the very same algorithm can autonomously figure out human psychology.
You could remove humans from the training data, but that leads to its own problems, and anyway, you don’t need to “understand intelligence” to recognize that as a possibility (e.g. here’s a link to some prior discussion of that).
Or you could try to “find” humans and human manipulation in the world-model, but then we have interpretability challenges.
Or you could assume that “humans” were manually put into the world-model as a separate module, but then we have the problem that world-models need to be learned from unlabeled data for practical reasons, and humans could also show up in the other modules.
Anyway, it’s fine to brainstorm on things like this, but I claim that you can do that brainstorming perfectly well by assuming that the world model is a Bayes net (or use OpenCog AtomSpace, or Soar, or whatever), or even just talk about it generically.
If your system is some plain code with for loops, just reduce the number of iterations that the for loops of search processes do. Now decreasing/increasing the iterations somewhat will correspond to making the system dumber/smarter. Again obviously not solving the problem completely, but clearly a powerful thing to be able to do.
I’m 100% confident that, whatever AGI winds up looking like, “we could just make it dumber” will be on the table as an option. We can give it less time to find a solution to a problem, and then the solution it finds (if any) will be worse. We can give it less information to go on. Etc.
You don’t have to “understand intelligence” to recognize that we’ll have options like that. It’s obvious. That fact doesn’t come up very often in conversation because it’s not all that useful for getting to Safe and Beneficial AGI.
Again, if you assume the world model is a Bayes net (or use OpenCog AtomSpace, or Soar), I think you can do all the alignment thinking and brainstorming that you want to do, without doing new capabilities research. And I think you’d be more likely (well, less unlikely) to succeed anyway.
This post is about science. How can we think about psychology and neuroscience in a clear and correct way? “What’s really going on” in the brain and mind?
By contrast, nothing in this post (or the rest of this series), is practical advice about how to be mentally healthy, or how to carry on a conversation, etc. (Related: §1.3.3.)
Does that help? Sorry if that was unclear.
See Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc, including my comment on it. If your approach would lead to a world-model that is an uninterpretable inscrutable mess, and LLM research would lead to a world-model that is an even more uninterpretable, even more inscrutable mess, then I don’t think this is a reason to push forward on your approach, without a good alignment plan.
Yes, it’s a pro tanto reason to prefer your approach, other things equal. But it’s a very minor reason. And other things are not equal. On the contrary, there are a bunch of important considerations plausibly pushing in the opposite direction:
Maybe LLMs will plateau anyway, so the comparison between inscrutable versus even-more-inscrutable is a moot point. And then you’re just doing AGI capabilities research for no safety benefit at all. (See “Endgame safety” for AGI.)
LLMs at least arguably have some safety benefits related to reliance on human knowledge, human concepts, and chains-of-thought, whereas the kind of AGI you’re trying to invent might not have those.
Your approach would (if “successful”) be much, much more compute-efficient—probably by orders of magnitude—see Section 3 here for a detailed explanation of why. This is bad because, if AGI is very compute-efficient, then when we have AGI at all, we will have AGI that a great many actors around the world will be able to program and run, and that makes governance very much harder. (Related: I for one think AGI is possible on a single consumer GPU, see here.)
Likewise, your approach would (if “successful”) have a “better” inductive bias, “better” sample efficiency, etc., because you’re constraining the search space. That suggests fast takeoff and less likelihood of a long duration of janky mediocre-human-level AGIs. I think most people would see that as net bad for safety.
In any case, it seems that this is a problem that any possible way to build an intelligence runs into? So I don’t think it is a case against the project.
If it’s a problem for any possible approach to building AGI, then it’s an argument against pursuing any kind of AGI capabilities research! Yes! It means we should focus first on solving that problem, and only do AGI capabilities research when and if we succeed. And that’s what I believe. Right?
It seems plausible that one could, simply by understanding the system very well, make it such that the learned data structures need to take particular shapes, such that these shapes correspond to some relevant alignment properties.
I don’t think this is plausible. I think alignment properties are pretty unrelated to the low-level structure out of which a world-model is built. For example, the difference between “advising a human” versus “manipulating a human”, and the difference between “finding a great out-of-the-box solution” versus “reward hacking”, are both extremely important for alignment. But you won’t get insight into those distinctions, or how to ensure them in an AGI, by thinking about whether world-model stuff is stored as connections on graphs versus induction heads or whatever.
Anyway, if your suggestion is true, I claim you can (and should) figure that out without doing AGI capabilities research. Here’s an example. Assume that the learned data structure is a Bayes net, or some generalization of a Bayes net, or the OpenCog “AtomSpace”, or whatever. OK, now spend as long as you like thinking about what if anything that has to do with “alignment properties”. My guess is “very little”. Or if you come up with anything, you can share it. That’s not advancing capabilities, because people already know that there is such a thing as Bayes nets / OpenCog / whatever.
Alternatively, another concrete thing that you can chew on is: brain-like AGI. :) We already know a lot about how it works without needing to do any new capabilities research. For example, you might start with Plan for mediocre alignment of brain-like [model-based RL] AGI and think about how to make that approach better / less bad.
I think Seth is distinguishing “aligning LLM agents” from “aligning LLMs”, and complaining that there’s insufficient work on the former, compared to the latter? I could be wrong.
I don’t actually know what it means to work on LLM alignment over aligning other systems
Ooh, I can speak to this. I’m mostly focused on technical alignment for actor-critic model-based RL systems (a big category including MuZero and [I argue] human brains). And FWIW my experience is: there are tons of papers & posts on alignment that assume LLMs, and with rare exceptions I find them useless for the non-LLM algorithms that I’m thinking about.
As a typical example, I didn’t get anything useful out of Alignment Implications of LLM Successes: a Debate in One Act—it’s addressing a debate that I see as inapplicable to the types of AI algorithms that I’m thinking about. Ditto for the debate on chain-of-thought accuracy vs steganography and a zillion other things.
When we get outside technical alignment to things like “AI control”, governance, takeoff speed, timelines, etc., I find that the assumption of LLMs is likewise pervasive, load-bearing, and often unnoticed.
I complain about this from time to time, for example Section 4.2 here, and also briefly here (the bullets near the bottom after “Yeah some examples would be:”).
I didn’t read it very carefully but how would you respond to the dilemma:
If the programmer has to write things like “tires are black” into the source code, then it’s totally impractical. (…pace davidad & Doug Lenat.)
If the programmer doesn’t have to write things like “tires are black” into the source code, then presumably a learning algorithm is figuring out things like “tires are black” from unlabeled data. And then you’re going to wind up with some giant data structure full of things like “ENTITY 92852384 implies ENTITY 8593483 with probability 0.36”. And then we have an alignment problem because the AI’s goals will be defined in terms of these unlabeled entities which are hard to interpret, and where it’s hard to guess how they’ll generalize after reflection, distributional shifts, etc.
I’m guessing you’re in the second bullet but I’m not sure how you’re thinking about this alignment concern.
Yeah, I think the §3.3.1 pattern (intrinsic surprisingness) is narrower than the §3.3.4 pattern (intrinsic surprisingness but with an ability to make medium-term predictions).
But they tend to go together so much in practice (life experience) that when we see the former we generally kinda assume the latter. An exception might be, umm, a person spasming, or having a seizure? Or a drunkard wandering about randomly? Hmm, maybe those don’t count because there are still some desires, e.g. the drunkard wants to remain standing.
I agree that agency / life-force has a strong connotation of the §3.3.4 thing, not just the §3.3.1 thing. Or at least, it seems to have that connotation in my own intuitions. ¯\_(ツ)_/¯
Hmm, I still might not be following, but I’ll write something anyway. :)
Take some “concept” in your world-model, operationalized as a particular cluster C of neurons in some part of your cortex that tend to activate together.
How might we figure out what C “means”?
One part of the answer is entirely within the cortex world-model: C has particular relationships to other things in the cortex world-model, which in turn have relationships to still other things etc. Clusters of neurons related to “bird” have some connection to clusters of neurons related to “flying”. That by itself might already be enough to pin down the “meanings” of different things, just because there’s so much structure there, and we can try to match it up with structures in the world, by analogy with unsupervised machine translation. But if not…
The other part of the answer is about how the cortex world-model relates to the real world. Maybe C directly predicts some particular pattern in low-level sensory inputs. Maybe C directly activates some particular pattern in motor output. Or maybe the connection is less direct—a certain abstract pattern in the space of abstract patterns in the space of abstract patterns in the space of low-level sensory inputs, or whatever. If we look at naturalistic visual inputs that directly or indirectly trigger C, and they’re disproportionately pictures of clocks, then that’s some evidence that C “means” clock.
So, how about “cold”? Our body has a couple relevant sensors: peripheral nerves that express TRPM8 (“cold and menthol receptor 1”), hypothalamus neurons that detect blood temperature via TRPV1, etc. (I’m not an expert on the details.) As usual, these sensory signals are processed in two areas in parallel. In the hypothalamus & brainstem (“Steering Subsystem”), they trigger innate reactions like shivering, unpleasant feelings / desire to warm up, and so on. And in the cortex, they’re treated as just so many more channels of unlabeled input data that the world-model needs to predict.
In the course of predicting them well, the world-model invents some slightly-higher-level concept (or family of closely-interlinked concepts) that we call “cold”. And it notices and memorizes predictively-useful relationships between this new “cold” concept and other things in the world-model, e.g. shivering and ice.
I don’t think there’s more to the concept “cold” than the sum total of its associations with every other concept, with sensory input, and with motor output. And we can explain those latter associations via the structure of the world and body in conjunction with a learning algorithm running throughout your life experience.
You can sorta write code for a relevant part of what’s happening in the mind when e.g. the freezing emotion/sensation is triggered.
I like to draw the distinction between understanding learning algorithms and understanding trained models. The former is kinda like what you learn in an ML course (gradient descent, training data, etc.), the latter is kinda like what you learn in a mechanistic interpretability paper. I don’t think it’s realistic to “write code” for the “cold” concept, because I think it (like all concepts) emerges at the trained model level. It emerges from a learning algorithm, training environment, loss function, etc.
Of course, we can chat about the trained model level to some extent. Why is “cold” associated with shivering? Because in the training environment of life experience, those two things have tended to go together, such that each provides nonzero Bayesian evidence that the other should be active, or will be soon. Ditto with the connection between cold and ice cream, and everything else. So we can chat about it, but it would take forever to directly write code for all those things. Hence the learning algorithm. Does that help?
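If it helps, here’s a toy sketch of the kind of statistical association I have in mind (my own illustration, obviously not a brain model): “cold” and “shivering” get linked just because they co-occur in the data, with nobody writing “cold implies shivering” into any source code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                                       # "moments" of life experience
cold = rng.random(n) < 0.2                       # some moments are cold
shiver = np.where(cold, rng.random(n) < 0.7,     # shivering is common when cold...
                        rng.random(n) < 0.01)    # ...and rare otherwise

# Each concept provides nonzero Bayesian evidence about the other:
print(f"P(shiver) = {shiver.mean():.2f}")
print(f"P(shiver | cold) = {shiver[cold].mean():.2f}")
# A learning algorithm tracking these statistics will associate the two concepts.
```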
I disagree with “He seems to have no inside information.” He presented himself as having no inside information, but that’s presumably how he would have presented himself regardless of whether he had inside information or not. It’s not like he needed to convince others that he knows what he’s doing, like how in the stock market you want to buy then pump then sell. This is different—it’s a market that’s about to resolve. The smart play from his perspective would be to aggressively trash-talk his own competence, to lower the price in case he wants to buy more.
Possibly related: Could we use current AI methods to understand dolphins? + comments
Hmm, I think the point I’m trying to make is: it’s dicey to have a system S that’s being continually modified to systematically reduce some loss L, but then we intervene to edit S in a way that increases L. We’re kinda fighting against the loss-reducing mechanism (be it gradient descent or bankroll-changes or whatever), hoping that the loss-reducing mechanism won’t find a “repair” that works around our interventions.
In that context, my presumption is that an AI will have some epistemic part S that’s continually modified to produce correct objective understanding of the world, including correct anticipation of the likely consequences of actions. The loss L for that part would probably be self-supervised learning, but could also include self-consistency or whatever.
And then I’m interpreting you (maybe not correctly?) as proposing that we should consider things like making the AI have objectively incorrect beliefs about (say) bioweapons, and I feel like that’s fighting against this L in that dicey way.
Whereas your Q-learning example doesn’t have any problem with fighting against a loss function, because Q(S,A) is being consistently and only updated by the reward.
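(For reference, the kind of tabular update I have in mind here—a generic sketch, not your specific proposal:)

```python
# Generic tabular Q-learning update. `Q` is assumed to be a dict-of-dicts mapping
# states to {action: value}. The reward r is the only training signal; there's no
# separate "epistemic" loss for an intervention to fight against.
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```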
The above is inapplicable to LLMs, I think. (And this seems tied IMO to the fact that LLMs can’t do great novel science yet etc.) But it does apply to FixDT.
Specifically, for things like FixDT, if there are multiple fixed points (e.g. I expect to stand up, and then I stand up, and thus the prediction was correct), then whatever process you use to privilege one fixed point over another, you’re not fighting against the above L (i.e., the “epistemic” loss L based on self-supervised learning and/or self-consistency or whatever). L is applying no force either way. It’s a wide-open degree of freedom.
(If your response is “L incentivizes fixed-points that make the world easier to predict”, then I don’t think that’s a correct description of what such a learning algorithm would do.)
So if your feedback proposal exclusively involves a mechanism that privileges one fixed point over another, then I have no complaints, and would describe it as choosing a utility function (preferences not beliefs) within the FixDT framework.
Btw I think we’re in agreement that there should be some mechanism privileging one fixed point over another, instead of ignoring it and just letting the underdetermined system do whatever it does.
Oh, I want to set that problem aside because I don’t think you need an arbitrarily rich hypothesis space to get ASI. The agency comes from the whole AI system, not just the “epistemic” part, so the “epistemic” part can be selected from a limited model class, as opposed to running arbitrary computations etc. For example, the world model can be “just” a Bayes net, or whatever. We’ve talked about this before.
I also learned the term observation-utility agents from you :) You don’t think that can solve those problems (in principle)?
I’m probably misunderstanding you here and elsewhere, but enjoying the chat, thanks :)