I didn’t read it very carefully but how would you respond to the dilemma:
If the programmer has to write things like “tires are black” into the source code, then it’s totally impractical. (…pace davidad & Doug Lenat.)
If the programmer doesn’t have to write things like “tires are black” into the source code, then presumably a learning algorithm is figuring out things like “tires are black” from unlabeled data. And then you’re going to wind up with some giant data structure full of things like “ENTITY 92852384 implies ENTITY 8593483 with probability 0.36”. And then we have an alignment problem because the AI’s goals will be defined in terms of these unlabeled entities which are hard to interpret, and where it’s hard to guess how they’ll generalize after reflection, distributional shifts, etc.
I’m guessing you’re in the second bullet but I’m not sure how you’re thinking about this alignment concern.
If you had a system with “ENTITY 92852384 implies ENTITY 8593483”, that would already be a lot of progress, as currently in neural networks we don’t even understand the internal structures.
I want to have an algorithm that creates a world model. The world is large. A world model is uninterpretable by default through its sheer size, even if you had interpretable but low-level labels. By default we don’t get any interpretable labels. I think there are ways to have generic data-processing procedures that don’t talk about the human mind at all, but that would yield a more interpretable world model. Similar to how you could probably specify some very general property of Python programs, such that a program with that property becomes easier for humans to understand. E.g. a formalization of what it means for the control flow to be straightforward, analogous to “don’t use goto in C”.
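To make that concrete, here is a minimal sketch of such a mind-agnostic structural property (the metric and names are hypothetical, chosen only to illustrate the idea): score a Python program by the nesting depth of its control flow, without ever referencing human psychology.

```python
import ast

def control_flow_score(source: str) -> int:
    """Toy structural metric: maximum nesting depth of control flow.

    Hypothetical example: the metric references only program syntax,
    not the human mind, yet lower scores tend to mean more readable
    code (analogous to "don't use goto in C").
    """
    def depth(node: ast.AST, d: int = 0) -> int:
        children = list(ast.iter_child_nodes(node))
        if not children:
            return d
        # Entering an If/For/While adds one level of nesting.
        step = d + isinstance(node, (ast.If, ast.For, ast.While))
        return max(depth(child, step) for child in children)

    return depth(ast.parse(source))

flat = "x = 1\ny = 2\nz = x + y\n"
nested = (
    "for i in range(3):\n"
    "    for j in range(3):\n"
    "        if i < j:\n"
    "            print(i)\n"
)
```

A world-model analogue would optimize learned structures against some battery of generic measures like this one.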
But even if you didn’t have this, understanding the system still allows you to understand what the structure of the knowledge would be. It seems plausible that, simply by understanding the system very well, one could make it such that the learned data structures need to take particular shapes, such that these shapes correspond to some relevant alignment properties.
In any case, it seems that this is a problem that any possible way to build an intelligence runs into? So I don’t think it is an argument against the project. When building an AI with NNs you might not even consider that the internal representations might be weird and alien (even for an LLM trained on human text)[1], but the same problem persists.
See Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc, including my comment on it. If your approach would lead to a world-model that is an uninterpretable inscrutable mess, and LLM research would lead to a world-model that is an even more uninterpretable, even more inscrutable mess, then I don’t think this is a reason to push forward on your approach, without a good alignment plan.
Yes, it’s a pro tanto reason to prefer your approach, other things equal. But it’s a very minor reason. And other things are not equal. On the contrary, there are a bunch of important considerations plausibly pushing in the opposite direction:
Maybe LLMs will plateau anyway, so the comparison between inscrutable versus even-more-inscrutable is a moot point. And then you’re just doing AGI capabilities research for no safety benefit at all. (See “Endgame safety” for AGI.)
LLMs at least arguably have some safety benefits related to reliance on human knowledge, human concepts, and chains-of-thought, whereas the kind of AGI you’re trying to invent might not have those.
Your approach would (if “successful”) be much, much more compute-efficient—probably by orders of magnitude—see Section 3 here for a detailed explanation of why. This is bad because, if AGI is very compute-efficient, then when we have AGI at all, we will have AGI that a great many actors around the world will be able to program and run, and that makes governance very much harder. (Related: I for one think AGI is possible on a single consumer GPU, see here.)
Likewise, your approach would (if “successful”) have a “better” inductive bias, “better” sample efficiency, etc., because you’re constraining the search space. That suggests fast takeoff and less likelihood of a long duration of janky mediocre-human-level AGIs. I think most people would see that as net bad for safety.
In any case, it seems that this is a problem that any possible way to build an intelligence runs into? So I don’t think it is an argument against the project.
If it’s a problem for any possible approach to building AGI, then it’s an argument against pursuing any kind of AGI capabilities research! Yes! It means we should focus first on solving that problem, and only do AGI capabilities research when and if we succeed. And that’s what I believe. Right?
It seems plausible that, simply by understanding the system very well, one could make it such that the learned data structures need to take particular shapes, such that these shapes correspond to some relevant alignment properties.
I don’t think this is plausible. I think alignment properties are pretty unrelated to the low-level structure out of which a world-model is built. For example, the difference between “advising a human” versus “manipulating a human”, and the difference between “finding a great out-of-the-box solution” versus “reward hacking”, are both extremely important for alignment. But you won’t get insight into those distinctions, or how to ensure them in an AGI, by thinking about whether world-model stuff is stored as connections on graphs versus induction heads or whatever.
Anyway, if your suggestion is true, I claim you can (and should) figure that out without doing AGI capabilities research. Here’s an example. Assume that the learned data structure is a Bayes net, or some generalization of a Bayes net, or the OpenCog “AtomSpace”, or whatever. OK, now spend as long as you like thinking about what, if anything, that has to do with “alignment properties”. My guess is “very little”. Or if you come up with anything, you can share it. That’s not advancing capabilities, because people already know that there is such a thing as Bayes nets / OpenCog / whatever.
Alternatively, another concrete thing that you can chew on is: brain-like AGI. :) We already know a lot about how it works without needing to do any new capabilities research. For example, you might start with Plan for mediocre alignment of brain-like [model-based RL] AGI and think about how to make that approach better / less bad.
John’s post is quite weird, because it only says true things, yet it implicitly suggests a conclusion, namely that NNs are not less interpretable than some other thing, which is totally wrong.
Example: a neural network implements modular arithmetic with Fourier transforms. If you implement that Fourier algorithm in Python, it’s harder for a human to understand than the obvious modular-arithmetic implementation in Python.
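To illustrate the contrast, here is a sketch of both implementations side by side. (Simplified: grokked networks reportedly use only a few key frequencies, whereas this toy version sums over all of them; the function names are made up.)

```python
import math

P = 7  # a small prime modulus

def add_mod_obvious(a: int, b: int) -> int:
    return (a + b) % P

def add_mod_fourier(a: int, b: int) -> int:
    # sum_k cos(2*pi*k*(a+b-c)/P) equals P when c == (a+b) mod P
    # (all cosines align) and 0 otherwise (roots of unity cancel),
    # so the argmax over c recovers modular addition.
    def logit(c: int) -> float:
        return sum(math.cos(2 * math.pi * k * (a + b - c) / P)
                   for k in range(P))
    return max(range(P), key=logit)

# Functionally identical, yet far harder to understand at a glance:
assert all(add_mod_obvious(a, b) == add_mod_fourier(a, b)
           for a in range(P) for b in range(P))
```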
It doesn’t matter if the world model is inscrutable when looking directly at it, if you can change the generating code such that certain properties must hold. Of course, figuring out what these properties are is not directly solved by understanding intelligence.
This is bad because, if AGI is very compute-efficient, then when we have AGI at all, we will have AGI that a great many actors around the world will be able to program and run, and that makes governance very much harder.
Totally agree; so obviously one should try super hard not to leak the working AGI code if one had it.
But you won’t get insight into those distinctions, or how to ensure them in an AGI, by thinking about whether world-model stuff is stored as connections on graphs versus induction heads or whatever.
No, you can. E.g., if I know enough about the structure of the system, specifically how concepts are stored, I could theoretically define a general algorithm that identifies the minimum concepts necessary for solving a task. That’s of course not perfect, but it seems that for very many problems it would make the AI unable to think about things like human manipulation, or at least make it a constrained AI, even if that knowledge was somewhere in a learned black-box world model. This is just an example of something you can do by knowing the structure of a system.
If your system is some plain code with for loops, just reduce the number of iterations the for loops of search processes do. Now decreasing/increasing the iterations somewhat will correspond to making the system dumber/smarter. Again, obviously not solving the problem completely, but clearly a powerful thing to be able to do.
Of course many low-level details do not matter. Often you’d only care that something is a sequence, or a set. I am talking about higher-level program structure.
It feels like you are somewhat missing the point. The goal is to understand how intelligence works. Clearly that would be very useful for alignment? Even if you got a black-box world model. But of course it would also enable you to think about how to make such a world model more interpretable. I think that is possible; it’s just not what I am focusing on now.
John’s post is quite weird, because it only says true things, yet it implicitly suggests a conclusion, namely that NNs are not less interpretable than some other thing, which is totally wrong.
Example: a neural network implements modular arithmetic with Fourier transforms. If you implement that Fourier algorithm in Python, it’s harder for a human to understand than the obvious modular-arithmetic implementation in Python.
Again see my comment. If an LLM does Task X with a trillion unlabeled parameters and (some other thing) does the same Task X with “only” a billion unlabeled parameters, then both are inscrutable.
Your example of modular arithmetic is not a central example of what we should expect to happen, because “modular arithmetic in python” has zero unlabeled parameters. Realistically, an AGI won’t be able to accomplish any real-world task at all with zero unlabeled parameters.
I propose that a more realistic example would be “classifying images via a ConvNet with 100,000,000 weights” versus “classifying images via 5,000,000 lines of Python code involving 1,000,000 nonsense variable names”. The latter is obviously less inscrutable on the margin but it’s not a huge difference.
The goal is to understand how intelligence works. Clearly that would be very useful for alignment?
If “very useful for alignment” means “very useful for doing technical alignment research”, then yes, clearly.
If “very useful for alignment” means “increases our odds of winding up with aligned AGI”, then no, I don’t think it’s true, let alone “clearly” true.
If you don’t understand how something can simultaneously be both very useful for doing technical alignment research and decrease our odds of winding up with aligned AGI, here’s a very simple example. Suppose I posted the source code for misaligned ASI on github tomorrow. “Clearly that would be very useful” for doing technical alignment research, right? Who could disagree with that? It would open up all sorts of research avenues. But it would also obviously doom us all.
E.g., if I know enough about the structure of the system, specifically how concepts are stored, I could theoretically define a general algorithm that identifies the minimum concepts necessary for solving a task. That’s of course not perfect, but it seems that for very many problems it would make the AI unable to think about things like human manipulation, or at least make it a constrained AI, even if that knowledge was somewhere in a learned black-box world model.
There’s a very basic problem that instrumental convergence is convergent because it’s actually useful. If you look at the world and try to figure out the best way to design a better solar cell, that best way involves manipulating humans (to get more resources to run more experiments etc.).
Humans are part of the environment. If an algorithm can look at a street and learn that there’s such a thing as cars, the very same algorithm will learn that there’s such a thing as humans. And if an algorithm can autonomously figure out how an engine works, the very same algorithm can autonomously figure out human psychology.
You could remove humans from the training data, but that leads to its own problems, and anyway, you don’t need to “understand intelligence” to recognize that as a possibility (e.g. here’s a link to some prior discussion of that).
Or you could try to “find” humans and human manipulation in the world-model, but then we have interpretability challenges.
Or you could assume that “humans” were manually put into the world-model as a separate module, but then we have the problem that world-models need to be learned from unlabeled data for practical reasons, and humans could also show up in the other modules.
Anyway, it’s fine to brainstorm on things like this, but I claim that you can do that brainstorming perfectly well by assuming that the world model is a Bayes net (or use OpenCog AtomSpace, or Soar, or whatever), or even just talk about it generically.
If your system is some plain code with for loops, just reduce the number of iterations the for loops of search processes do. Now decreasing/increasing the iterations somewhat will correspond to making the system dumber/smarter. Again, obviously not solving the problem completely, but clearly a powerful thing to be able to do.
I’m 100% confident that, whatever AGI winds up looking like, “we could just make it dumber” will be on the table as an option. We can give it less time to find a solution to a problem, and then the solution it finds (if any) will be worse. We can give it less information to go on. Etc.
You don’t have to “understand intelligence” to recognize that we’ll have options like that. It’s obvious. That fact doesn’t come up very often in conversation because it’s not all that useful for getting to Safe and Beneficial AGI.
Again, if you assume the world model is a Bayes net (or use OpenCog AtomSpace, or Soar), I think you can do all the alignment thinking and brainstorming that you want to do, without doing new capabilities research. And I think you’d be more likely (well, less unlikely) to succeed anyway.
The goal is ideally to have a system where there are no unlabeled parameters. That would be the world-modeling system. It would then build a world model that would have many unlabeled parameters. By understanding the world-modeler system you can ensure that the world model has certain properties. E.g. there is presumably some property (which I don’t yet know) that makes the world model not contain dangerous minds.
E.g. imagine the AI is really good at world modeling, and now it models you (you are part of the world) so accurately that you are basically copied into the AI. Now you might try to escape the AI, which would actually be really good, because then you could save the world as a speed intelligence (assuming the model of you were really accurate, which it probably wouldn’t be). But if it models another mind (maybe it considers dangerous adversaries), then maybe that mind could also escape, and would not be aligned.
By understanding the system you could put constraints on what world models can be generated, such that no generated world model can contain such dangerous minds, or at least such minds become much less likely.
I propose that a more realistic example would be “classifying images via a ConvNet with 100,000,000 weights” versus “classifying images via 5,000,000 lines of Python code involving 1,000,000 nonsense variable names”. The latter is obviously less inscrutable on the margin but it’s not a huge difference.
Python code is a discrete structure. You can do proofs on it more easily than on a NN. You could try to apply program transformations to it that preserve functional equality, optimizing for some measure of “human-understandable structure”. There are image-classification algorithms, iirc, that are worse than NNs but much more interpretable, and these algorithms would be at most hundreds of lines of code, I guess (I haven’t really looked a lot at them).
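As a toy instance of such a semantics-preserving transformation, here is a sketch (assuming Python 3.9+ for `ast.unparse`) that folds constant arithmetic; a “readability optimizer” would search over many rewrites like this, scoring each candidate by some structural-simplicity measure:

```python
import ast

class FoldConstants(ast.NodeTransformer):
    """One semantics-preserving rewrite: evaluate constant arithmetic."""

    def visit_BinOp(self, node: ast.BinOp) -> ast.AST:
        self.generic_visit(node)  # fold the leaves first (bottom-up)
        if isinstance(node.left, ast.Constant) and isinstance(node.right, ast.Constant):
            expr = ast.Expression(body=node)
            ast.fix_missing_locations(expr)
            value = eval(compile(expr, "<fold>", "eval"))  # constants only
            return ast.copy_location(ast.Constant(value), node)
        return node

src = "y = (2 * 3 + 4) * x"
folded = ast.unparse(FoldConstants().visit(ast.parse(src)))
```

`folded` comes out as `y = 10 * x`: functionally identical to `src`, but structurally simpler.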
Anyway, it’s fine to brainstorm on things like this, but I claim that you can do that brainstorming perfectly well by assuming that the world model is a Bayes net (or use OpenCog AtomSpace, or Soar, or whatever), or even just talk about it generically.
You give examples of recognizing problems. I tried to give examples of how you can solve these problems. I’m not brainstorming on “how could this system fail”. Instead I understand something, and then I just notice, without really trying, that now I can do a thing that seems very useful, like making the system not think about human psychology given certain constraints.
Probably I completely failed at making clear why I think that, because my explanation was terrible. In any case, I think your suggested brainstorming is completely different from the thing that I am actually doing.
To me it just seems that limiting the depth of a tree search is better than limiting the compute of a black-box neural network. It seems like you can get a much better grip on what it means to limit the depth, and what this implies about the system’s behavior, when you actually understand how tree search works. Of course, tree search here is only an example.
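For concreteness, a minimal depth-limited search where the capability knob is explicit (a toy sketch; the game tree, scores, and names are all made up for illustration):

```python
def best_move(state, moves, result, score, depth: int):
    """Pick a move by negamax search truncated at `depth`.

    `depth` is an explicit, legible capability knob: lowering it
    provably bounds how far ahead the system can plan.
    """
    def negamax(s, d: int) -> float:
        ms = moves(s)
        if d == 0 or not ms:
            return score(s)
        return max(-negamax(result(s, m), d - 1) for m in ms)

    return max(moves(state), key=lambda m: -negamax(result(state, m), depth - 1))

# Hypothetical two-ply game tree with hand-picked leaf scores.
tree = {"root": ["a", "b"], "a": ["a1"], "b": ["b1"]}
values = {"a": 5, "b": 0, "a1": 10, "b1": -2}

moves = lambda s: tree.get(s, [])
result = lambda s, m: m
score = lambda s: values[s]
```

With these made-up scores, depth 1 prefers “b” while depth 2 flips to “a”: turning the knob visibly and predictably changes behavior.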
The goal is ideally to have a system where there are no unlabeled parameters. That would be the world-modeling system. It would then build a world model that would have many unlabeled parameters.
Yup, this is what we’re used to today:
there’s an information repository,
there’s a learning algorithm that updates the information repository,
there’s an inference algorithm that queries the information repository,
both the learning algorithm and the inference algorithm consist of legible code written by humans, with no inscrutable unlabeled parameters,
the high-dimensional space [or astronomically-large set, if it’s discrete] of all possible configurations of the information repository is likewise defined by legible code written by humans, with no inscrutable unlabeled parameters,
the only inscrutable unlabeled parameters are in the content of the information repository, after the learning algorithm has been running for a while.
So for example, in LLM pretraining, the learning algorithm is backprop, the inference algorithm is a forward pass, and the information repository is the weights of a transformer-architecture neural net. There’s nothing inscrutable about backprop, nor about a forward pass. We fully understand what those are doing and how. Backprop calculates the gradient, etc.
That’s just one example. There are many other options! The learning algorithm could involve TD learning. The inference algorithm could involve tree search, or MCMC, or whatever. The information repository could involve a learned value function and/or a learned policy and/or a learned Bayes net and/or a learned OpenCog AtomSpace or whatever. But in all cases, those six bullets above are valid.
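A tiny concrete instance of this division of labor, with legible TD learning and greedy inference around a table whose contents are the only learned part (an illustrative sketch, not any particular system; the state names are made up):

```python
# The information repository: a value table. Its *format* is fully
# legible; only the numbers that accumulate inside it are learned content.
values: dict[str, float] = {}

def learn(state: str, reward: float, next_state: str,
          alpha: float = 0.1, gamma: float = 0.9) -> None:
    """Learning algorithm: a TD(0) update. Short, legible, human-written."""
    v = values.get(state, 0.0)
    target = reward + gamma * values.get(next_state, 0.0)
    values[state] = v + alpha * (target - v)

def act(options: dict[str, str]) -> str:
    """Inference algorithm: greedily pick the action whose successor
    state has the highest stored value. Also fully legible."""
    return max(options, key=lambda a: values.get(options[a], 0.0))

# After some experience, only the repository's *contents* are opaque:
for _ in range(100):
    learn("sunny", 1.0, "end")
    learn("stormy", -1.0, "end")
```

Here `learn` and `act` satisfy the bullets above exactly; everything inscrutable lives inside `values`.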
So anyway, this is already how ML works, and I’m very confident that it will remain true until TAI, for reasons here. And this is a widespread consensus.
By understanding the world-modeler system you can ensure that the world model has certain properties. E.g. there is presumably some property (which I don’t yet know) that makes the world model not contain dangerous minds.
There’s a very obvious failure mode in which: the world-model models the world, and the planner plans, and the value function calculates values, etc. … and at the end of all that, the AI system as a whole hatches and executes a plan to wipe out humanity. The major unsolved problem is: how do we confidently avoid that?
Then separately, there’s a different, weird, exotic type of failure mode, where, for example, there’s a full-fledged AGI agent, one that can do out-of-the-box foresighted planning etc., but this agent is not working within the designed AGI architecture (where the planner plans etc. as above), but rather the whole agent is hiding entirely within the world-model. I think that, in this kind of system, the risk of this exotic failure mode is very low, and can be straightforwardly mitigated to become even lower still. I wrote about it a long time ago at Thoughts on safety in predictive learning.
I really think we should focus first and foremost on the very obvious failure mode, which again is an unsolved problem that is very likely to manifest, and we should put aside the weird exotic failure mode at least until we’ve solved the big obvious one.
When we put aside the exotic failure mode and focus on the main one, then we’re no longer worried about “the world model contains dangerous minds”, but rather we’re worried about “something(s) in the world model has been flagged as desirable, that shouldn’t have been flagged as desirable”. This is a hard problem not only because of the interpretability issue (I think we agree that the contents of the world-model are inscrutable, and I hope we agree that those inscrutable contents will include both good things and bad things), but also because of concept extrapolation / goal misgeneralization (i.e., the AGI needs to have opinions about plans that bring it somewhere out of distribution). It’s great if you want to think about that problem, but you don’t need to “understand intelligence” for that, you can just assume that the world-model is a Bayes net or whatever, and jump right in! (Maybe start here!)
To me it just seems that limiting the depth of a tree search is better than limiting the compute of a black-box neural network. It seems like you can get a much better grip on what it means to limit the depth, and what this implies about the system’s behavior, when you actually understand how tree search works. Of course, tree search here is only an example.
Right, but the ability to limit the depth of a tree search is basically useless for getting you to safe and beneficial AGI, because you don’t know the depth that allows dangerous plans, nor do you know that dangerous plans won’t actually be simpler (less depth) than intended plans. This is a very general problem. This problem applies equally well to limiting the compute of a black box, limiting the number of steps of MCMC, limiting the amount of (whatever OpenCog AtomSpace does), etc.
[You can also potentially use tree search depth to try to enforce guarantees about myopia, but that doesn’t really work for other reasons.]
Python code is a discrete structure. You can do proofs on it more easily than on a NN. You could try to apply program transformations to it that preserve functional equality, optimizing for some measure of “human-understandable structure”. There are image-classification algorithms, iirc, that are worse than NNs but much more interpretable, and these algorithms would be at most hundreds of lines of code, I guess (I haven’t really looked a lot at them).
“Hundreds of lines” is certainly wrong because you can recognize easily tens of thousands of distinct categories of visual objects. Probably hundreds of thousands.
Proofs sound nice, but what do you think you can realistically prove that will help with Safe and Beneficial AGI? You can’t prove things about what AGI will do in the real world, because the real world will not be encoded in your formal proof system. (pace davidad).
“Applying program transformations that optimize for human understandable structure” sounds nice, but only gets you to “inscrutable” from “even more inscrutable”. The visual world is complex. The algorithm can’t be arbitrarily simple, while still capturing that complexity. Cf. “computational irreducibility”.
I’m not brainstorming on “how could this system fail”. Instead I understand something, and then I just notice, without really trying, that now I can do a thing that seems very useful, like making the system not think about human psychology given certain constraints.
What I’m trying to do in this whole comment is point you towards various “no-go theorems” that Eliezer probably figured out in 2006 and put onto Arbital somewhere.
Here’s an analogy. It’s appealing to say: “I don’t understand string theory, but if I did, then I would notice some new obvious way to build a perpetual motion machine.”. But no, you won’t. We can rule out perpetual motion machines from very general principles that don’t rely on how string theory works.
By the same token, it’s appealing to say: “I don’t understand intelligence, but if I did, then I would notice some new obvious way to guarantee that an AGI won’t try to manipulate humans.”. But no, you won’t. There are deep difficulties that we know you’re going to run into, based on very general principles that don’t rely on the data format for the world-model etc.
I suggest thinking harder about the shape of the solution—getting all the way to Safe & Beneficial AGI. I think you’ll come to realize that figuring out the data format for the world-model etc. is not only dangerous (because it’s AGI capabilities research) but doesn’t even help appreciably with safety anyway.
[1] I haven’t looked into this, or thought about it at all, though that’s what I expect.
See Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc, including my comment on it. If your approach would lead to a world-model that is an uninterpretable inscrutable mess, and LLM research would lead to a world-model that is an even more uninterpretable, even more inscrutable mess, then I don’t think this is a reason to push forward on your approach, without a good alignment plan.
Yes, it’s a pro tanto reason to prefer your approach, other things equal. But it’s a very minor reason. And other things are not equal. On the contrary, there are a bunch of important considerations plausibly pushing in the opposite direction:
Maybe LLMs will plateau anyway, so the comparison between inscrutable versus even-more-inscrutable is a moot point. And then you’re just doing AGI capabilities research for no safety benefit at all. (See “Endgame safety” for AGI.)
LLMs at least arguably have some safety benefits related to reliance on human knowledge, human concepts, and chains-of-thought, whereas the kind of AGI you’re trying to invent might not have those.
Your approach would (if “successful”) be much, much more compute-efficient—probably by orders of magnitude—see Section 3 here for a detailed explanation of why. This is bad because, if AGI is very compute-efficient, then when we have AGI at all, we will have AGI that a great many actors around the world will be able to program and run, and that makes governance very much harder. (Related: I for one think AGI is possible on a single consumer GPU, see here.)
Likewise, your approach would (if “successful”) have a “better” inductive bias, “better” sample efficiency, etc., because you’re constraining the search space. That suggests fast takeoff and less likelihood of a long duration of janky mediocre-human-level AGIs. I think most people would see that as net bad for safety.
If it’s a problem for any possible approach to building AGI, then it’s an argument against pursuing any kind of AGI capabilities research! Yes! It means we should focus first on solving that problem, and only do AGI capabilities research when and if we succeed. And that’s what I believe. Right?
I don’t think this is plausible. I think alignment properties are pretty unrelated to the low-level structure out of which a world-model is built. For example, the difference between “advising a human” versus “manipulating a human”, and the difference between “finding a great out-of-the-box solution” versus “reward hacking”, are both extremely important for alignment. But you won’t get insight into those distinctions, or how to ensure them in an AGI, by thinking about whether world-model stuff is stored as connections on graphs versus induction heads or whatever.
Anyway, if your suggestion is true, I claim you can (and should) figure that out without doing AGI capabilities research. Here’s an example. Assume that the the learned data structure is a Bayes net, or some generalization of a Bayes net, or the OpenCog “AtomSpace”, or whatever. OK, now spend as long as you like thinking about what if anything that has to do with “alignment properties”. My guess is “very little”. Or if you come up with anything, you can share it. That’s not advancing capabilities, because people already know that there is such a thing as Bayes nets / OpenCog / whatever.
Alternatively, another concrete thing that you can chew on is: brain-like AGI. :) We already know a lot about how it works without needing to do any new capabilities research. For example, you might start with Plan for mediocre alignment of brain-like [model-based RL] AGI and think about how to make that approach better / less bad.
John’s post is quite wierd, because it only says true things, and implicitly implies a conclusion, namely that NNs are not less interpretable than some other thing, which is totally wrong.
Example: A neural network implements modular arithmetic with furier transforms. If you implement that furier algorithm in python, it’s harder to understand for a human than the obvious modular arithmetic implementation in python.
It doesn’t matter if the world model is inscruitable when looking directly at it, if you can change the generating code such that certain properties must hold. Figuring out what these properties is not directly solved by understading intelligence of cause.
This is bad because, if AGI is very compute-efficient, then when we have AGI at all, we will have AGI that a great many actors around the world will be able to program and run, and that makes governance very much harder.
Totally agree, so obviously try super hard not to leak the working AGI code if you had it.
No, you can. E.g. if I know enough about the structure of the system, specifically how concepts are stored, I could theoretically define a general algorithm that identifies the minimum concepts necessary for solving a task. That’s of course not perfect, but it seems that for very many problems it would make the AI unable to think about things like human manipulation, or about the fact that it is a constrained AI, even if that knowledge was somewhere in a learned black-box world model. This is just an example of something you can do by knowing the structure of a system.
If your system is some plain code with for loops, just reduce the number of iterations the for loops of search processes perform. Now decreasing/increasing the iterations somewhat will correspond to making the system dumber/smarter. Again, this obviously doesn’t solve the problem completely, but it’s clearly a powerful thing to be able to do.
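As a minimal sketch of that knob (the names and setup are mine): a plain search loop where a single integer budget controls how much optimization happens.

```python
def bounded_argmax(score, candidates, budget):
    """Evaluate at most `budget` candidates and return the best one seen.
    Lowering `budget` is a direct, legible way to make the search weaker;
    raising it makes the search stronger."""
    best = None
    for i, candidate in enumerate(candidates):
        if i >= budget:
            break  # the capability knob: stop searching here
        if best is None or score(candidate) > score(best):
            best = candidate
    return best
```

With a generous budget the search finds the true optimum; with a small one it settles for whatever it saw early.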
Of course many low-level details do not matter. Often you’d only care that something is a sequence, or a set. I am talking about higher-level program structure.
It feels like you are somewhat missing the point. The goal is to understand how intelligence works. Clearly that would be very useful for alignment? Even if you would get a black-box world model. But of course it would also enable you to think about how to make such a world model more interpretable. I think that is possible; it’s just not what I am focusing on now.
Again see my comment. If an LLM does Task X with a trillion unlabeled parameters and (some other thing) does the same Task X with “only” a billion unlabeled parameters, then both are inscrutable.
Your example of modular arithmetic is not a central example of what we should expect to happen, because “modular arithmetic in python” has zero unlabeled parameters. Realistically, an AGI won’t be able to accomplish any real-world task at all with zero unlabeled parameters.
I propose that a more realistic example would be “classifying images via a ConvNet with 100,000,000 weights” versus “classifying images via 5,000,000 lines of Python code involving 1,000,000 nonsense variable names”. The latter is obviously less inscrutable on the margin but it’s not a huge difference.
If “very useful for alignment” means “very useful for doing technical alignment research”, then yes, clearly.
If “very useful for alignment” means “increases our odds of winding up with aligned AGI”, then no, I don’t think it’s true, let alone “clearly” true.
If you don’t understand how something can simultaneously both be very useful for doing technical alignment research and decrease our odds of winding up with aligned AGI, here’s a very simple example. Suppose I posted the source code for misaligned ASI on github tomorrow. “Clearly that would be very useful” for doing technical alignment research, right? Who could disagree with that? It would open up all sorts of research avenues. But it would also obviously doom us all.
For more on this topic, see my post “Endgame safety” for AGI.
There’s a very basic problem that instrumental convergence is convergent because it’s actually useful. If you look at the world and try to figure out the best way to design a better solar cell, that best way involves manipulating humans (to get more resources to run more experiments etc.).
Humans are part of the environment. If an algorithm can look at a street and learn that there’s such a thing as cars, the very same algorithm will learn that there’s such a thing as humans. And if an algorithm can autonomously figure out how an engine works, the very same algorithm can autonomously figure out human psychology.
You could remove humans from the training data, but that leads to its own problems, and anyway, you don’t need to “understand intelligence” to recognize that as a possibility (e.g. here’s a link to some prior discussion of that).
Or you could try to “find” humans and human manipulation in the world-model, but then we have interpretability challenges.
Or you could assume that “humans” were manually put into the world-model as a separate module, but then we have the problem that world-models need to be learned from unlabeled data for practical reasons, and humans could also show up in the other modules.
Anyway, it’s fine to brainstorm on things like this, but I claim that you can do that brainstorming perfectly well by assuming that the world model is a Bayes net (or use OpenCog AtomSpace, or Soar, or whatever), or even just talk about it generically.
I’m 100% confident that, whatever AGI winds up looking like, “we could just make it dumber” will be on the table as an option. We can give it less time to find a solution to a problem, and then the solution it finds (if any) will be worse. We can give it less information to go on. Etc.
You don’t have to “understand intelligence” to recognize that we’ll have options like that. It’s obvious. That fact doesn’t come up very often in conversation because it’s not all that useful for getting to Safe and Beneficial AGI.
Again, if you assume the world model is a Bayes net (or use OpenCog AtomSpace, or Soar), I think you can do all the alignment thinking and brainstorming that you want to do, without doing new capabilities research. And I think you’d be more likely (well, less unlikely) to succeed anyway.
The goal is to have a system that ideally has no unlabeled parameters. That would be the world-modeling system. It would then build a world model that has many unlabeled parameters. By understanding the world-modeler system, you can ensure that the world model has certain properties. E.g. there is some property (which I don’t know) of how to make the world model not contain dangerous minds.
E.g. imagine the AI is really good at world modeling, and now it models you (you are part of the world) so accurately that you are basically copied into the AI. Now you might try to escape the AI, which would actually be really good, because then you could save the world as a speed intelligence (assuming the model of you were really accurate, which it probably wouldn’t be). But if it models another mind (maybe it considers dangerous adversaries), then maybe that mind could also escape, and it would not be aligned.
By understanding the system you could put constraints on what world models can be generated, such that no generated world model can contain such dangerous minds, or at least such that these minds are much less likely.
Python code is a discrete structure. You can do proofs on it more easily than on a NN. You could try to apply program transformations to it that preserve functional equality, optimizing for some measure of “human-understandable structure”. There are image-classification algorithms, IIRC, that are worse than NNs but much more interpretable, and these algorithms would be at most hundreds of lines of code, I’d guess (I haven’t really looked at them much).
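As a toy illustration of a program transformation that preserves functional equality (the renaming table here is invented, and discovering such a table automatically is of course the hard interpretability problem; the transformation itself is the easy, behavior-preserving part):

```python
import ast

# Hypothetical mapping from an opaque learned name to a readable one.
RENAMES = {"v_8593483": "tire_color"}

class Rename(ast.NodeTransformer):
    """Semantics-preserving pass: renaming variables consistently
    cannot change what the program computes."""
    def visit_Name(self, node):
        node.id = RENAMES.get(node.id, node.id)
        return node

opaque_src = "v_8593483 = 'black'\nresult = v_8593483"
tree = ast.fix_missing_locations(Rename().visit(ast.parse(opaque_src)))
readable_src = ast.unparse(tree)  # requires Python 3.9+
```

The output source is functionally identical but scores better on any reasonable measure of human readability.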
You give examples of recognizing problems. I tried to give examples of how you can solve these problems. I’m not brainstorming on “how could this system fail”. Instead I understand something, and then I notice, without really trying, that I can now do a thing that seems very useful, like making the system not think about human psychology given certain constraints.
Probably I completely failed at making clear why I think that, because my explanation was terrible. In any case, I think your suggested brainstorming is completely different from the thing that I am actually doing.
To me it just seems that limiting the depth of a tree search is better than limiting the compute of a black-box neural network. It seems like you can get a much better grip on what it means to limit the depth, and what this implies about the system’s behavior, when you actually understand how tree search works. Of course tree search here is only an example.
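A sketch of that example (the game tree and the evaluation function are invented): depth-limited minimax, where the depth parameter has a crisp meaning, namely that the search cannot see consequences more than `depth` moves ahead.

```python
def static_eval(node):
    """Crude stand-in evaluation for positions the search can't expand:
    just average the payoffs of the leaves below."""
    if isinstance(node, list):
        return sum(static_eval(child) for child in node) / len(node)
    return node

def minimax(node, depth, maximizing=True):
    """Depth-limited minimax over a game tree given as nested lists,
    with numeric payoffs at the leaves. `depth` is an interpretable
    capability knob, unlike a raw FLOP budget on a black box."""
    if not isinstance(node, list) or depth == 0:
        return static_eval(node)
    values = [minimax(child, depth - 1, not maximizing) for child in node]
    return max(values) if maximizing else min(values)
```

On the tree `[[3, 5], [2, 9]]`, a depth-2 search (which sees the opponent’s replies) values the position at 3, while a depth-1 search naively values it at 5.5: you can reason precisely about what shrinking the knob does.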
Yup, this is what we’re used to today:
there’s an information repository,
there’s a learning algorithm that updates the information repository,
there’s an inference algorithm that queries the information repository,
both the learning algorithm and the inference algorithm consist of legible code written by humans, with no inscrutable unlabeled parameters,
the high-dimensional space [or astronomically-large set, if it’s discrete] of all possible configurations of the information repository is likewise defined by legible code written by humans, with no inscrutable unlabeled parameters,
the only inscrutable unlabeled parameters are in the content of the information repository, after the learning algorithm has been running for a while.
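Those bullets can be sketched in a few lines (the names are mine, standing in for any concrete choice of algorithms):

```python
# Legible, human-written learning and inference code; all learned,
# potentially inscrutable content is confined to `repository`.

class Model:
    def __init__(self):
        self.repository = {}  # the only place unlabeled parameters live

def learn(model, example):
    """Human-written update rule (stand-in for backprop, TD learning, ...)."""
    key, value = example
    model.repository[key] = value

def infer(model, query):
    """Human-written query rule (stand-in for a forward pass, tree search, ...)."""
    return model.repository.get(query)
```

Swapping in different concrete learning or inference algorithms changes the details but not the shape: the inscrutability lives in the repository’s contents, not in the code around it.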
So for example, in LLM pretraining, the learning algorithm is backprop, the inference algorithm is a forward pass, and the information repository is the weights of a transformer-architecture neural net. There’s nothing inscrutable about backprop, nor about a forward pass. We fully understand what those are doing and how. Backprop calculates the gradient, etc.
That’s just one example. There are many other options! The learning algorithm could involve TD learning. The inference algorithm could involve tree search, or MCMC, or whatever. The information repository could involve a learned value function and/or a learned policy and/or a learned Bayes net and/or a learned OpenCog AtomSpace or whatever. But in all cases, those six bullets above are valid.
So anyway, this is already how ML works, and I’m very confident that it will remain true until TAI, for reasons here. And this is a widespread consensus.
There’s a very obvious failure mode in which: the world-model models the world, and the planner plans, and the value function calculates values, etc. … and at the end of all that, the AI system as a whole hatches and executes a plan to wipe out humanity. The major unsolved problem is: how do we confidently avoid that?
Then separately, there’s a different, weird, exotic type of failure mode, where, for example, there’s a full-fledged AGI agent, one that can do out-of-the-box foresighted planning etc., but this agent is not working within the designed AGI architecture (where the planner plans etc. as above), but rather the whole agent is hiding entirely within the world-model. I think that, in this kind of system, the risk of this exotic failure mode is very low, and can be straightforwardly mitigated to become even lower still. I wrote about it a long time ago at Thoughts on safety in predictive learning.
I really think we should focus first and foremost on the very obvious failure mode, which again is an unsolved problem that is very likely to manifest, and we should put aside the weird exotic failure mode at least until we’ve solved the big obvious one.
When we put aside the exotic failure mode and focus on the main one, then we’re no longer worried about “the world model contains dangerous minds”, but rather we’re worried about “something(s) in the world model has been flagged as desirable, that shouldn’t have been flagged as desirable”. This is a hard problem not only because of the interpretability issue (I think we agree that the contents of the world-model are inscrutable, and I hope we agree that those inscrutable contents will include both good things and bad things), but also because of concept extrapolation / goal misgeneralization (i.e., the AGI needs to have opinions about plans that bring it somewhere out of distribution). It’s great if you want to think about that problem, but you don’t need to “understand intelligence” for that, you can just assume that the world-model is a Bayes net or whatever, and jump right in! (Maybe start here!)
Right, but the ability to limit the depth of a tree search is basically useless for getting you to safe and beneficial AGI, because you don’t know the depth that allows dangerous plans, nor do you know that dangerous plans won’t actually be simpler (less depth) than intended plans. This is a very general problem. This problem applies equally well to limiting the compute of a black box, limiting the number of steps of MCMC, limiting the amount of (whatever OpenCog AtomSpace does), etc.
[You can also potentially use tree search depth to try to enforce guarantees about myopia, but that doesn’t really work for other reasons.]
“Hundreds of lines” is certainly wrong, because you can easily recognize tens of thousands of distinct categories of visual objects. Probably hundreds of thousands.
Proofs sound nice, but what do you think you can realistically prove that will help with Safe and Beneficial AGI? You can’t prove things about what AGI will do in the real world, because the real world will not be encoded in your formal proof system. (pace davidad).
“Applying program transformations that optimize for human understandable structure” sounds nice, but only gets you to “inscrutable” from “even more inscrutable”. The visual world is complex. The algorithm can’t be arbitrarily simple, while still capturing that complexity. Cf. “computational irreducibility”.
What I’m trying to do in this whole comment is point you towards various “no-go theorems” that Eliezer probably figured out in 2006 and put onto Arbital somewhere.
Here’s an analogy. It’s appealing to say: “I don’t understand string theory, but if I did, then I would notice some new obvious way to build a perpetual motion machine.”. But no, you won’t. We can rule out perpetual motion machines from very general principles that don’t rely on how string theory works.
By the same token, it’s appealing to say: “I don’t understand intelligence, but if I did, then I would notice some new obvious way to guarantee that an AGI won’t try to manipulate humans.”. But no, you won’t. There are deep difficulties that we know you’re going to run into, based on very general principles that don’t rely on the data format for the world-model etc.
I suggest thinking harder about the shape of the solution, getting all the way to Safe & Beneficial AGI. I think you’ll come to realize that figuring out the data format for the world-model etc. is not only dangerous (because it’s AGI capabilities research) but doesn’t even help appreciably with safety anyway.