Solving interpretability with an AGI (even with humans-in-the-loop) might not lead to particularly great insights on a general alignment theory or even on how to specifically align a particular AGI
Wouldn’t it at least solve corrigibility by making it possible to detect the formation of undesirable end-goals? I think even GPT-4 can classify a textual interpretation of an end-goal on the basis of its general desirability for humans.
It seems to need another assumption, namely that the AGI has sufficient control of its internal state and knowledge of the detection network to be able to bring itself into a state that produces an interpretation that trips the detection network, while also allowing the AGI to advance its agenda.
Wouldn’t it at least solve corrigibility by making it possible to detect the formation of undesirable end-goals? I think even GPT-4 can classify a textual interpretation of an end-goal on the basis of its general desirability for humans.
I really don’t expect “goals” to be explicitly written down in the network. There will very likely not be a thing that says “I want to predict the next token” or “I want to make paperclips” or even a utility function of that. My mental image of goals is that they are put “on top” of the model/mind/agent/person. Whatever they seem to pursue, independently of their explicit reasoning.
Anyway, detecting goals, detecting deceit, detecting hidden knowledge of the system is a good thing to have. Interpretability of those things is needed. But interpretability cuts both ways, and with a fully interpretable AGI, foom seems to be a great danger. That’s what I wanted to point out. With a fast intelligence explosion (one that doesn’t need slow retraining or multiple algorithmic breakthroughs) capabilities will explode along with it, while alignment won’t.
It seems to need another assumption, namely that the AGI has sufficient control of its internal state and knowledge of the detection network to be able to bring itself into a state that produces an interpretation that trips the detection network, while also allowing the AGI to advance its agenda.
It is not clear to me what you are referring to here. Do you think we will have detection networks? Detection for what? Deceit? We might literally have the AGI look inside for a purpose (like in the new OpenAI paper). I hope we have something that tells us if it wants to self-modify, but if nobody points out the danger of foom, we likely won’t have that.
I really don’t expect “goals” to be explicitly written down in the network. There will very likely not be a thing that says “I want to predict the next token” or “I want to make paperclips” or even a utility function of that. My mental image of goals is that they are put “on top” of the model/mind/agent/person. Whatever they seem to pursue, independently of their explicit reasoning.
I’m sure that I don’t understand you. GPT most likely doesn’t have “I want to predict next token” written somewhere, because it doesn’t want to predict next token. There’s nothing in there that will actively try to predict next token no matter what. It’s just the thing it does when it runs.
Is it possible to have a system that just “actively try to make paperclips no matter what” when it runs, but it doesn’t reflect it in its reasoning and planning? I have a feeling that it requires God-level sophistication and knowledge of the universe to create a device that can act like that, when the device just happens to act in a way that robustly maximizes paperclips while not containing anything that can be interpreted as that goal.
I found that I can’t precisely formulate why I feel that. Maybe I’ll be able to express that in a few weeks (or I’ll find that the feeling is misguided).
A system that looks like “actively try to make paperclips no matter what” seems like the sort of thing that an evolution-like process could spit out pretty easily. A system that looks like “robustly maximize paperclips no matter what” maybe not so much.
I expect it’s a lot easier to make a thing which consistently executes actions which have worked in the past than to make a thing that models the world well enough to calculate expected value over a bunch of plans and choose the best one, and have that actually work (especially if there are other agents in the world, even if those other agents aren’t hostile—see the winner’s curse).
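To make the winner’s curse point concrete, here is a toy simulation (a sketch added for illustration, not something anyone in the thread ran). The assumptions are all made up: each plan’s true value is drawn from a standard normal, the planner only sees that value plus independent Gaussian noise, and it picks the plan with the highest estimate. The function name and parameters are hypothetical.

```python
import random
import statistics


def winners_curse_gap(num_plans=20, noise_sd=1.0, trials=2000, seed=0):
    """Average (estimated - true) value of the plan picked by argmax.

    Toy assumptions: true values ~ N(0, 1), and the planner sees
    estimate = true value + N(0, noise_sd) noise for every plan.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        true_values = [rng.gauss(0.0, 1.0) for _ in range(num_plans)]
        estimates = [v + rng.gauss(0.0, noise_sd) for v in true_values]
        # Choose the plan that looks best according to the noisy estimates.
        best = max(range(num_plans), key=lambda i: estimates[i])
        # How much the planner overestimated the plan it actually chose.
        gaps.append(estimates[best] - true_values[best])
    return statistics.mean(gaps)
```

With more than one plan the returned gap comes out positive: the chosen plan looks better on paper than it really is, because selecting on the estimate also selects for lucky noise. With a single plan there is nothing to select on, so the gap hovers around zero.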
I feel the exact opposite! Creating something that seems to maximise something, without it having a clear idea of what its goal is, seems really natural IMO. You said it yourself, GPT “wants” to predict the correct probability distribution of the next token, but there is probably not a thing inside actively maximising for that; instead it’s very likely to be a bunch of weird heuristics that were selected by the training method because they work.
If you instead meant that GPT is “just an algorithm” I feel we disagree here as I am pretty sure that I am just an algorithm myself.
Look at us! We can clearly model a single human as having a utility function (ok, maybe given their limited intelligence it’s actually hard) but we don’t know what our utility actually is. I think Rob Miles made a video about that iirc.
My understanding is that the utility function and expected utility maximiser is basically the theoretical pinnacle of intelligence! Not your standard human or GPT or near-future AGI. We are also quite myopic (and whatever near-future AGI we make will also be myopic at first).
Is it possible to have a system that just “actively try to make paperclips no matter what” when it runs, but it doesn’t reflect it in its reasoning and planning?
I’d say that it can reflect on its reasoning and planning, but it just plasters the universe with tiny molecular spirals because it just likes that more than keeping humans alive.
I think this tweet by EY https://twitter.com/ESYudkowsky/status/1654141290945331200 shows what I mean. We don’t know what the ultimate dog is, we don’t know what we would have created if we did have the capabilities to make a dog-like thing from scratch. We didn’t create ice-cream because it maximises our utility function. We just stumbled on its invention and found that it is really yummy.
But I really don’t want to venture too far into this. I am writing something along these lines in order to deconfuse myself; the divide between agents in the theoretical sense and real systems is not exactly clear to me.
So to keep the discussion on-topic, what I think is:
interpretability to “correct” the system: good, but be careful pls
interpretability for capabilities: bad
You said it yourself, GPT “wants” to predict the correct probability distribution of the next token
No, I said that GPT does predict next token, while probably not containing anything that can be interpreted as “I want to predict next token”. Like a bacterium does divide (with possible adaptive mutations), while not containing “be fruitful and multiply” written somewhere inside.
If you instead meant that GPT is “just an algorithm”
No, I certainly didn’t mean that. If the extended Church-Turing thesis holds for the macroscopic behavior of our bodies, we can indeed be represented as Turing-machine algorithms (with a polynomial multiplier on efficiency).
What I feel, but can’t precisely convey, is that there’s a huge gulf (in computational complexity maybe) between agentic systems (that do have explicit internal representation of, at least, some of their goals) and “zombie-agentic” systems (that act like agents with goals, but have no explicit internal representation of those goals).
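A toy illustration of the distinction, purely as a sketch (the 7-cell world, the class names and the table are all hypothetical, not anything from the thread). Both agents below behave identically, but only the first one contains something an interpretability tool could read off as “the goal”.

```python
class ExplicitGoalAgent:
    """Stores its goal explicitly and moves to reduce the distance to it."""

    def __init__(self, goal):
        self.goal = goal  # inspectable internal representation of the goal

    def act(self, position):
        if position == self.goal:
            return 0
        return 1 if position < self.goal else -1


class ReflexAgent:
    """Just a position -> move table; nothing inside names a goal."""

    def __init__(self, table):
        self.table = dict(table)

    def act(self, position):
        return self.table[position]


def run(agent, start, steps=10):
    """Let the agent move for a fixed number of steps and return where it ends up."""
    position = start
    for _ in range(steps):
        position += agent.act(position)
    return position


# Hypothetical world: cells 0..6, both agents end up in cell 4.
reflex_table = {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: -1, 6: -1}
explicit = ExplicitGoalAgent(goal=4)
reflex = ReflexAgent(reflex_table)
```

In this sketch the reflex table needs one entry per state, while the explicit agent stays the same size however large the world gets. That might be one way to read the “gulf”, though it is only a guess at what the intuition is pointing at.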
we don’t know what our utility actually is
How do you define the goal (or utility function) of an agent? Is it something that actually happens when the universe containing the agent evolves in its usual physical fashion? Or is it something that was somehow intended to happen when the agent is run (but may not actually happen due to circumstances and the agent’s shortcomings)?
Disclaimer: These are all hard questions and points whose true answers I don’t know; these are just my views, what I have understood up to now. I haven’t studied expected utility maximisers, precisely because I don’t expect the abstraction to be useful for the kind of AGI we are going to be making.
There’s a huge gulf between agentic systems and “zombie-agentic” systems (that act like agents with goals, but have no explicit internal representation of those goals)
I feel the same, but I would say that it’s the “real-agentic” system (or a close approximation of it) that needs God-level knowledge of cognitive systems (which is why orthodox alignment by building the whole mind from theory is really hard). An evolved system like us or like GPT, IMO, seems closer to a “zombie-agentic” system. I feel the key thing for understanding each other might be coherence, and how coherence can vary from introspection, but I am not knowledgeable enough to delve into this right now.
How do you define the goal (or utility function) of an agent? Is it something that actually happens when the universe containing the agent evolves in its usual physical fashion? Or is it something that was somehow intended to happen when the agent is run (but may not actually happen due to circumstances and the agent’s shortcomings)?
The view in my mind that makes sense is that a utility function is an abstraction that you can put on top of basically anything if you wish. It’s a hat to describe a system that does things in the most general way. The framework is borrowed from economics, where human behaviour is modelled with more or less complicated utility functions, but whether or not there is an internal representation is mostly irrelevant. And, again, I don’t expect a DL system to display anything remotely close to a “goal circuit”, but we can still describe such systems as having a utility function and as being maximisers (with finite cognitive power) of that UF. But the UF, from our side, would be just a guess. I don’t expect us to crack that with interpretability of neural networks learned by gradient descent.
What I meant to articulate was: the utility function and expected utility maximiser is a great framework for thinking about intelligent agents, but it’s a theory put on top of the system; it doesn’t need to be internal. In fact that system is incomputable (you need a hypercomputer to make the right decision).
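A minimal sketch of the “hat on top” idea, under made-up assumptions: the heuristic policy, the action names and both helper functions are hypothetical. Given any deterministic policy, you can read a utility function off its behaviour such that the policy looks like a maximiser of it, even though nothing resembling that function exists inside the policy.

```python
def rationalising_utility(policy):
    """Return a utility function U(state, action) that the given policy maximises.

    Trivial construction: utility 1 for the action the policy takes, 0 otherwise.
    """
    def utility(state, action):
        return 1.0 if action == policy(state) else 0.0
    return utility


def maximiser(utility, actions):
    """Build a policy that picks the highest-utility action in each state."""
    def policy(state):
        return max(actions, key=lambda action: utility(state, action))
    return policy


# Hypothetical "bunch of heuristics" with no goal written anywhere inside.
def heuristic_policy(state):
    return "left" if state % 3 == 0 else "right"


ACTIONS = ["left", "right", "stay"]
fitted_utility = rationalising_utility(heuristic_policy)
as_if_maximiser = maximiser(fitted_utility, ACTIONS)
```

Here `as_if_maximiser` reproduces `heuristic_policy` on every state. The construction is degenerate on purpose: the utility function is fitted from outside, by looking at behaviour, and inspecting the internals of `heuristic_policy` would never turn it up.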