I’m curious what Chris’s best guess (or anyone else’s) is about where to place AlphaGo Zero on that diagram. Presumably its place is somewhere after “Human Performance”, but is it close to the “Crisp Abstractions” peak, or perhaps way further, somewhere in the realm of “Increasingly Alien Abstractions”?
Specifically, rather than using machine learning to build agents which directly take actions in the world, we could use ML as a microscope—a way of learning about the world without directly taking actions in it.
Is there an implicit assumption here that RL agents are generally more dangerous than models that are trained with (un)supervised learning?
(Later the OP contrasts microscopes with oracles, so perhaps Chris interprets a microscope as a model that is smaller, or otherwise somehow restricted, s.t. we know it’s safe?)
I’m curious what Chris’s best guess (or anyone else’s) is about where to place AlphaGo Zero on that diagram.
Without the ability to poke around at AlphaGo—and a lot of time to invest in doing so—I can only engage in wild speculation. It seems like it must have abstractions that human Go players don’t have or anticipate. This is true of even vanilla vision models before you invest lots of time in understanding them (I’ve learned more than I ever needed to about useful features for distinguishing dog species from ImageNet models).
But I’d hope the abstractions are in a regime where, with effort, humans can understand them. This is what I expect the slope downwards as we move towards “alien abstractions” to look like: we’ll see abstractions that are extremely useful if you can internalize them, but take more and more effort to understand.
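A toy sketch of what “using a model as a microscope” can mean in practice (my own illustrative example, not from the thread; the task and numbers are made up): train a small model on a synthetic problem, then ignore its outputs entirely and read its learned weights to recover a fact about the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: only feature 0 actually predicts the label.
X = rng.normal(size=(500, 5))
y = (X[:, 0] > 0).astype(float)

# Train a logistic regression by plain gradient descent.
w = np.zeros(5)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / len(y)

# "Microscope" step: instead of deploying the model to make predictions,
# inspect what it learned.  The weight vector reveals which feature
# carries the signal.
informative_feature = int(np.argmax(np.abs(w)))
print(informative_feature)  # feature 0 dominates
```

Here the “knowledge” extracted is trivial, but the workflow is the point: the model is never hooked up to anything downstream; we only look inside it.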
Is there an implicit assumption here that RL agents are generally more dangerous than models that are trained with (un)supervised learning?
Yes, I believe that RL agents have a much wider range of accident concerns than supervised / unsupervised models.
Later the OP contrasts microscopes with oracles, so perhaps Chris interprets a microscope as a model that is smaller, or otherwise somehow restricted, s.t. we know it’s safe?
Gurkenglas provided a very eloquent description that matches why I believe this. I’ll continue discussion of this in that thread. :)
Yes, I believe that RL agents have a much wider range of accident concerns than supervised / unsupervised models.
Is there anything that prevents them from being used as microscopes, though? Presumably you can still inspect the models they have learned without using them as agents (after they’ve been trained). Or am I missing something?
To me, the important safety feature of “microscope AI” is that the AI is not modeling the downstream consequences of its outputs (which automatically rules out manipulation and deceit). This feature is totally incompatible with agents (you can’t vacuum the floor without modeling the consequences of your motor control settings), and optional for oracles [I’m using oracles in a broad sense of systems that you use to help answer your questions, leaving aside what their exact user interface is, so microscope AI is part of that]. For example, when Eliezer thinks about oracles he is not thinking this way; instead, he’s thinking of a system that deliberately chooses an output to “increase the correspondence between the user’s belief about relevant consequences and reality”. But there’s no reason in principle that we couldn’t build a system that will not apply its intelligent world-model to analyze the downstream consequences of its outputs.
I think the only way to do that is to have its user interface not be created automatically as part of the training objective, but rather built by us separately. Then the two key questions are: what’s the safe training procedure that results in an intelligent world-model, and what’s the separate input-output interface that we’re going to build? Both of these are open questions AFAIK. I wrote Self-Supervised Learning and AGI Safety laying out this big picture as I see it.
For the latter question (what is the user interface?), “Use interpretability tools & visualizations on the world-model” seems about as good an answer as any, and I am very happy to have Chris and others trying to flesh out that vision. I hope that they don’t stop at feature extraction, but also pull out the relationships (causal, compositional, etc.) that we need for counterfactual reasoning, planning, etc., and even a “search through causal pathways to get desired consequences” interface. Incidentally, the people who think that brain-computer interfaces will help with AGI safety (cf. waitbutwhy) seem to be banking on something vaguely like “microscope AI”, but I haven’t yet found any detailed discussion along those lines.
For the first question (what is the safe training procedure that incidentally creates a world-model?), contra Gurkenglas’s comment here, I think it’s an open question whether a safe training procedure exists. For example, unsupervised (a.k.a. “self-supervised”) learning, as ofer suggests, seems awfully safe, but is it really? See Self-Supervised Learning and Manipulative Predictions; I half-joked there about burying the computer in an underground bunker and running self-supervised learning under homomorphic encryption until training was complete, then cutting power, digging it out, and inspecting the world-model. But even then, an ambitious misaligned system could potentially leave manipulative booby-traps on its hard drive. Gurkenglas’s suggestion of telling it nothing about the universe (e.g. have it play Nomic) would make it possibly safer but dramatically less useful (it won’t understand the cause of Alzheimer’s, etc.). And it can probably still learn quite a bit about the world by observing its own algorithm… I’m not sure; I’m still generally optimistic that a solution exists, and I hope that Gurkenglas and I and everyone else keep thinking about it. :-)
To me, the important safety feature of “microscope AI” is that the AI is not modeling the downstream consequences of its outputs (which automatically rules out manipulation and deceit).
As I mentioned in this comment, not modeling the consequences of its output is actually exactly what I want to get out of myopia.
For the latter question (what is the user interface?), “Use interpretability tools & visualizations on the world-model” seems about as good an answer as any, and I am very happy to have Chris and others trying to flesh out that vision.
Yep; me too!
I hope that they don’t stop at feature extraction, but also pull out the relationships (causal, compositional, etc.) that we need for counterfactual reasoning, planning, etc., and even a “search through causal pathways to get desired consequences” interface.
Chris (and the rest of Clarity) are definitely working on stuff like this!
unsupervised (a.k.a. “self-supervised”) learning, as ofer suggests, seems awfully safe, but is it really?
I generally agree that unsupervised learning seems much safer than other approaches (e.g. RL), though I also agree that there are still concerns. See for example Abram’s recent “The Parable of Predict-O-Matic” and the rest of his Partial Agency sequence.
As I understood it, an Oracle AI is asked a question and produces an answer. A microscope is shown a situation and constructs an internal model that we then extract by reading its innards. Oracles must somehow be incentivized to give useful answers, microscopes cannot help but understand.
Oracles must somehow be incentivized to give useful answers
A microscope model must also be trained somehow, for example with unsupervised learning. Therefore, I expect such a model to also look like it’s “incentivized to give useful answers” (e.g. an answer to the question: “what is the next word in the text?”).
My understanding is that what distinguishes a microscope model is the way it is used after it’s already trained (namely, allowing researchers to look at its internals for the purpose of gaining insights, etc., rather than making inferences for the sake of using its valuable output). If this is correct, it seems that we should still use only safe training procedures, training useful microscopes rather than arbitrarily capable models.
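A minimal sketch of that distinction (my own toy example, assuming nothing beyond the comment): the training objective is ordinary next-word prediction, but the microscope step afterwards reads the learned model directly instead of asking it for answers.

```python
import numpy as np

corpus = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# "Unsupervised" training objective: predict the next word.
# Here the learned model is just a bigram count matrix.
counts = np.zeros((len(vocab), len(vocab)))
for a, b in zip(corpus, corpus[1:]):
    counts[idx[a], idx[b]] += 1

# Microscope use: rather than generating text (using its output),
# read the learned parameters to see what structure it found in the data.
follower = vocab[int(np.argmax(counts[idx["the"]]))]
print(follower)  # "the" is most often followed by "cat" in this corpus
```

The training incentive (“give useful answers to ‘what comes next?’”) is identical in both cases; only the post-training use differs.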
Our usual objective is “Make it safe, and if we aligned it correctly, make it useful.” A microscope is useful even if it’s not aligned, because having a world model is a convergent instrumental goal. We increase the bandwidth from it to us, but we decrease the bandwidth from us to it. By telling it almost nothing, we hide our position in the mathematical universe, and any attack it devises cannot be specialized on humanity. Imagine finding the shortest-to-specify abstract game that needs AGI to solve (Nomic?), then instantiating an AGI to solve it just to learn about AI design from the inner optimizers it produces.
It could deduce that someone is trying to learn about AI design from its inner optimizers, and maybe it could deduce our laws of physics because they are the simplest ones that would attempt such an experiment, but quantum experiments show it cannot deduce its Everett branch.
Ideally, the tldrbot we set to interpret the results would use a random perspective onto the microscope so the attack also cannot be specialized on the perspective.
“Specifically, rather than using machine learning to build agents which directly take actions in the world, we could use ML as a microscope—a way of learning about the world without directly taking actions in it.”
Is there an implicit assumption here that RL agents are generally more dangerous than models that are trained with (un)supervised learning?
Couldn’t you use it as a microscope regardless of whether it was trained using RL or (un)supervised learning?
It seems to me that whether it’s a microscope is about what you do with it after it’s trained. In other words, an RL agent only needs to be an agent during training. Once it’s trained, you could still inspect the models it has learned without hooking it up to any effectors.
However, Chris replied yes to this question, so maybe I’m missing something.
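A toy sketch of the “agent only during training” idea (my own illustrative example; the environment and hyperparameters are made up): the system acts in a tiny chain world while Q-learning runs, and afterwards we disconnect it from all effectors and just read its learned value estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, goal = 5, 4
Q = np.zeros((n_states, 2))  # actions: 0 = left, 1 = right

# RL training phase: the system is an agent, acting in a chain environment
# with a reward at the rightmost state.
for _ in range(500):
    s = 0
    while s != goal:
        a = int(rng.integers(2)) if rng.random() < 0.5 else int(np.argmax(Q[s]))
        s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s2 == goal else 0.0
        Q[s, a] += 0.5 * (r + 0.9 * np.max(Q[s2]) - Q[s, a])
        s = s2

# Microscope phase: no effectors, no further actions.  We only read the
# learned value estimates, which reveal where the environment's reward is.
values = Q.max(axis=1)
best_known_state = int(np.argmax(values[:goal]))
print(best_known_state)  # the state adjacent to the goal has the highest value
```

Whether doing this with a highly capable agent is actually safe is exactly what the rest of the thread debates; the sketch only illustrates the mechanical distinction between acting and being inspected.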
I’m not sure I understand the question, but in case it’s useful/relevant here:
A computer that trains an ML model/system—via something that looks like contemporary ML methods at an arbitrarily large scale—might be dangerous even if it’s not connected to anything. Humans might get manipulated (e.g. if researchers ever look at the learned parameters), mind crime might occur, acausal trading might occur, the hardware of the computer might be used to implement effectors in some fantastic way. And those might be just a tiny fraction of a large class of relevant risks, the majority of which we can’t currently understand.
Such ‘offline computers’ might be more dangerous than an RL agent that by design controls some actuators, because problems with the latter might be visible to us at a much lower scale of training (and therefore with much less capable/intelligent systems).
Thanks for writing this!