To me, the important safety feature of “microscope AI” is that the AI is not modeling the downstream consequences of its outputs (which automatically rules out manipulation and deceit). This feature is totally incompatible with agents (you can’t vacuum the floor without modeling the consequences of your motor control settings), and optional for oracles [I’m using “oracles” in the broad sense of systems that you use to help answer your questions, leaving aside their exact user interface, so microscope AI falls under that]. For example, when Eliezer thinks about oracles he is not thinking this way; instead, he’s thinking of a system that deliberately chooses an output to “increase the correspondence between the user’s belief about relevant consequences and reality”. But there’s no reason in principle that we couldn’t build a system that does not apply its intelligent world-model to analyze the downstream consequences of its outputs.
I think the only way to do that is to have its user interface not be created automatically as part of the training objective, but rather to build it in ourselves, separately. Then the two key questions are: What’s the safe training procedure that results in an intelligent world-model, and what’s the separate input-output interface that we’re going to build? Both of these are open questions AFAIK. I wrote Self-Supervised Learning and AGI Safety laying out this big picture as I see it.
For the latter question (what is the user interface?), “Use interpretability tools & visualizations on the world-model” seems about as good an answer as any, and I am very happy to have Chris and others trying to flesh out that vision. I hope that they don’t stop at feature extraction, but go on to pull out the relationships (causal, compositional, etc.) that we need for counterfactual reasoning, planning, etc., and even build a “search through causal pathways to get desired consequences” interface. Incidentally, the people who think that brain-computer interfaces will help with AGI safety (cf. waitbutwhy) seem to be banking on something vaguely like “microscope AI”, but I haven’t yet found any detailed discussion along those lines.
For the former question (what is the safe training procedure that incidentally creates a world-model?), contra Gurkenglas’s comment here, I think it’s an open question whether a safe training procedure exists. For example, unsupervised (a.k.a. “self-supervised”) learning, as ofer suggests, seems awfully safe, but is it really? See Self-Supervised Learning and Manipulative Predictions; I half-joked there about burying the computer in an underground bunker and running self-supervised learning under homomorphic encryption until training was complete, then cutting power, digging it out, and inspecting the world-model. But even then, an ambitious misaligned system could potentially leave manipulative booby-traps on its hard drive. Gurkenglas’s suggestion of telling it nothing about the universe (e.g. have it play Nomic) would possibly make it safer, but dramatically less useful (it won’t understand the cause of Alzheimer’s, etc.). And it can probably still learn quite a bit about the world by observing its own algorithm… I’m not sure. I’m still generally optimistic that a solution exists, and I hope that Gurkenglas and I and everyone else keep thinking about it. :-)
To me, the important safety feature of “microscope AI” is that the AI is not modeling the downstream consequences of its outputs (which automatically rules out manipulation and deceit).
As I mentioned in this comment, not modeling the consequences of its output is actually exactly what I want to get out of myopia.
For the latter question, what is the user interface, “Use interpretability tools & visualizations on the world-model” seems about as good an answer as any, and I am very happy to have Chris and others trying to flesh out that vision.
Yep; me too!
I hope that they don’t stop at feature extraction, but also pulling out the relationships (causal, compositional, etc.) that we need to do counterfactual reasoning, planning etc., and even a “search through causal pathways to get desired consequences” interface.
Chris (and the rest of Clarity) are definitely working on stuff like this!
unsupervised (a.k.a. “self-supervised”) learning as ofer suggests seems awfully safe but is it really?
I generally agree that unsupervised learning seems much safer than other approaches (e.g. RL), though I also agree that there are still concerns. See for example Abram’s recent “The Parable of Predict-O-Matic” and the rest of his Partial Agency sequence.