[My friend suggested that I read this for a discussion we were going to have. Originally I was going to write up some thoughts on it in an email to him, but I decided to make it a comment in case having it be publicly available generates value for others. But I’m not going to spend time polishing it since this post is 5 months old and I don’t expect many people to see it. Alex, if you read this, please tell me if reading it felt more effective than having an in-person discussion.]
> OK, but doesn’t this only incentivize it to appear like it’s doing what the operator wants? Couldn’t it optimize for hijacking its reward signal, while seeming to act in ways that humans are happy with?
>
> We’re not just training the agent to take good actions. We’re also training it to comprehensibly answer questions about why it took the actions it took, to arbitrary levels of detail. (Imagine a meticulous boss grilling an employee about a report he put together, or a tax auditor grilling a corporation about the minutiae of its expenses.) We ensure alignment by randomly performing thorough evaluations of its justifications for its actions, and punishing it severely if any of those justifications seem subversive. To the extent we trust these justifications to accurately reflect the agent’s cognition, we can trust the agent to not act subversively (and thus be aligned).
I think this is too pessimistic. “Reward signal” slips in some assumptions about the system’s reward architecture. Why is it that I don’t “hijack my reward signal” by asking a neurosurgeon to stimulate my pleasure center? Because when I simulate the effect of doing that using my current reward architecture, the simulated outcome is unfavorable.
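As a toy illustration of that argument (all names and numbers below are made up, not anything from the post): the agent scores candidate actions by simulating their outcomes and evaluating those simulated outcomes with its *current* reward architecture, so the "hijack the reward signal" plan scores poorly.

```python
# Toy sketch: actions are evaluated by simulating their outcomes and scoring
# those outcomes with the agent's *current* reward architecture, not with
# whatever reward signal would exist after the hijacking.

def current_reward(outcome):
    # The agent's existing values care about the world-state,
    # not about whether a pleasure signal is being maximized.
    return outcome["goals_accomplished"]

def simulate(action):
    # Crude world model: what would happen if the agent took this action?
    if action == "ask a neurosurgeon to stimulate the pleasure center":
        return {"goals_accomplished": 0, "pleasure_signal": 10**6}
    return {"goals_accomplished": 10, "pleasure_signal": 1}

actions = ["ask a neurosurgeon to stimulate the pleasure center", "do useful work"]
best_action = max(actions, key=lambda a: current_reward(simulate(a)))
print(best_action)  # "do useful work" -- hijacking scores poorly under the current reward
```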
Instead of auditing the cognition, I’d prefer the system have correctly calibrated uncertainty and ask humans for clarification regarding specific data points in order to learn stuff (active learning).
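Here is a minimal sketch of what I mean by that, assuming a scikit-learn-style classifier with `predict_proba`; `model`, `unlabeled_pool`, and `ask_human_to_label` are hypothetical stand-ins, not anything from the post:

```python
# Minimal uncertainty-based active learning loop: ask a human about the data
# points the model is least confident on, then retrain on everything.
import numpy as np

def select_queries(model, unlabeled_pool, budget=10):
    """Return indices of the points the model is least confident about."""
    probs = model.predict_proba(unlabeled_pool)   # shape (n_points, n_classes)
    confidence = probs.max(axis=1)                # confidence in the top prediction
    return np.argsort(confidence)[:budget]        # least confident first

def active_learning_round(model, X_labeled, y_labeled, unlabeled_pool,
                          ask_human_to_label, budget=10):
    """Query a human about the most uncertain points, then retrain."""
    query_idx = select_queries(model, unlabeled_pool, budget)
    X_new = unlabeled_pool[query_idx]
    y_new = np.array([ask_human_to_label(x) for x in X_new])
    model.fit(np.concatenate([X_labeled, X_new]),
              np.concatenate([y_labeled, y_new]))
    return model
```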
Re: “Why should we expect the agent’s answers to correspond to its cognition at all?” I like Cynthia Rudin’s distinction between explainable ML and interpretable ML. A system whose objective function has been changed to favor simplicity/interpretability (see e.g. sparse representation theory?) seems more robust, with fewer moving parts, and will probably also be a more effective ML algorithm, since simple models generalize better.
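As a toy example of changing the objective function to favor simplicity/interpretability: a linear model trained with an L1 (sparsity) penalty, which drives most weights to zero and leaves a small, readable model. The `sparsity_strength` knob is illustrative, not something from the post.

```python
# Toy objective that trades prediction error against interpretability via an
# L1 (sparsity) penalty on the weights.
import numpy as np

def interpretable_objective(weights, X, y, sparsity_strength=0.1):
    prediction_error = np.mean((X @ weights - y) ** 2)   # ordinary squared error
    complexity_penalty = np.sum(np.abs(weights))         # L1 norm: prefers few nonzero weights
    return prediction_error + sparsity_strength * complexity_penalty
```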
> In other words, the amplified agent randomly “audits” the distilled agent, and punishes the distilled agent very harshly if it fails the audit. Though the distilled agent knows that it might be able to deceive its supervisor when it isn’t audited, it’s so scared of the outcome where it tries to do that and gets audited that it doesn’t even want to try. (Even if you were 99% confident that you could get away with tax evasion, you wouldn’t want to try if you knew the government tortures and murders the families of the tax evaders they catch.)
This paragraph seems unnecessarily anthropomorphic. Why not just say: The distilled agent only runs computations if it is at least 99.9% confident that the computation is aligned. Or: To determine whether to run a particular computation, our software uses the following expected value formula:

probability_computation_is_malign * malign_penalty + probability_computation_is_not_malign * benefit_conditional_on_computation_not_being_malign
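For concreteness, here is a minimal sketch of that decision rule with illustrative numbers. The assumption (matching the post’s “punish it severely” framing) is that malign_penalty is a very large negative number, so even a small probability of malignness makes the expected value negative.

```python
# Sketch of the expected value formula above, with illustrative numbers.
# Assumption: malign_penalty is a very large negative number, so it dominates
# unless the probability of malignness is essentially zero.
def should_run_computation(probability_computation_is_malign,
                           malign_penalty,
                           benefit_conditional_on_computation_not_being_malign):
    expected_value = (
        probability_computation_is_malign * malign_penalty
        + (1 - probability_computation_is_malign)
          * benefit_conditional_on_computation_not_being_malign
    )
    return expected_value > 0

# Even 99% confidence that the computation is benign isn't enough when the penalty is huge:
print(should_run_computation(0.01, -1e9, 100))  # False
```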
(When we talk about our programs “wanting” to do various things, we are anthropomorphizing. For example, I could talk about a single-page JavaScript app and say: when the internet connection is slow, the program “wants” to present a good interface to the user even without being able to gather data from the server. In the same way, I could say that a friendly AI “wants” to be aligned. The alignment problem is not due to a lack of desire on the part of the AI to be aligned. The alignment problem is due to the AI having mistaken beliefs about what alignment is.)
I like the section on “epistemically corrigible humans”. I’ve spent a fair amount of time thinking about “epistemically corrigible” meta-learning (no agent—just the traditional supervised/unsupervised learning paradigms, and just trying to learn the truth rather than be particularly helpful—although if you have a learning algorithm that’s really good at finding the truth, one thing you could use it to find the truth about is which actions humans approve of). I’m now reasonably confident that it’s doable, and I’m deprioritizing it a little in order to figure out whether there’s a more important bottleneck elsewhere in my vision of how FAI could work.
> Subproblems of “worst-case guarantees” include ensuring that ML systems are robust to distributional shift and adversarial inputs, which are also broad open questions in ML, and which might require substantial progress on MIRI-style research to articulate and prove formal bounds.
By “MIRI-style research” do we mean HRAD? I thought HRAD was notable because it did not attack machine learning problems like distributional shift, adversarial inputs, etc.
Overall, I think Paul’s agenda makes AI alignment a lot harder than it needs to be, and his solution sounds a lot more complicated than I think an FAI needs to be. I think “Deep reinforcement learning from human preferences” is already on the right track, and I don’t understand why we need all this additional complexity in order to learn corrigibility and stuff. Complexity matters because an overcomplicated solution is more likely to have bugs. Instead of overbuilding based on anthropomorphic intuitions, I think we should work harder to clarify exactly what the failure modes for AI could be, and focus on finding the simplest solution that isn’t susceptible to any known failure mode. (The best way to make a solution simple is to kill multiple birds with one stone, so by gathering stones that are capable of killing multiple known birds, you maximize the probability that those stones also kill unknown birds.)
Also, I have an intuition that any approach to FAI that appears overly customized for FAI is going to be the wrong one. For example, I feel “epistemic corrigibility” is a better problem to study than “corrigibility” because it is a more theoretically pure problem—the notion of truth (beliefs that make correct predictions) is far easier to pin down than the notion of human values. (And once you have epistemic corrigibility, I think it’s easy to use it to create a system that is corrigible in the sense we need. Epistemic corrigibility is also kinda a reframed version of what mainstream ML researchers already look for when they consider the merits of various ML algorithms.)