I don’t think I was very clear; let me try to explain.
I mean different things by “intentions” and “terminal values” (and I think you do too?)
By “terminal values” I’m thinking of something like a reward function. If we literally just program an AI to have a particular reward function, then we know that its terminal values are whatever that reward function expresses.
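To make that first sense concrete, here is a minimal sketch, assuming a toy gridworld and a tabular Q-learning agent (none of which come from this exchange): the hand-written reward_fn is the only place values enter the system, so the agent’s “terminal values” are exactly whatever that function expresses.

```python
# Purely illustrative sketch of "literally just programming an AI to have a
# particular reward function": a toy tabular Q-learning agent on a 1-D grid.
# The hand-written reward_fn is the only place "values" enter the system.
# All specifics (grid size, hyperparameters) are assumptions made for the example.
import random

N_STATES, GOAL = 5, 4            # states 0..4; reward only at state 4
ACTIONS = [-1, +1]               # move left / move right

def reward_fn(state: int) -> float:
    """The explicitly programmed reward function: the 'terminal values' in the first sense."""
    return 1.0 if state == GOAL else 0.0

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for _ in range(1000):
    s = random.randrange(N_STATES - 1)            # random non-goal start state
    for _ in range(25):
        # epsilon-greedy action selection
        a = random.choice(ACTIONS) if random.random() < epsilon \
            else max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = reward_fn(s_next)                     # reward_fn fully determines what gets reinforced
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)])
        s = s_next
        if s == GOAL:
            break

# Greedy policy per state; states left of the goal should learn +1 (move right).
print({s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES)})
```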
Whereas “trying to do what H wants it to do” I think encompasses a broader range of things, such as when R is uncertain about the reward function but “wants to learn the right one”, or really any case where R could reasonably be described as “trying to do what H wants it to do”.
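And for the second, broader sense, here is a rough sketch along the same lines, with the candidate reward functions and H’s feedback model assumed purely for illustration: R keeps a posterior over which reward function is the “right” one, updates it from H’s feedback, and maximizes expected reward under that posterior, rather than having any single reward function hard-coded.

```python
# Illustrative sketch (candidate reward functions and the feedback model are
# assumptions): R is uncertain which reward function is "right", keeps a posterior
# over candidates, updates it from H's feedback, and acts to maximize expected
# reward under that posterior. Nothing here is hard-coded as R's terminal value;
# R is "trying to learn the right one".
import random

ACTIONS = ["cook", "clean", "idle"]

# Candidate reward functions R thinks H might have.
CANDIDATES = {
    "likes_cooking":  {"cook": 1.0, "clean": 0.2, "idle": 0.0},
    "likes_cleaning": {"cook": 0.2, "clean": 1.0, "idle": 0.0},
}
posterior = {name: 0.5 for name in CANDIDATES}     # uniform prior over candidates

def h_feedback(action: str) -> float:
    """Stand-in for H: noisy approval signal; H in fact likes cleaning."""
    return CANDIDATES["likes_cleaning"][action] + random.gauss(0, 0.1)

def expected_reward(action: str) -> float:
    return sum(p * CANDIDATES[name][action] for name, p in posterior.items())

for _ in range(50):
    action = max(ACTIONS, key=expected_reward)     # act on current beliefs about H's values
    approval = h_feedback(action)
    # Crude Bayesian update: candidates that predict the observed approval better gain weight.
    likelihoods = {name: max(1e-3, 1.0 - abs(CANDIDATES[name][action] - approval))
                   for name in CANDIDATES}
    z = sum(posterior[n] * likelihoods[n] for n in CANDIDATES)
    posterior = {n: posterior[n] * likelihoods[n] / z for n in CANDIDATES}

# The posterior should concentrate on "likes_cleaning", and R should settle on cleaning.
print(posterior, "->", max(ACTIONS, key=expected_reward))
```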
Talking about a “black box system” was probably a red herring.
Do I terminally value reproduction? What if we run an RL algorithm with genetic fitness as its reward function?
If we define terminal values in the way you specify, then “aligned” (whether parochially or holistically) does not seem like a strong enough condition to actually make us happy about an AI. Indeed, most people at MIRI seem to think that most of the difficulty of alignment is getting from “has X as explicit terminal goal” to “is actually trying to achieve X.”
I realize it’s unclear to me what “trying” means here, and in your definition of intentional alignment. I get the sense that you mean something much weaker by “(actually) trying” than MIRI does, and/or that you think this is a lot easier to accomplish than they do. Can you help clarify?
It seems like you are referring to daemons.

To the extent that daemons result from an AI actually doing a good job of optimizing the right reward function, I think we should just accept that as the best possible outcome.
To the extent that daemons result from an AI doing a bad job of optimizing the right reward function, that can be viewed as a problem with capabilities, not alignment. That doesn’t mean we should ignore such problems; it’s just out of scope.
Indeed, most people at MIRI seem to think that most of the difficulty of alignment is getting from “has X as explicit terminal goal” to “is actually trying to achieve X.”
That seems like the wrong way of phrasing it to me. I would put it more like: “MIRI wants to figure out how to build properly ‘consequentialist’ agents, a capability they view us as currently lacking”.