I think it’s important to distinguish between ambitious and narrow value learning here. It does seem plausible that many/most narrow values do exist at the initial time step, so something like IRL should be able to recover them. On the other hand, preferences over long-term outcomes probably don’t exist at the initial time step in enough detail to act on.
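For concreteness, here’s a toy sketch of the “recover narrow values” step, assuming a Boltzmann-rational demonstrator choosing among a few options. Everything here (the numbers, the fixed inverse temperature) is made up for illustration; real IRL would infer a reward over states from trajectories, but the likelihood structure is the same:

```python
# Toy sketch: recovering "narrow values" over a handful of options from
# observed choices, assuming the demonstrator is Boltzmann-rational.
# All numbers here are illustrative, not from any real experiment.
import numpy as np

rng = np.random.default_rng(0)

# Hidden narrow values the demonstrator acts on (unknown to the learner).
true_values = np.array([1.0, 0.2, -0.5, 0.7])

def boltzmann(v, beta=3.0):
    # Softmax choice probabilities, shifted by max(v) for numerical stability.
    p = np.exp(beta * (v - v.max()))
    return p / p.sum()

# Simulate demonstrations: 500 repeated choices among the four options.
choices = rng.choice(len(true_values), size=500, p=boltzmann(true_values))
counts = np.bincount(choices, minlength=len(true_values))

# Maximum-likelihood recovery by gradient ascent on the log-likelihood of the
# observed choices. Values are only identified up to an additive constant
# (and up to scale if the inverse temperature beta were unknown).
est = np.zeros_like(true_values)
beta = 3.0
for _ in range(2000):
    p = boltzmann(est, beta)
    grad = beta * (counts - len(choices) * p)  # d(log-likelihood)/d(est)
    est += 0.01 * grad / len(choices)

print("true (centered):     ", true_values - true_values.mean())
print("recovered (centered):", est - est.mean())
```

The point is just that values which already drive present behavior leave a statistical footprint that this kind of inference can pick up; preferences over long-term outcomes that don’t yet exist leave no such footprint.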
IMO the main problem with ambitious value learning is that the only plausible way of doing it goes through a trusted reflection process (e.g. HCH, or having the AI do philosophy using trusted methods). And if we trust the reflection process to construct preferences over long-term outcomes, we might as well use it to directly decide what actions to take, so ambitious value learning is FAI-complete. (In other words, there isn’t a clear advantage to asking the reflection process “how valuable is X” instead of “which action should the AI take”; they seem about as difficult to answer correctly.)
IMO the main problem with narrow value learning is that there isn’t a very good story for how an agent that is smarter than its overseers can pursue its overseers’ instrumental values, given that those values are incoherent from its perspective; this seems related to the hard problem of corrigibility. One way to resolve this is to make sure the overseer is smarter than the value-learning agent at each step, in which case narrow value learning is an implementation strategy for ALBA (and the entire setup inherits ALBA’s difficulties). Another way is to figure out how the AI can pursue the instrumental values of an agent weaker than itself.
I am curious whether you are thinking more of ambitious or narrow value learning when you write posts like this one.
I’m thinking about this counterfactually (that’s the topic of a subsequent post, which replaces the “stratified learning” one), so the thing that distinguishes ambitious from narrow learning is that narrow learning is the same in many counterfactual situations, while ambitious learning is much more floppy/dependent on the details of the counterfactual.
OK, I didn’t understand this comment at all, but maybe I should wait until you post on counterfactuals.