Interpretability will fail—future DL descendant is more of a black box, not less
It certainly makes interpretability harder, but it seems like the possible gain is also larger, making it a riskier bet overall. I'm not convinced that it decreases the expected value of interpretability research, though. Do you have a good intuition for why it would make interpretability less valuable, or at least not valuable enough to offset the increased risk of failure?
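To spell out the framing I have in mind (the symbols here are mine, not anything from the post): write the expected value of interpretability research as

$$\mathbb{E}[\text{value}] = p \cdot V$$

where $p$ is the probability the research pans out and $V$ is the payoff if it does. Making the system more of a black box pushes $p$ down (the riskier bet), while the larger possible gain pushes $V$ up, so the sign of the change in $p \cdot V$ isn't settled by riskiness alone.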
IRL/Value Learning is far more difficult than first appearances suggest, see #2
That’s not immediately clear to me. Could you elaborate?
Do you have a good intuition for why it would make interpretability less valuable, or at least not valuable enough to offset the increased risk of failure?
Just discussed this above with Quintin Pope: it's more a question of interpretability cost scaling unfavorably with the complexity that recursive self-optimization (RSO) produces.
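To make the cost-scaling worry concrete, here's a purely illustrative toy model; the growth rates, constants, and the quadratic-cost/linear-value assumption are all made up by me, not estimates from anywhere:

```python
# Toy model of the cost-scaling worry: all functional forms and constants here
# are made-up assumptions for illustration, not estimates.

def complexity(rso_steps, growth=1.5):
    """Architectural complexity after some number of RSO rounds (assumed geometric growth)."""
    return growth ** rso_steps

def interpretability_cost(c, k_cost=1.0):
    """Assumed superlinear (quadratic) cost of interpreting a system of complexity c."""
    return k_cost * c ** 2

def interpretability_value(c, k_value=10.0):
    """Assumed merely linear value of a successful interpretation at complexity c."""
    return k_value * c

for steps in range(0, 9, 2):
    c = complexity(steps)
    ratio = interpretability_value(c) / interpretability_cost(c)
    print(f"RSO steps={steps:2d}  complexity={c:8.1f}  value/cost={ratio:6.3f}")
```

Under these assumed curves the value/cost ratio falls by exactly the factor complexity grows each step; if interpretability cost turned out to scale only linearly with complexity, the ratio would stay flat and the worry would mostly dissolve.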
IRL/Value Learning is far more difficult than first appearances suggest, see #2
That’s not immediately clear to me. Could you elaborate?
First off, value learning is essential for successful alignment. It really has two components, though: learning the utility/value functions of external agents, and then adopting those as the agent's own utility/value function. The first part we get for free; the second part is difficult because it conflicts with intrinsic motivation: somehow we need to transition from intrinsic motivation (which is what drives the value learning in the first place) to the learned external motivation. Getting this right was difficult for biological evolution, and seems far more difficult for future ANNs with a far more liquid RSO architecture. I have a half-written post going deeper into this. I do think we can and should learn a great deal more about how altruism, empathy, and value learning work in the brain.
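A skeletal sketch of the two components, just to separate them; the class and function names, the frequency-count stand-in for inference, and the handoff knob are all hypothetical scaffolding of mine, not a working proposal:

```python
# Skeleton of the two components of value learning described above. Class and
# function names, the frequency-based stand-in for inference, and the handoff
# knob are all hypothetical scaffolding, not a working proposal.

from typing import Callable, Dict, List, Optional, Tuple

State, Action = int, int
ValueFn = Callable[[State, Action], float]

def infer_external_values(demonstrations: List[Tuple[State, Action]]) -> ValueFn:
    """Component 1 (the part we get 'for free'): estimate the external agent's
    value function from its behaviour, here with a trivial frequency count
    standing in for a real IRL procedure."""
    counts: Dict[Tuple[State, Action], int] = {}
    for s, a in demonstrations:
        counts[(s, a)] = counts.get((s, a), 0) + 1
    total = max(sum(counts.values()), 1)
    return lambda s, a: counts.get((s, a), 0) / total

class ValueLearner:
    """Component 2 (the hard part): moving from the intrinsic motivation that
    drives the learning to the learned external motivation."""

    def __init__(self, intrinsic: ValueFn):
        self.intrinsic = intrinsic            # what originally motivates the agent
        self.learned: Optional[ValueFn] = None
        self.handoff = 0.0                    # 0 = purely intrinsic, 1 = purely learned

    def observe(self, demonstrations: List[Tuple[State, Action]]) -> None:
        self.learned = infer_external_values(demonstrations)

    def reward(self, s: State, a: Action) -> float:
        # The crux is how to push self.handoff toward 1 without the intrinsic
        # drive resisting or distorting the learned values; nothing here answers that.
        if self.learned is None:
            return self.intrinsic(s, a)
        return (1 - self.handoff) * self.intrinsic(s, a) + self.handoff * self.learned(s, a)
```

The easy half is infer_external_values; everything interesting and dangerous lives in how (and whether) the handoff ever gets moved toward the learned values.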
I’m not at all convinced of this. In fact, I suspect self-optimizing systems will be more interpretable (assuming we’re willing to bother putting any effort towards this goal). See my comment here making this case.