In this post, I appreciated two ideas in particular:
Loss as chisel
Shard Theory
“Loss as chisel” is a reminder of how loss actually does its job, and of its implications for what AI systems may end up learning. I can’t really argue with it, and it doesn’t sound new to my ear, but it seems important to keep in mind. On its own, it justifies trying to break out of the inner/outer alignment frame. When I start reasoning in its terms, I more easily appreciate how successful alignment could realistically involve AIs that are neither outer nor inner aligned. In practice, it may be unlikely that we get a system like that. Or it may be very likely. I simply don’t know. Loss as chisel just enables me to think better about the possibilities.
In my understanding, shard theory is, instead, a theory of how minds tend to be shaped. I don’t know if it’s true, but it sounds like something that should be investigated. Some people apparently consider it a “dead end,” and I’m not sure whether it’s still an active line of research; my understanding of it is limited. I’m glad I came across it, though, because on its surface it seems like a promising line of investigation. Even if it turns out to be a dead end, I expect to learn something by investigating why.
The post makes further claims motivating its overarching thesis that dropping the outer/inner alignment frame would be good. I don’t know if I agree with the thesis, but it could plausibly be true, and many of the arguments strike me as sensible. In particular, the three claims at the very beginning gave me food for thought: “Robust grading is unnecessary,” “the loss function doesn’t have to robustly and directly reflect what you want,” and “inner alignment to a grading procedure is unnecessary, very hard, and anti-natural.”
I also appreciated the post’s attempt to make sense of inner and outer alignment in very precise terms, keeping in mind how deep learning and reinforcement learning work mechanistically.
A while before reading this post, I had an extremely brief in-person conversation with Alex Turner in which he said he believed outer and inner alignment aren’t good frames. It was a response to me saying I wanted to cover inner and outer alignment in depth on Rational Animations. RA is still going to cover inner and outer alignment, but as a result of reading this post and the Training Stories system, I now think we should definitely also cover alternative frames, and that I should read more about them.
I welcome corrections of any misunderstanding I may have of this post and related concepts.