Alignment in Thought Chains
Alignment is fundamentally about human imitation. We want machines that resemble us in certain key aspects, but evidently not all. These aspects are mostly linked to inner motivations. Since current models are very complicated, alignment will only be feasible if we can target where, and at what level of abstraction, it must take place. Take the thoughts “Maya is cold. That cat is dear to me. I have a blanket. I will use it to keep her warm.” and “The family’s cat is cold. I don’t want to be in trouble for its death. I will use the blanket to keep it warm.” Even though these two thoughts produce the same actions, there is a stark difference in motivation. There can also be different reasoning and actions but similar intentions: “The cat is freezing. I don’t want the cat to die, for I like her. Therefore, I will raise the thermostat to keep her warm.” Intentions are much easier to know if you have access to the thoughts. Therefore, we must align our models by looking at the thoughts behind their actions. If you cannot hide your thoughts well, you cannot hide your motivations well either, for the former are seeded in the latter.

Our neural networks are irreducibly complicated. We will never fully understand the computation they perform by looking at their weights. But take the thought “I am very cold and I have a sweater. The family’s cat, Maya, is shivering. Therefore, she is cold. Encapsulating a being can reduce the outward flow of heat. This blanket can adapt its form to the objects I place it onto. Therefore, I will place it on the cat to keep her warm.” Going down a level of abstraction did not provide more information about the intentions of the cat owner. Therefore, we can save ourselves work by targeting only a higher level of thought. There is no need to understand every parameter in the network.

It should be intuitive that most thoughts are not formed all at once. They are formed step by step. You work to form conclusions, or intermediate thoughts. Once these have been formed, you take the results as stepping stones to reach a vantage point from which you can continue the chain, without redoing the computation every time (this is why mathematics is seen as a Platonic ideal of thinking: a proof is a label that tells you a stepping stone will be eternally solid). If we train our models to generate these intermediate thoughts in a readable format, we can apply reinforcement learning from human feedback (RLHF) to align them. Our agents must then be trained such that their strong capabilities are only reached by forming chains of thought that are verifiable by us. For training to remain fast, RLHF should only be applied at intermediate iterations during the training process, but with large learning rates. The rest of the time, training should be applied only to the outputs.
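As a rough illustration, here is a minimal sketch of that training schedule in Python. Everything in it (DummyModel, human_feedback_reward, outcome_loss, the particular step counts and learning rates) is a hypothetical placeholder rather than a real API or a tested recipe; the only point it makes is the alternation between output-only training and sparse, larger-learning-rate RLHF steps that score the chain of thought itself.

```python
# Sketch only: every name below is a stand-in, not a real library API.
import random

RLHF_EVERY = 100   # RLHF applied only at sparse intermediate iterations
BASE_LR = 1e-5     # learning rate for ordinary, output-only steps
RLHF_LR = 1e-4     # larger learning rate for the rare RLHF steps


class DummyModel:
    """Stand-in for a model that emits a readable chain of thought."""

    def generate_chain(self, prompt):
        chain = f"step-by-step reasoning about: {prompt}"
        answer = f"answer to: {prompt}"
        return chain, answer

    def update(self, objective, lr):
        pass  # a real implementation would take a gradient step here


def human_feedback_reward(prompt, chain, answer):
    """Placeholder for a human (or human-proxy) score of the chain of thought."""
    return random.random()


def outcome_loss(prompt, answer):
    """Placeholder loss computed from the final output only."""
    return random.random()


def train(model, prompts, total_steps):
    for step in range(total_steps):
        prompt = prompts[step % len(prompts)]
        chain, answer = model.generate_chain(prompt)

        if step % RLHF_EVERY == 0:
            # Sparse RLHF step: feedback targets the intermediate thoughts,
            # and the update uses the larger learning rate.
            reward = human_feedback_reward(prompt, chain, answer)
            model.update(-reward, lr=RLHF_LR)
        else:
            # Ordinary step: train only on the final output.
            model.update(outcome_loss(prompt, answer), lr=BASE_LR)


if __name__ == "__main__":
    train(DummyModel(), ["keep the cat warm"], total_steps=1_000)
```

The intent of the schedule, as described above, is that the thought-level RLHF steps stay rare enough to keep training fast, while their larger learning rate lets the few corrections to the chain of thought still carry real weight.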