I can see the capabilities vs. safety argument both ways. On the one hand, by working on capabilities, we may get some insights. We could figure out how much of a factor data is, and what kind of data it needs to be. We could figure out how long-term planning emerges, and try our hand at building transparency into the model. We could figure out whether the system will need separate modules for world modeling vs. reward modeling. On the other hand, if intelligence turns out to be not that hard, and all we need to do is train a giant decision transformer… then we have major problems.
I think it would be great to focus capabilities research into a narrower space, as Razied suggests. My hunch is that a giant language model by itself would not go foom, because it isn't really optimizing for anything other than predicting the next token. It's not even really aware of the passage of time. I can't imagine it having a drive to, for example, make the world output only a single word forever. I think the danger would be in trying to turn it into an agent.
I also think there must be alignment work that can be done without knowing the exact nature of the final product. For example, learning the human value function, whether via a brain-like formulation or inverse RL (a toy sketch of the latter is below). I am also curious whether any work has been done on finding a "least bad" nondegenerate value function, i.e. one that doesn't kill us, torture us, or tile the universe with junk, even if it doesn't capture what we want perfectly. I think relevant safety work can always take the form of: "suppose current technology, scaled up (e.g. a decision transformer), could go foom; what should we do right now to constrain it?" There is some risk that future advances will look very different and that work done at this stage won't be directly applicable, but I imagine it would still be useful somehow. Also, my intuition is that we could always keep wondering what the next step in capabilities is, right up until the final step, and we may not know it's the final step.
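As a concrete illustration of the kind of "learn the human value function" work I have in mind, here is a toy sketch of max-entropy-style inverse RL on a tiny made-up chain MDP. The environment, the "human" demonstrations, and all the hyperparameters are my own illustrative assumptions; a real attempt would look nothing like five states and a linear reward, but the basic loop is the same: fit a reward so that behavior under it matches the demonstrator's behavior.

```python
import numpy as np

# Toy deterministic chain MDP: states 0..4, actions 0 = left, 1 = right.
# Everything here (environment, demos, hyperparameters) is made up for illustration.
N_STATES, N_ACTIONS, HORIZON = 5, 2, 8

def step(s, a):
    return max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)

# P[s, a] = successor state (deterministic transitions).
P = np.array([[step(s, a) for a in range(N_ACTIONS)] for s in range(N_STATES)])

# Hypothetical "human" demonstrations: always walk right from state 0.
expert_trajs = [[min(t, N_STATES - 1) for t in range(HORIZON)] for _ in range(20)]
expert_counts = np.zeros(N_STATES)
for traj in expert_trajs:
    for s in traj:
        expert_counts[s] += 1
expert_counts /= len(expert_trajs)  # average state-visit counts per demo

reward = np.zeros(N_STATES)  # reward estimate, linear in one-hot state features
lr = 0.1

for _ in range(200):
    # Backward pass: soft (max-ent) value iteration under the current reward.
    V = np.zeros(N_STATES)
    policy = np.zeros((HORIZON, N_STATES, N_ACTIONS))
    for t in reversed(range(HORIZON)):
        Q = reward[:, None] + V[P]            # Q[s, a] = r(s) + V(next state)
        V = np.log(np.exp(Q).sum(axis=1))     # soft max over actions
        policy[t] = np.exp(Q - V[:, None])    # Boltzmann policy at step t

    # Forward pass: expected state-visit counts under that policy.
    d = np.zeros(N_STATES)
    d[0] = 1.0                                # every demo starts in state 0
    visits = np.zeros(N_STATES)
    for t in range(HORIZON):
        visits += d
        d_next = np.zeros(N_STATES)
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                d_next[P[s, a]] += d[s] * policy[t, s, a]
        d = d_next

    # Max-ent IRL gradient: match the demonstrator's state-visit counts.
    reward += lr * (expert_counts - visits)

# The recovered reward (up to an additive constant) should rise toward state 4.
print(np.round(reward - reward.min(), 2))
```

None of this says anything about whether the recovered reward is safe to optimize hard, which is exactly the "least bad value function" question above, but it is the sort of thing that can be worked on today without knowing what the final system looks like.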
One thing you have to admit, though: capabilities research is just plain exciting, probably on the same level as working on the Manhattan Project. I mean, who doesn't want to know how intelligence works?