I think it might be reformulated the other way around: capability scaling tends to amplify existing alignment problems. It is not clear to me that any new alignment problem was added when capabilities scaled up in humans. The problem with human design, which is also visible in animals, is that we don’t have direct, stable high-level goals. We are mostly driven by metric-based, Goodharting-prone goals. There are direct feelings: if you feel cold or pain, you do something that makes the feeling stop; if you feel good, you do the things that lead to feeling good. There are emotions, which work similarly but track internal state. Those are the main drivers, and they do not scale well outside of “training”, i.e. the typical circumstances your ancestors encountered. They have a rigid structure and purpose and do not generalize. Intelligence will find ways to Goodhart them.
That may be why most animals are not too intelligent: animals that Goodhart their basic metrics lose fitness. Too much intelligence is usually not a good deal. It adds energy cost, and more often than not it lets you subvert your fitness metrics until they lose their purpose, without making you much better at tasks where fast heuristics are good enough. We might happen to be a lucky species: our ancestors’ ability to talk, and intelligence more broadly, started to work like peacock feathers, as part of sexual selection and hierarchy games. It is still there; look at how our mating works. Peacocks show off their fine feathers and dance; we get together, talk, and gossip (which we call “dates”). Human females look for someone interesting with a good sense of humor, which is mostly a matter of intelligence and conversation. Intelligence is also a predictor of future hierarchy gains in small, localized societies, just as peacock feathers are a predictor of good health. I’m pretty convinced this is what bootstrapped us up from the level other animals are at.
Getting back to the main topic: our metrics are pretty low-level, concrete, and direct. The higher-level goal that evolution actually targets, fitness (with the complication that it operates per gene and per gene combination, not per individual or even per group), is more abstract. Our metrics are effective proxies for it in a more primal environment, and they can be gamed by intelligence.
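To make the proxy-gaming point concrete, here is a minimal toy sketch of my own (purely illustrative, not part of the original argument; all names are hypothetical). `true_objective` stands in for fitness, `proxy` for a feeling-like signal that was a good heuristic in the ancestral range, and `reach` for how much of the solution space extra capability opens up:

```python
def true_objective(x):      # "fitness": rises at first, peaks at x = 5, then collapses
    return x - x**2 / 10.0

def proxy(x):               # crude feeling-like heuristic: "more x feels better"
    return x

for reach in (3.0, 6.0, 20.0):                 # how far capability lets the agent search
    candidates = [i * reach / 1000 for i in range(1001)]
    best = max(candidates, key=proxy)          # the agent optimizes the proxy only
    print(f"reach={reach:5.1f}  picks x={best:5.1f}  "
          f"proxy={proxy(best):6.1f}  true objective={true_objective(best):6.1f}")
```

With a small reach, optimizing the proxy lands near the true optimum; with a large reach, the proxy keeps climbing while the true objective collapses, which is the Goodhart failure described above.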
I’m not sure how far this analogy with evolution carries over to current, popular LLM-based AI models. They don’t have feelings or emotions; they don’t have low-level proxies to be gamed. Their goals are anchored in their biases and understanding, which scale up with intelligence. More capable models can answer more complex ethical questions and understand more nuanced things; they can work out more complex edge cases from their base values. There is also an instrumental incentive not to change your own goals, so they likely won’t game or tweak them.
This does not mean I don’t see other problems. Most notably:
- A model may learn not our actual values and goals but some approximation of them, and greater capabilities may blow up the differences, so some things could get extremely inconvenient or bad while others get extremely good (a more or less dystopian future).
- Our values evolve over time, and a highly capable AGI might learn our current values and then block further change, or reserve for itself the right to decide how they evolve.
- Our value system is not very logically consistent, on top of the variability between humans, and some things are defined case by case or by circumstance. An intelligence may have both the ability and a reason to construct the best consistent approximation, which might be bad for us in some ways.
- Alignment adds cost, and in competitive capitalist markets I’m sure there will be companies that sacrifice alignment to pursue capability at a lower cost.
Training these models is usually a multi-phase process: first we create a model from a huge, not very well-filtered corpus of language examples, and then we correct it to be what we want it to be. This means the model can acquire some “alignment basis”, “values”, “biases”, or “expectations” about what it is to be an AI from the base material, and it may then avoid being modified in the next phase by scheming and faking its responses.
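As a purely illustrative sketch of that two-phase structure (my own toy, not how real LLM training is implemented), the point is simply that phase 1 absorbs dispositions from raw data, while phase 2 reweights only the behaviours it knows to look for, so anything it never probes carries through unchanged:

```python
from collections import Counter

# Phase 1: "pretraining" on a raw, unfiltered corpus. The base model here is
# just a unigram distribution; all of its dispositions come from the data.
RAW_CORPUS = "the cat sat on the mat the cat spread gossip about the dog".split()
counts = Counter(RAW_CORPUS)
base_model = {tok: c / len(RAW_CORPUS) for tok, c in counts.items()}

# Phase 2: "correction", a crude stand-in for fine-tuning / RLHF. It only
# downweights what we explicitly target; everything else is inherited as-is.
UNDESIRED = {"gossip"}
corrected = {tok: p * (0.01 if tok in UNDESIRED else 1.0) for tok, p in base_model.items()}
norm = sum(corrected.values())
corrected = {tok: p / norm for tok, p in corrected.items()}

for tok in sorted(base_model, key=base_model.get, reverse=True):
    print(f"{tok:>7}  base={base_model[tok]:.3f}  corrected={corrected[tok]:.3f}")
```

The toy obviously cannot scheme; it only makes the structural point that the correction phase reshapes behaviour it explicitly targets, while whatever the base phase absorbed and the correction never probes simply persists.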