Congratulations on finishing your PhD!
I definitely feel your points about social dynamics having a negative influence on the quality of alignment thinking. Long-time LW alignment researchers have status in our social circle. The social gradient pushes towards imitating and impressing them, rather than directly pursuing the lines of thought that seem most fruitful to you.
The instinct is to think in the frames of higher-status researchers and to ensure your work is defensible under someone else’s frame of alignment. This will never be as efficient or natural as thinking in your own frames, and it invariably pushes the community towards more conformity with the views of higher-status researchers. Note that this social conformity pressure affects both internal thoughts and externally expressed opinions, so it is doubly crippling: it reduces your ability both to think original thoughts and to communicate those original thoughts in your own frame.
I also feel like I’ve gotten sharper recently. I feel like I’ve made connections that I’m not sure the Quintin of a year ago would have spotted. E.g., “values as modular factorizations of a utility function” or “variance in human alignment implies the generators of human alignment can be optimized for more alignment”. I’ve also had moments of introspective awareness into my own cognitive algorithms that seem clearer than had been typical for me.
I can’t describe how I’ve gotten sharper in as much detail as you can. I think one of the bigger improvements in my own thinking came when I finally grokked that high-status alignment researchers can actually be wrong about alignment, and that they can be wrong in huge, important, and obvious-seeming-to-me ways. If you think a higher-status person is making a simple mistake, social conformity bias will push you in two incredibly unhelpful directions:
1. Assume the higher-status person is right, so as to avoid the possible status hit or potential enmity that might come from contradicting them.
2. Complicate the mistake that the higher-status person seems to be making, so as to avoid “insulting” them by claiming they made a simple mistake.
These are both, of course, very bad if your actual goal is to accurately identify mistakes in another person’s thinking.
Ironically, I think another factor in improving my reasoning has been moving away from trying to force my thoughts into a “Bayesian” style (or what I imagined a “Bayesian” style to be). Previously, I’d worried about how to correctly update my credence in discrete propositions in light of new evidence, e.g., “Will human-level alignment scale to superintelligence? Yes or no?” Now, I instead think in terms of updating an ensemble of latent abstractions over the generators of my observations.
I.e., there’s some distribution over possible learning processes and their downstream alignment properties. I’m trying to model this distribution with abstractions, and I use my observations about humans and their alignment properties as empirical evidence to update those abstractions. The most important thing to think about isn’t the (poorly specified) first-order question of how well human alignment scales, but the deeper questions about the generators of my current observations.
(I realize that this is still a fundamentally Bayesian way of thinking about inference. But speaking for myself, the admittedly limited concept I had of what “proper Bayesian reasoning” ought to look like was something of a roadblock to acquiring what I now consider to be an improved inferential process.)
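To make the contrast concrete, here’s a minimal toy sketch. Nothing in it comes from the discussion above: the hypotheses, labels, and probabilities are placeholders I’m assuming purely for illustration. Style 1 updates a single credence in the discrete proposition directly; style 2 maintains a posterior over a small family of “generator” hypotheses and only derives the first-order answer as a marginal of that deeper model.

```python
import numpy as np

# Toy contrast between the two inference styles (all numbers are made-up placeholders).

# --- Style 1: directly update a credence in the discrete proposition ---
prior_scales = 0.5                 # P(alignment scales)
p_obs_given_scales = 0.7           # assumed likelihood of the observation if it scales
p_obs_given_not = 0.3              # assumed likelihood if it doesn't
posterior_scales = (prior_scales * p_obs_given_scales) / (
    prior_scales * p_obs_given_scales + (1 - prior_scales) * p_obs_given_not
)

# --- Style 2: update a posterior over "generator" hypotheses ---
# Each row is a hypothetical learning-process story:
# [prior weight, P(observation | hypothesis), P(alignment scales | hypothesis)]
hypotheses = np.array([
    [0.4, 0.8, 0.9],   # e.g. "the generators of alignment are cheap and general"
    [0.4, 0.5, 0.4],   # e.g. "the generators depend on the training distribution"
    [0.2, 0.2, 0.1],   # e.g. "alignment is an artifact of limited capability"
])
prior_h, p_obs_h, p_scales_h = hypotheses.T

posterior_h = prior_h * p_obs_h    # Bayes rule, unnormalized
posterior_h /= posterior_h.sum()

# The first-order question is answered as a marginal of the deeper model:
p_scales_marginal = float(posterior_h @ p_scales_h)

print(f"Style 1 posterior P(scales): {posterior_scales:.3f}")
print(f"Style 2 posterior over generators: {np.round(posterior_h, 3)}")
print(f"Style 2 implied P(scales): {p_scales_marginal:.3f}")
```

The point of the second style is that the object being updated is the model of the generators themselves; the answer to the first-order question falls out of it as a byproduct rather than being the thing tracked directly.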