Meta: I really like ideas and concrete steps for how to practice the skill of thinking about something. I think there are at least three methods for learning how to think productively about a particular problem:

1. Reading material written by others (not necessarily passively; this might include checking your understanding of the material you read).
2. Doing exercises, practice problems, toy projects, etc. that are focused directly on the object-level problem.
3. Doing exercises to practice the "cognitive motions" needed to think productively about a problem. I think the exercise(s) described in this post are a good example of this.
And I think the last option is often neglected (in all fields, not just AGI safety) because there's not a lot of written material on how to actually do it. Note that this is a different skill from the more general skills of learning to learn and metacognition: different kinds of technical problems can require learning different, domain-specific cognitive motions.
If you've absorbed enough of the Sequences and other rationality material through osmosis, you might be able to figure out the kinds of cognitive motions you need and how to practice and develop them on your own (or maybe you've done some of the exercises in the CFAR handbook and learned to generalize the lessons they try to teach).
But having someone more experienced write down the kinds of cognitive motions you need, along with exercises for learning and practicing them, can probably get more people up to speed much more quickly. I think posts like this are a great step in that direction.
Object-level tip for the breaker phase: thinking about how a literal human might break your alignment proposal can be a useful way to build intuitions and security mindset. A lot of real alignment schemes involve doing something with human-ish-level intelligence, and thinking about how an actual human would break or escape from such a scheme is often more natural, and less prone to veering into vague or magical thinking, than positing whatever capabilities a hypothetical superintelligent AI system might have.
If you can't figure out how an actual human could break things, you can relax the constraint a bit: ask what a human who can make 10 copies of themselves, think 10x as fast, write code with superhuman speed and accuracy, etc., could do instead.
Threat modelling is the term for this kind of thinking in the field of computer security.