I agree with everything you said. It seems we should distinguish between "cooperative" and "adversarial" safety approaches (cf. the comment above). I wrote the entire post as an extended reply to Marc Carauleanu, in response to his mixed feedback on my idea of adding "selective SSM blocks for theory of mind" to increase the Self-Other Overlap in AI architecture as a pathway to improve safety. Under the view that both Transformer and Selective SSM blocks will survive until AGI (if it is created at all, of course), and even with your qualifications added (that AutoML will try to stack these and other types of blocks in quickly evolving ways), the approach seems solid to me, but only if we also make some basic assumptions about the good faith and cooperativeness of the AutoML / auto-takeoff process. If we don't make such assumptions, then of course all bets are off: these "blocks for safety" could simply be purged from the architecture.
Yes, I strongly suspect that "adversarial" safety approaches are quite doomed. The more one thinks about them, the worse they look.
We need to figure out how to make "cooperative" approaches work reliably. In this regard, I have a feeling that the approach being developed by OpenAI, in particular, has been gradually shifting in that direction (judging, for example, by this interview with Ilya that I transcribed: Ilya Sutskever's thoughts on AI safety (July 2023): a transcript with my comments).