Supervised/Reinforcement: Proxy Problems
Another plausible approach to dealing with the proxy problems might be to do something like unsupervised clustering/learning on the representations of multiple systems (potentially including multiple humans and multiple AIs) that we'd have good reasons to believe encode the (same) relevant values, e.g. when the systems are exposed to the same stimuli. Some relevant recent proof-of-concept works: Identifying Shared Decodable Concepts in the Human Brain Using Image-Language Foundation Models; Finding Shared Decodable Concepts and their Negations in the Brain; AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space; Rosetta Neurons: Mining the Common Units in a Model Zoo; Cross-GAN Auditing: Unsupervised Identification of Attribute Level Similarities and Differences between Pretrained Generative Models; Quantifying stimulus-relevant representational drift using cross-modality contrastive learning. Automated interpretability (e.g. https://multimodal-interpretability.csail.mit.edu/maia/) could also be useful here. This approach might also work well with concepts like corrigibility/instruction following and with arguments about the 'broad basin of attraction' and convergence for corrigibility.
A few additional relevant recent papers: Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models; Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures.
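To make the clustering idea above slightly more concrete, here is a minimal toy sketch in Python, using entirely synthetic "representations". Two systems are exposed to the same stimuli, their representations are projected into a shared space with CCA, and the stimuli are then clustered in that space. CCA and KMeans are just placeholder choices for whatever cross-system alignment and unsupervised clustering methods one would actually use, and nothing here is taken from the cited papers.

```python
# Toy sketch: unsupervised discovery of shared concepts across two systems
# exposed to the same stimuli. All data here is synthetic/hypothetical.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_stimuli, n_concepts = 500, 4

# Hypothetical ground truth: each stimulus instantiates one of a few shared
# concepts that both systems are assumed to encode.
concepts = rng.integers(0, n_concepts, size=n_stimuli)
shared = np.eye(n_concepts)[concepts]

# Each system embeds the shared structure in its own way, plus private noise
# (system A could be a brain recording, system B a model's activations).
reps_a = shared @ rng.normal(size=(n_concepts, 64)) + 0.5 * rng.normal(size=(n_stimuli, 64))
reps_b = shared @ rng.normal(size=(n_concepts, 128)) + 0.5 * rng.normal(size=(n_stimuli, 128))

# Project both systems into a common space via CCA (a placeholder for any
# cross-system alignment method), then cluster stimuli in that space.
cca = CCA(n_components=n_concepts)
proj_a, proj_b = cca.fit_transform(reps_a, reps_b)
labels = KMeans(n_clusters=n_concepts, n_init=10, random_state=0).fit_predict(
    np.concatenate([proj_a, proj_b], axis=1)
)

# If the shared concepts are recoverable, clusters should line up with them.
print("cluster sizes:", np.bincount(labels))
```

In a real setup, the two representation matrices might be, e.g., fMRI responses and model activations on the same images, as in the shared-decodable-concepts works above; the hope is that clusters in the shared space correspond to the (same) relevant concepts rather than to either system's idiosyncrasies.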
Similarly, the argument in this post, and e.g. in Robust agents learn causal world models, seems to me to suggest that we should probably also expect something like universal (approximate) circuits, and it might be feasible to automate their discovery using a procedure similar to the one demonstrated in Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models.
Later edit: I also expect unsupervised clustering/learning to help here in much the same way as in the parent comment's argument about features, but applied to the feature circuits(/graphs) themselves.
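As a toy illustration of that last point, here is a hedged sketch on synthetic data only: two models' features are first matched by activation correlation on shared stimuli, and their attribution-style circuit graphs are then compared under that matching. The correlation-plus-Hungarian-matching step and the random "circuits" are assumptions made purely for illustration, not the procedure from the Sparse Feature Circuits paper.

```python
# Toy sketch: check whether two models share (approximately) the same feature
# circuit, by matching their features via activation correlation on the same
# stimuli. Synthetic data stands in for SAE feature activations and for
# attribution-based circuit edge weights.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n_stimuli, n_feats = 300, 20

# Model B's features are a hidden permutation of model A's, plus noise
# (the "universal features" case this would be testing for).
acts_a = rng.normal(size=(n_stimuli, n_feats))
perm = rng.permutation(n_feats)
acts_b = acts_a[:, perm] + 0.3 * rng.normal(size=(n_stimuli, n_feats))

# Match features across models by maximizing |activation correlation|.
corr = np.corrcoef(acts_a.T, acts_b.T)[:n_feats, n_feats:]
_, match = linear_sum_assignment(-np.abs(corr))  # match[i]: B-feature paired with A-feature i

# Toy circuit graphs: sparse edge weights between features; model B's circuit
# is model A's circuit expressed in B's own feature ordering, plus noise.
circuit_a = rng.normal(size=(n_feats, n_feats)) * (rng.random((n_feats, n_feats)) < 0.1)
circuit_b = circuit_a[np.ix_(perm, perm)] + 0.05 * rng.normal(size=(n_feats, n_feats))

# Re-express B's circuit in A's feature ordering via the matching and compare;
# high agreement would be (weak) evidence for a shared/universal circuit.
aligned_b = circuit_b[np.ix_(match, match)]
print("edge-weight correlation:", np.corrcoef(circuit_a.ravel(), aligned_b.ravel())[0, 1])
```

In practice the feature matching could itself come from unsupervised clustering of features across models (as in the parent comment), and the circuit comparison could use graph-level rather than edge-level similarity; this sketch just shows the overall shape of the check.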