Automating scalable oversight or RLHF research by quickly discovering new loss functions for training
See "Discovering Preference Optimization Algorithms with and for Large Language Models".
Automating the ability to probe AI systems for deceptive power-seeking goals, via automated discovery of low-level features using interpretability tools
See "A Multimodal Automated Interpretability Agent" and "Open Source Automated Interpretability for Sparse Autoencoder Features".