Bogdan Ionut Cirstea comments on How difficult is AI Alignment?

Bogdan Ionut Cirstea 13 Sep 2024 19:31 UTC
9 points
Automating scalable oversight or RLHF research by quickly discovering new loss functions for training
See "Discovering Preference Optimization Algorithms with and for Large Language Models".
Automating the ability to probe AI systems for deceptive power-seeking goals, via automated discovery of low-level features using interpretability tools
See "A Multimodal Automated Interpretability Agent" and "Open Source Automated Interpretability for Sparse Autoencoder Features".