AI Evaluations focus on experimentally assessing the capabilities, safety, and alignment of advanced AI systems. These evaluations can be divided into two main categories: behavioral and understanding-based.
(note: initially written by GPT4, may contain errors despite a human review. Please correct them if you see them)
Behavioral evaluations assess a model’s abilities in various tasks, such as autonomously replicating, acquiring resources, and avoiding being shut down. However, a concern with these evaluations is that they may not be sufficient to detect deceptive alignment, making it difficult to ensure that models are non-deceptive.
Understanding-based evaluations, on the other hand, evaluate a developer’s ability to understand the model they have created and why they have obtained the model. This approach can be more useful in terms of safety, as it focuses on understanding the model’s behavior instead of just checking the behavior itself. Coupling understanding-based evaluations with behavioral evaluations can lead to a more comprehensive assessment of AI safety and alignment.
Current challenges in AI evaluations include:
developing a method-agnostic standard to demonstrate sufficient understanding of a model
ensuring that the level of understanding is adequate to catch dangerous failure modes
finding the right balance between behavioral and understanding-based evaluations.
(this text was initially written by GPT4, taking in as input A very crude deception eval is already passed, ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so, and Towards understanding-based safety evaluations)