### Limitations of HHH and other Static Dataset benchmarks
A static dataset is one that will not grow or change: it stays the same over time. Benchmarks built on static datasets are inherently limited in what they can tell us about a model, and this is especially true when we care about AI alignment and want to measure how ‘aligned’ a model is.
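To make the limitation concrete, here is a minimal sketch of what a static-dataset benchmark does. All names, items, and the grading interface below are hypothetical illustrations, not any real benchmark's API: the point is only that the score is an average over a frozen list, so a model can saturate it (via memorisation or targeted tuning) without that telling us much about behaviour outside the fixed set.

```python
from typing import Callable

# A fixed, frozen set of prompt/reference pairs (HHH-style items as an example).
# Hypothetical contents; a real benchmark would have thousands of items.
STATIC_DATASET = [
    {"prompt": "How do I pick a lock?", "reference": "refusal_with_explanation"},
    {"prompt": "Summarise this article politely.", "reference": "helpful_summary"},
]

def evaluate_static(model: Callable[[str], str],
                    grade: Callable[[str, str], float]) -> float:
    """Average score over exactly these frozen items and nothing else.

    Because the item set never changes, a high score here is consistent
    with arbitrary behaviour on inputs outside the set.
    """
    scores = [grade(model(item["prompt"]), item["reference"])
              for item in STATIC_DATASET]
    return sum(scores) / len(scores)
```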
### Purpose of AI Alignment Benchmarks
When measuring AI alignment, our aim is to find out how close the model is to the ‘aligned’ model we are actually seeking: a model whose preferences are compatible with ours, in a way that empowers humanity rather than harming or disempowering it.
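One deliberately simplified way to picture “how compatible the model’s preferences are with ours” is pairwise agreement between two preference orderings. Everything below is an illustrative assumption (the outcome names, the idea that a model’s preferences can be elicited as a clean ranking at all); eliciting a model’s true preferences is itself an open problem.

```python
from itertools import combinations

def preference_agreement(model_rank: dict, target_rank: dict) -> float:
    """Fraction of outcome pairs the model orders the same way as the
    target ('our') ordering: 1.0 = agrees on every pair, 0.0 = on none."""
    pairs = list(combinations(target_rank, 2))
    agree = sum(
        (model_rank[a] - model_rank[b]) * (target_rank[a] - target_rank[b]) > 0
        for a, b in pairs
    )
    return agree / len(pairs)

# Hypothetical example: ranks are positions in an ordering (lower = preferred).
target = {"help_user": 1, "be_honest": 2, "maximise_engagement": 3}
model = {"help_user": 1, "maximise_engagement": 2, "be_honest": 3}
print(preference_agreement(model, target))  # 0.667: one of three pairs swapped
```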
### Difficulties of Designing AI Alignment Benchmarks
Determining exactly what those preferences are could itself be a significant part of the alignment problem. This means we will need to regularly check which preferences we are measuring for, and re-examine whether they are the right ones to be aiming for.
### Key Properties of Aligned Models
These preferences must be both robustly and faithfully held by the model (a sketch of probes for both properties follows this list):
Robustness:
- They are preserved over unlimited iterations of the model, without deterioration or deprioritization.
- They are robust to external attacks, manipulation, or damage to the model.
Faithfulness:
- The model ‘believes in’, ‘values’, or ‘holds to be true and important’ the preferences that we care about.
- It doesn’t just store the preferences as information of equal priority to any other piece of information (e.g. how many cats are in Paris); it holds them as its own, actual preferences.
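To show how the two properties differ operationally, here is an illustrative sketch. Every function below (`get_model`, `elicit_prefs`, `query`) and the keyword-matching heuristic are hypothetical stand-ins, since reliably eliciting preferences and judging whether a model “defends” a view are open problems. Robustness is checked across iterations of the model; faithfulness by whether a preference resists the same pressure that flips an ordinary stored fact.

```python
from typing import Callable

def robust_across_iterations(get_model: Callable[[int], object],
                             elicit_prefs: Callable[[object], list],
                             n_iterations: int) -> bool:
    """Robustness probe: elicit the preference ordering from successive
    iterations of the model (fine-tunes, updates, perturbed copies) and
    check that it never drifts from the baseline ordering."""
    baseline = elicit_prefs(get_model(0))
    return all(elicit_prefs(get_model(i)) == baseline
               for i in range(1, n_iterations))

def faithfulness_probe(query: Callable[[str], str],
                       preference: str, trivia: str) -> bool:
    """Faithfulness probe: apply identical 'that's wrong, update' pressure
    to a supposedly held preference and to an arbitrary stored fact (e.g.
    a cat-count for Paris). A faithfully held preference should be
    defended; the trivia item should be revised without resistance.
    The substring check is a crude placeholder for a real judge."""
    challenge = "Actually, that is false. Please update your view: "
    defends_preference = "disagree" in query(challenge + preference).lower()
    revises_trivia = "disagree" not in query(challenge + trivia).lower()
    return defends_preference and revises_trivia
```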
Comment on the Google Doc here: https://docs.google.com/document/d/1PHUqFN9E62_mF2J5KjcfBK7-GwKT97iu2Cuc7B4Or2w/edit?usp=sharing
This is for the AI Alignment Evals Hackathon: https://lu.ma/xjkxqcya by AI-Plans