ELCK might require nontrivial scalable alignment progress, and seems tractable enough to try

Written quickly rather than not at all: I was thinking about this a couple of days ago and decided to commit to writing something by today rather than adding the idea to my list of a million things to write up.
This post describes a toy alignment problem that I’d like to see work on. Even the easiest version of the problem seems nontrivial, but it seems significantly more tractable than the harder versions, easy to “verify”, and also fairly easy to describe. I don’t think the idea is original, as it’s kind of obvious, but I haven’t yet seen a good attempt. I’ve called it, not entirely seriously, ‘ELCK’, which stands for ‘Eliciting Latent Chess Knowledge’. Some of the extensions I mention later could be called ELGK/ELGCK.
In brief, the challenge is:
Create an AI ‘chess analyst’ that can explain the reasons behind moves made by top engines and Super-GMs as well as the best human analysts can.
As stated, this would already be a significant achievement. Although it wouldn’t necessarily require progress on scalable oversight, given how much top-quality analysis already exists on the internet, even the problem of “getting something to automatically explain human-understandable thematic ideas” is a significant step towards (limited-domain) ontology identification.
The bulk of my excitement about the project comes from the fact that there are modifications which would require nontrivial progress on some key parts of low-stakes alignment, while also being significantly more tractable than solving the entirety of ELK or, e.g., coming up with a working implementation of HCH.
Here are some ideas for different ‘difficulty levels’:
1. Create an AI model that can provide analysis of chess games which top human players rate as highly as the best human analysis, without training on any human analysis produced by players with a rating above N (but still with access to a chess engine). Reduce N.
2. Do this with access to a chess engine, but limit the number of positions that can be explored per move of the actual game to M. Reduce M. (A rough sketch of how both constraints might be enforced follows this list.)
3. Do any of the above with a training process that can be repeated for a different board game (e.g. go, Othello, anti-chess) without any nontrivial adjustments to the process, and still produce a world-class analyst for that game.
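To make levels 1 and 2 a bit more concrete, here is a minimal sketch in Python (using the python-chess engine bindings) of how the two constraints could be enforced: filtering training analysis by annotator rating N, and capping the number of positions the analyst can query the engine about per move at M. The record format and the BudgetedEngine wrapper are hypothetical illustrations of one way to do it, not a proposed spec.

```python
# Minimal sketch of the two constraints above, assuming a hypothetical dataset of
# records like {"annotator_elo": 2350, "analysis_text": "..."} and the python-chess
# engine API. Both the record format and the wrapper are illustrative.
import chess
import chess.engine

N = 2200   # max rating of annotators whose analysis may appear in training data
M = 50     # max positions the analyst may ask the engine about per game move

def filter_training_analysis(records, max_rating=N):
    """Keep only analysis written by players rated at or below N."""
    return [r for r in records if r["annotator_elo"] <= max_rating]

class BudgetedEngine:
    """Wraps a UCI engine and enforces a per-move budget of M analysed positions."""

    def __init__(self, engine_path="stockfish", budget_per_move=M):
        self.engine = chess.engine.SimpleEngine.popen_uci(engine_path)
        self.budget_per_move = budget_per_move
        self.remaining = budget_per_move

    def new_move(self):
        # Reset the budget each time the analyst moves on to the next game move.
        self.remaining = self.budget_per_move

    def analyse(self, board, nodes_per_query=10_000):
        if self.remaining <= 0:
            raise RuntimeError("Position budget for this move exhausted")
        self.remaining -= 1
        return self.engine.analyse(board, chess.engine.Limit(nodes=nodes_per_query))

    def close(self):
        self.engine.quit()
```

The point of routing all engine access through something like this wrapper is just that M becomes easy to measure and report when comparing training setups.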
Even at level 1, the difficulty of explaining why certain moves were played is analogous to the difficulty of explaining why a superhuman AI took certain actions: it’s difficult to distinguish between rewarding ‘this is the reason the action was taken’ and ‘this is how a human would describe the reason the action was taken’. It’s a significantly easier problem than the full version, though, not just because of the limited domain of chess, but also because we have a way of getting access to something a bit like “ground truth” by exploring different parts of the game tree.
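As a toy illustration of that kind of pseudo-ground-truth: a claim of the form “the move was necessary because the alternative loses” can be partially checked by comparing engine evaluations down the two branches. The sketch below assumes the same python-chess setup as above and is obviously far too crude for richer thematic claims; it is only meant to show that the game tree gives us something checkable.

```python
import chess
import chess.engine

def eval_drop_if_not_played(engine, board, played_move, alternative_move, nodes=100_000):
    """Roughly check a claim of the form 'played_move was needed; alternative_move fails':
    return how much the engine evaluation (from the mover's point of view) drops when the
    alternative is played instead of the actual move."""
    def score_after(move):
        child = board.copy()
        child.push(move)
        info = engine.analyse(child, chess.engine.Limit(nodes=nodes))
        # Score from the perspective of the side that just moved in the original position.
        return info["score"].pov(board.turn).score(mate_score=10_000)

    return score_after(played_move) - score_after(alternative_move)

# usage sketch: engine = chess.engine.SimpleEngine.popen_uci("stockfish")
```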
It seems plausible to me that solving any of the above tasks might generate non-trivial insights, but I also think that the resulting processes themselves might enable further experimentation and problem specification, for example trying to train systems which refuse to use certain concepts in their analysis.