Great report — I found the argument that ELK is a core challenge for alignment quite intuitive/compelling.
To build more intuition for what a solution to ELK would look like, I’d find it useful to talk about current-day settings where we could attempt to empirically tackle ELK. AlphaZero seems like a good example of a superhuman ML model where there’s significant interest (and some initial work: https://arxiv.org/abs/2111.09259) in understanding its inner reasoning. Some AlphaZero-oriented questions that occurred to me:
Suppose we train an augmented version of AZ (call it AZELK), with reasonable extra resources proportional to the training cost of AZ, that can explain its reasoning for choosing a particular move, or assigning a particular value to a board state. Would this represent significant progress towards the general ELK problem you propose?
AZELK seems to have similar issues to the ones described for SmartVault — e.g. preferring to give simple explanations if they satisfy the human user. Is there any particular issue presented by SmartVault that AZELK wouldn’t capture?
How should AZELK behave in situations where its internal concepts are totally foreign to the human user? For example, I know next to nothing about Go and chess, so even if the model is reasoning about standard things like openings or pawn structure, it would need to explain those to me. Should it offer to explain them to me? This is referred to in the report as “doing science” / improving human understanding, but I’m having trouble imagining what the alternative is for AZELK.
I could make the problem of training AZELK artificially more difficult by not allowing the use of human explanations of games, and only allowing interaction with non-experts. Does this seem like a useful restriction?
Another instance of AZELK I could imagine being interesting is the problem of uncovering a sabotaged AZ. Perhaps the model was trained to make incorrect moves in certain circumstances, or its reward was subtly mis-specified. Does this seem like a realistic problem for ELK to help with? (Maybe it’s useful to assume we only have access to the policy, rather than the value function.)
A separate question that’s a bit further afield: is it useful to think about eliciting latent knowledge from a human? For example, I might imagine sitting down with a Go expert (perhaps entirely self-taught, so they don’t have much experience explaining to other humans), playing some games with them and trying to understand why they’re making certain decisions. Is there any aspect of the ELK problem that this scenario does/doesn’t capture?
I think AZELK is a fine model for many parts of ELK. The baseline approach is to jointly train a system to play Go and answer questions about board states, using human answers (or human feedback). The goal is to get the system to answer questions correctly if it knows the answer, even if humans wouldn’t be able to evaluate that answer.
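To make the baseline concrete, here’s a minimal sketch of what jointly training a value head and a question-answering head on human answers could look like. Everything in it (the toy network, the board encoding, the tokenized “answers”) is a hypothetical placeholder for illustration, not anything specified in the report.

```python
# Minimal sketch of the baseline: a single network with an AlphaZero-style value
# head plus a question-answering head supervised on human answers about board
# states. The architecture, board encoding, and "answers" are toy placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD_SIZE, VOCAB, ANSWER_LEN = 19, 1000, 16

class AZELK(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * BOARD_SIZE * BOARD_SIZE, hidden),
            nn.ReLU(),
        )
        self.value_head = nn.Linear(hidden, 1)                 # AZ-style value estimate
        self.qa_head = nn.Linear(hidden, ANSWER_LEN * VOCAB)   # toy answer-token logits

    def forward(self, boards):
        h = self.trunk(boards)
        value = torch.tanh(self.value_head(h)).squeeze(-1)
        answer_logits = self.qa_head(h).view(-1, ANSWER_LEN, VOCAB)
        return value, answer_logits

model = AZELK()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Toy batch: board planes, game outcomes in [-1, 1], and tokenized human answers.
boards = torch.randn(8, 3, BOARD_SIZE, BOARD_SIZE)
outcomes = torch.rand(8) * 2 - 1
human_answers = torch.randint(0, VOCAB, (8, ANSWER_LEN))

value, answer_logits = model(boards)
loss = F.mse_loss(value, outcomes) + F.cross_entropy(
    answer_logits.reshape(-1, VOCAB), human_answers.reshape(-1)
)
opt.zero_grad()
loss.backward()
opt.step()
```

The worry, of course, is the one from the report: an objective like this only rewards answers that humans rate well, not answers that reflect what the model actually knows.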
Some thoughts on this setup:
I’m very interested in empirical tests of the baseline and simple modifications (see this post). The ELK writeup is mostly focused on what to do in cases where the baseline fails, but it would be great to (i) check whether that actually happens and (ii) have an empirical model of a hard situation so that we can do applied research rather than just theory.
There is some subtlety where AZ invokes the policy/value a bunch of times in order to make a single move. I don’t think this is a fundamental complication, so from here on out I’ll just talk about ELK for a single value function invocation. I don’t think the problem is very interesting unless the AZ value function itself is much stronger than your humans.
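As a toy illustration of what I mean by a single invocation, here is a schematic stand-in for the real search (with a random rollout policy and a dummy value function; none of the details are meant to match AZ):

```python
# Schematic of an AZ-style move choice: the value function is called once per
# simulated position, so a single move involves many invocations. ELK, as
# discussed here, targets what the network knows at one such call.
import random

def value_fn(position):
    # Stand-in for the trained value network; the interesting ELK questions are
    # about what it "knows" when producing this single number.
    return random.uniform(-1, 1)

def choose_move(root, legal_moves, simulations=100):
    totals = {m: 0.0 for m in legal_moves}
    visits = {m: 0 for m in legal_moves}
    for _ in range(simulations):
        move = random.choice(legal_moves)   # a real search would use PUCT, not random
        leaf = (root, move)                 # placeholder for the resulting position
        totals[move] += value_fn(leaf)      # one value-function invocation
        visits[move] += 1
    return max(legal_moves, key=lambda m: totals[m] / max(visits[m], 1))

print(choose_move("empty board", ["D4", "Q16", "K10"]))
```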
Many questions about Go can be easily answered with a lot of compute, and for many of these questions there is a plausible straightforward approach based on debate/amplification. I think this is also interesting to do experiments with, but I’m most worried about the cases where this is not possible (e.g. the ontology identification case, which probably arises in Go but is a bit more subtle).
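For the cases that are settled by compute, the rough shape of an amplification-style approach looks something like the sketch below: recursively break a question into subquestions a much weaker evaluator can handle, then aggregate. Every function here is a placeholder; it’s meant only to show the recursive structure, not how a real system would decompose Go questions.

```python
# Toy sketch of amplification: answer a hard question by decomposing it into
# subquestions a weak evaluator can handle. Placeholders only; nothing here is
# a real decomposition of Go questions.

def weak_answer(question):
    """Stand-in for a Go amateur (or a cheap model) answering an easy question."""
    return f"weak answer to {question!r}"

def decompose(question):
    """Stand-in for a model proposing subquestions whose answers settle the parent."""
    return [f"{question} / subquestion {i}" for i in (1, 2)]

def amplify(question, depth=2):
    if depth == 0:
        return weak_answer(question)
    sub_answers = [amplify(q, depth - 1) for q in decompose(question)]
    # Combine the subanswers into an answer to the parent question (placeholder).
    return f"answer to {question!r} from {len(sub_answers)} subanswers"

print(amplify("is the lower-left white group alive?"))
```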
If a human doesn’t know anything about Go, then AZ may simply not have any latent knowledge that is meaningful to them. In that case we aren’t expecting/requiring ELK to do anything at all. So we’d like to focus on cases where the human does understand concepts that they can ask hard questions about. (And ideally they’d have a rich web of concepts so that the question feels analogous to the real-world case, but I think it’s interesting as long as they have anything.) We never expect it to walk us through pedagogy, and we’re trying to define a utility function that also doesn’t require pedagogy in the real world, i.e. that is defined in terms of familiar concepts. I think it would make sense to study how to get AZ to explain basic principles of Go to someone who lacks any relevant concepts, but I don’t think it would be analogous to ELK in particular.
I think it’s important that you have access to human explanations, or answers to questions, or discussions about what concepts mean. This is the only way you’re anchoring the meaning of terms, and it’s generally important for most of the approaches. This is a central part of why we’re only aiming at training the system to talk about human concepts.
I think it’s important that AZELK is trained by humans who are much worse at Go than AZ. Otherwise it doesn’t seem helpfully analogous to long-run problems. And I don’t see much upside to doing such a project with experts rather than amateurs. I think that most likely you’d want to do it with Go amateurs (e.g. 10k or even weaker). It’s possible that you need fairly weak humans before AZ actually has intuitions that the human couldn’t arbitrate a debate about, but that would already be interesting to learn, so I wouldn’t stress about it at first (and I would consider debate and amplification as “in bounds” until we could find some hard case where they failed; initial steps might not be analogous to the hardest parts of ELK, but that’s fine).
I don’t expect AZELK to ever talk about why it chose a move or “what it’s thinking” or so on—just to explain what it knows about the state of the board (and the states of the board it considered in its search and so on). I don’t think it would be possible to detect a sabotaged version of the model.
You could imagine eliciting knowledge from a human expert. I think that most of the mechanisms would amount to clever incentives for compensating them. Again, I don’t think the interesting part is understanding why they are making moves per se, it’s just getting them to explain important facts about particular board states that you couldn’t have figured out on your own. I think that many possible approaches to ELK won’t be applicable to humans, e.g. you can’t do regularization based on the structure of the model. Basically all you can do are behavioral incentives + applying time pressure, and that doesn’t look like enough to solve the problem.
I think it’s also reasonable to talk about ELK in various synthetic settings, or in the case of generative modeling (probably in domains where humans have a weak understanding). Board games seem useful because your AI can so easily be superhuman, but they can have problems because there isn’t necessarily that much latent structure.