I think AZELK is a fine model for many parts of ELK. The baseline approach is to jointly train a system to play Go and answer questions about board states, using human answers (or human feedback). The goal is to get the system to answer questions correctly if it knows the answer, even if humans wouldn’t be able to evaluate that answer.
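To make the baseline concrete, here is a minimal sketch of the joint training objective, assuming a PyTorch-style AlphaZero network with a question-answering (“reporter”) head bolted onto the shared trunk. The class names, the input encoding, and the fixed yes/no question set are my own illustrative assumptions, not details from the setup above.

```python
# Minimal sketch of the baseline: train Go play and question answering jointly,
# with human answers as the only supervision for the reporter head.
# Shapes, names, and the fixed question set are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD_PLANES, BOARD_SIZE, N_QUESTIONS = 17, 19, 64

class AZWithReporter(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        # Shared trunk that both Go play and question answering read from.
        self.trunk = nn.Sequential(
            nn.Conv2d(BOARD_PLANES, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.policy_head = nn.Linear(hidden, BOARD_SIZE * BOARD_SIZE + 1)
        self.value_head = nn.Linear(hidden, 1)
        # "Reporter": answers a fixed set of yes/no questions about the board.
        self.reporter = nn.Linear(hidden + N_QUESTIONS, 1)

    def forward(self, board, question_onehot=None):
        z = self.trunk(board)
        policy_logits = self.policy_head(z)
        value = torch.tanh(self.value_head(z))
        answer_logit = None
        if question_onehot is not None:
            answer_logit = self.reporter(torch.cat([z, question_onehot], dim=-1))
        return policy_logits, value, answer_logit

def joint_loss(model, board, mcts_policy, outcome, question, human_answer):
    """Train play and question answering together; the reporter is supervised
    only by human answers (or human feedback)."""
    policy_logits, value, answer_logit = model(board, question)
    play_loss = (
        F.cross_entropy(policy_logits, mcts_policy)   # match the search policy
        + F.mse_loss(value.squeeze(-1), outcome)      # predict the game outcome
    )
    report_loss = F.binary_cross_entropy_with_logits(
        answer_logit.squeeze(-1), human_answer        # match human labels
    )
    return play_loss + report_loss
```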
Some thoughts on this setup:
I’m very interested in empirical tests of the baseline and simple modifications (see this post). The ELK writeup is mostly focused on what to do in cases where the baseline fails, but it would be great to (i) check whether that actually happens and (ii) have an empirical model of a hard situation so that we can do applied research rather than just theory.
There is some subtlety where AZ invokes the policy/value a bunch of times in order to make a single move. I don’t think this is a fundamental complication, so from here on out I’ll just talk about ELK for a single value function invocation. I don’t think the problem is very interesting unless the AZ value function itself is much stronger than your humans.
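To pin down what I mean by a single value function invocation, here is a sketch (reusing the assumed AZWithReporter model from the sketch above): the reporter is asked about one forward pass of the trunk, the same pass that produced one value estimate, rather than about the move-level search as a whole. Shapes and names are again illustrative assumptions.

```python
# Sketch: the ELK question is about one value-function invocation, i.e. one
# forward pass on one board position, not about the whole MCTS move.
import torch

def evaluate_and_report(model, board, question_onehots):
    """One value-network invocation on `board` (batch of 1), plus the
    reporter's answers about that same invocation."""
    model.eval()
    with torch.no_grad():
        z = model.trunk(board)                      # a single trunk invocation
        value = torch.tanh(model.value_head(z))     # the estimate search would use
        answers = [
            torch.sigmoid(
                model.reporter(torch.cat([z, q.unsqueeze(0)], dim=-1))
            ).item()
            for q in question_onehots
        ]
    return value.item(), answers

# During a real AZ move, the search would make hundreds of such invocations at
# different leaf positions; each one is a separate object we could ask about.
```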
Many questions about Go can be easily answered with a lot of compute, and for many of these questions there is a plausible straightforward approach based on debate/amplification. I think this is also interesting to do experiments with, but I’m most worried about the cases where this is not possible (e.g. the ontology identification case, which probably arises in Go but is a bit more subtle).
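As a cartoon of the “answerable with a lot of compute” case, consider a question like “can the side to move force capture of this group within k plies?”. It reduces to a bounded AND-OR search, and in a debate the honest debater just exhibits the forcing line while a weak judge checks individual steps. The GoState interface below (legal_moves, play, captures_group) is a hypothetical stand-in, not a real library.

```python
# Bounded AND-OR search for "capture is forced within `plies` plies".
# GoState is a hypothetical interface, used only to illustrate the structure
# that a debate/amplification scheme could walk a weak judge through.
def forced_capture_within(state, group, plies):
    """True if the side to move can force capture of `group` within `plies` plies."""
    if plies <= 0:
        return False
    for move in state.legal_moves():
        nxt = state.play(move)
        if nxt.captures_group(group):        # immediate capture
            return True
        if plies >= 2 and all(               # every defence still loses the group
            forced_capture_within(nxt.play(reply), group, plies - 2)
            for reply in nxt.legal_moves()
        ):
            return True
    return False
```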
If a human doesn’t know anything about Go, then AZ may simply not have any latent knowledge that is meaningful to them. In that case we aren’t expecting/requiring ELK to do anything at all. So we’d like to focus on cases where the human does understand concepts that they can ask hard questions about. (And ideally they’d have a rich web of concepts so that the question feels analogous to the real world case, but I think it’s interesting as long as they have anything.) We never expect ELK to walk us through pedagogy, and we’re trying to define a utility function that also doesn’t require pedagogy in the real world, i.e. one that is defined in terms of familiar concepts. I think it would make sense to study how to get AZ to explain basic principles of Go to someone who lacks any relevant concepts, but I don’t think it would be analogous to ELK in particular.
I think it’s important that you have access to human explanations, or answers to questions, or discussions about what concepts mean. This is the only thing anchoring the meaning of terms, and it’s generally important for most of the approaches. This is a central part of why we’re only aiming at training the system to talk about human concepts.
I think it’s important that AZELK is trained by humans who are much worse at Go than AZ. Otherwise it doesn’t seem helpfully analogous to long-run problems. And I don’t see much upside to doing such a project with experts rather than amateurs. I think that most likely you’d want to do it with Go amateurs (e.g. 10k or even weaker). It’s possible that you need fairly weak humans before AZ actually has intuitions that the human couldn’t arbitrate a debate about, but that would already be interesting to learn, so I wouldn’t stress about it at first (and I would consider debate and amplification as “in bounds” until we could find some hard case where they failed; initial steps might not be analogous to the hardest parts of ELK, but that’s fine).
I don’t expect AZELK to ever talk about why it chose a move or “what it’s thinking”; I just expect it to explain what it knows about the state of the board (and about the states of the board it considered in its search, and so on). I don’t think it would be possible to detect a sabotaged version of the model.
You could imagine eliciting knowledge from a human expert. I think that most of the mechanisms would amount to clever incentive schemes for compensating them. Again, I don’t think the interesting part is understanding why they are making moves per se; it’s just getting them to explain important facts about particular board states that you couldn’t have figured out on your own. I think that many possible approaches to ELK won’t be applicable to humans, e.g. you can’t do regularization based on the structure of the model. Basically all you can do are behavioral incentives + applying time pressure, and that doesn’t look like enough to solve the problem.
I think it’s also reasonable to talk about ELK in various synthetic settings, or in the case of generative modeling (probably in domains where humans have a weak understanding). Board games seem useful because your AI can so easily be superhuman, but they can have problems because there isn’t necessarily that much latent structure.