One possible concern with all of these approaches, however, is that they might require the base optimizer to be better at searching for problems than the mesa-optimizer is at hiding them. Furthermore, in order to create a base optimizer that can do this, we may need to first train it using a third optimizer, which might result in the same problem at this second layer.
Isn’t IDA meant to be a solution to this problem? Do you discuss IDA anywhere, maybe in the last post?
IDA is definitely a good candidate to solve problems of this form. I think IDA’s best properties are primarily outer alignment properties, but it does also have some good properties with respect to inner alignment, such as allowing you to bootstrap an informed adversary by giving it access to your question-answer system as you’re training it. That being said, I suspect you could do something similar under a wide variety of different systems; bootstrapping an informed adversary is not necessarily unique to IDA. Unfortunately, we don’t discuss IDA much in the last post, though thinking about mesa-optimizers in IDA (and in other proposals, e.g. debate) is imo a very important goal, and our hope is to at the very least provide the tools so that we can then go and start answering questions of that form.
Thanks! In addition to wanting your take on IDA as a potential solution to inner alignment, I also brought up my question because that passage seemed like a natural place to mention/cite IDA and related ideas, and by not doing so you could give the mistaken impression that nobody has proposed a candidate solution good enough to be worth mentioning. But it might be fine if you do at least mention it in the conclusions or somewhere else.
I just added a footnote mentioning IDA to this section of the paper, though I’m leaving it as is in the sequence to avoid messing up the bibliography numbering.