Glad you liked the post! Hopefully it’ll be helpful for your discussion, though unfortunately the timing doesn’t really work out for me to be able to attend. However, I’d be happy to talk to you or any of the other attendees some other time—I can be reached at evanjhub@gmail.com if you or any of the other attendees want to reach out and schedule a time to chat.
In terms of open problems, part of my rationale for writing up this post is that I feel like we as a community still haven’t really explored the full space of possible prosaic AI alignment approaches. Thus, I feel like one of the most exciting open problems would be developing new approaches that could be added to this list (like this one, for example). Another open problem is improving our understanding of transparency and interpretability—one thing you might notice with all of these approaches is that they all require at least some degree of interpretability to enable inner alignment to work. I’d also be remiss not to mention that if you’re interested in concrete ML experiments, I’ve previously written up a couple of different posts detailing experiments I’d be excited about.
I will keep track of all the questions during our discussion, and if there is anything that makes sense to send over to you, I will do so or invite the attendees to.
I feel like we as a community still haven’t really explored the full space of possible prosaic AI alignment approaches
I agree, and I have mixed feelings about the current trend of converging towards somewhat equivalent approaches that all contain a flavour of recursive supervision (at least 8 of your 11). On one hand, the fact that many attempts point in a similar direction is a good indication of that direction’s potential. On the other hand, its likelihood of succeeding may be lower than that of a portfolio approach, which seems like what the community was originally aiming for. However, I (and I suspect most junior researchers too) don’t have a strong intuition about which very different directions might be promising. Perhaps one possibility would be to not completely abandon modelling humans. While it is undoubtedly hard, it may be worth exploring this possibility from an ML perspective as well, since others are still working on it from a theoretical perspective. It may be that, granted some breakthroughs in neuroscience, it could be less hard than what we anticipate.
Another open problem is improving our understanding of transparency and interpretability
Also agree. In fact, I find it a bit vague whenever you refer to “transparency tools” in the post. However, if we aim for some kind of guarantees, this problem may either involve modelling humans or loop back to the main alignment problem, in the sense that specifying what counts as success for a transparency tool is itself prone to specification error and outer/inner alignment problems. I’m not sure my point here is clear, but it is something I am interested in pondering.
Thanks for all the post pointers. I will have an in-depth read.