I will keep track of all the questions during our discussion, and if anything makes sense to send over to you, I will do so or invite the attendees to.
I feel like we as a community still haven’t really explored the full space of possible prosaic AI alignment approaches
I agree, and I have mixed feelings about the current trend of converging towards roughly equivalent approaches, all containing a flavour of recursive supervision (at least 8 of your 11). On one hand, the fact that many attempts point in a similar direction is a good indication of the potential of that direction. On the other hand, its likelihood of succeeding may be lower than that of a portfolio approach, which seemed to be what the community was originally aiming for. However, I (and I suspect most junior researchers too) don’t have a strong intuition about which very different directions might be promising. One possibility would be to not completely abandon modelling humans. While it is undoubtedly hard, it may be worth exploring from an ML perspective as well, since others are still working on it from a theoretical perspective. Granted some breakthroughs in neuroscience, it could turn out to be less hard than we anticipate.
Another open problem is improving our understanding of transparency and interpretability
Also agreed. In fact, I find it a bit vague whenever you refer to “transparency tools” in the post. However, if we aim for some kind of guarantee, this problem may either involve modelling humans or loop back to the main alignment problem, in the sense that specifying what it means for a transparency tool to succeed is itself prone to specification error and outer/inner alignment problems. I am not sure my point here is clear, but it is something I am interested in pondering.
Thanks for all the post pointers; I will give them an in-depth read.