This post seemed overconfident in a number of places, so I was quickly pushing back in those places.
I also think the conclusion of “Nearly No Data” is pretty overstated. I think it should be possible to obtain significant data relevant to AGI alignment with current AIs (though various interpretations of current evidence can still be wrong and the best way to obtain data might look more like running careful model organism experiments than observing properties of ChatGPT). But, it didn’t seem like I would be able to quickly argue against this overall conclusion in a cohesive way, so I decided to just push back on small separable claims which are part of the reason why I think current systems provide some data.
If this post argued “the fact that current chat bots trained normally don’t seem to exhibit catastrophic misalignment isn’t much evidence about catastrophic misalignment in more powerful systems”, then I wouldn’t think this was overstated (though this also wouldn’t be very original). But, it makes stronger claims which seem false to me.
Mm, I concede that this might not have been the most accurate title. I might’ve let the desire for hot-take clickbait titles get the better of me some. But I still mostly stand by it.
My core point is something like “the algorithms that the current SOTA AIs execute during their forward passes do not necessarily capture all the core dynamics that would happen within an AGI’s cognition, so extrapolating the limitations of their cognition to AGI is a bold claim we have little evidence for”.
I agree that the current training setups shed some light on how, e.g., optimization pressures / reinforcement schedules / SGD biases work, and I even think shard theory totally applies to general intelligences like AGIs and humans. I just think that theory is AGI-incomplete.
OK, that seems reasonable to me.