The key question is 90% recovered relative to what. If you recover 90% of the loss relative to a 0 ablation baseline (that ablates the entire residual stream midway though the model!), that isn’t clearly that much.
E.g., if full zero ablation is 13 CE loss (seems plausible) and the SAE gets you to 3 CE while the original model was at 2 CE, this is 90%, but you have also massively degraded performance in terms of effective training compute.
The key question is 90% recovered relative to what. If you recover 90% of the loss relative to a 0 ablation baseline (that ablates the entire residual stream midway though the model!), that isn’t clearly that much.
E.g., if full zero ablation is 13 CE loss (seems plausible) and the SAE gets you to 3 CE while the original model was at 2 CE, this is 90%, but you have also massively degraded performance in terms of effective training compute.
IDK about literal 100%.