It seems way more fruitful to do science on it. Check whether current interpretability methods still work, look for evidence of internal planning and deception, start running sandwiching experiments, try to remove capabilities from it, etc.