I also think 70% on MMLU is extremely low, since that’s about the level of GPT-3.5, and that system is very far from posing a risk of catastrophe.
Very far in qualitative capability or very far in effective FLOP?
I agree on the qualitative capability, but disagree on the effective FLOP.
It seems quite plausible (say 5%) that models with only 1,000x more training compute than GPT-3.5 pose a risk of catastrophe. This would be GPT-5.
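As a rough sanity check on that arithmetic (a sketch, not a claim from the original comment: the GPT-3 training-compute figure is the published estimate, while treating GPT-3.5 as the same order of magnitude and assuming ~100x compute per GPT generation are both assumptions):

$$\underbrace{3\times10^{23}\ \text{FLOP}}_{\approx\ \text{GPT-3, and by assumption GPT-3.5}} \times\ 10^{3} \ \approx\ 3\times10^{26}\ \text{FLOP}, \qquad 100^{1.5} = 10^{3}.$$

That is, 1,000x is about one and a half GPT generations (3.5 → 4 → 5) at ~100x per generation, which is why that compute level maps onto “GPT-5” here.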