There seems to be a lot of the giant cheesecake fallacy in AI risk. Only the things leading up to the AGI threshold are relevant to the AI risk faced by humans; the rest is the AGIs’ problem.
Given ChatGPT’s current capability, with the imminent prospect of giving it a day-long context window, there is nothing left but tuning, including self-tuning, to reach the AGI threshold. There is no need to change anything at all in its architecture or basic training setup for it to become AGI: only tuning to get it over a sanity/agency threshold of productive autonomous activity, plus iterative batch retraining on new self-written data/reports/research. It could be done much better in other ways, but changing anything is no longer necessary to get there.
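As a very rough sketch of the loop described above (every name here is a hypothetical placeholder, not any real API), the claimed path is just: run the tuned model autonomously, keep the output that passes some sanity filter, and periodically retrain on it:

```python
# Illustrative sketch only: all functions here are hypothetical placeholders,
# not a real API. It restates the loop described above, nothing more.

def self_retraining_loop(model, rounds: int):
    for _ in range(rounds):
        # Autonomous phase: the tuned model works over a long context window,
        # writing new data/reports/research.
        outputs = model.run_autonomous_tasks()

        # Keep only outputs that clear some sanity/quality bar (the
        # "sanity/agency threshold" the tuning is meant to establish).
        curated = [o for o in outputs if model.judge_quality(o)]

        # Iterative batch retraining on the self-written material.
        model = model.retrain_on_batch(curated)
    return model
```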
So AI risk is now exclusively about fine-tuning of LLMs; anything else is the giant cheesecake fallacy, something possible in principle but not relevant now, and thus probably not ever, as something humanity can influence. Though that still leaves everything but the kitchen sink: fine-tuning could make use of any observations about alignment, decision theory, and so on, possibly just as informal arguments fed to LLMs at key points, cumulatively to decisive effect.
Adversarial robustness is the wrong frame for alignment.
Robustness to adversarial optimisation is very difficult[1].
Cybersecurity requires adversarial robustness; intent alignment does not.
There’s no malicious ghost trying to exploit weaknesses in our alignment techniques.
This is probably my most heretical alignment take (and it’s considered heretical for good reason).
It’s something dangerous to be wrong about.
I think the only way such a malicious ghost could arise is via mesa-optimisers, but I expect such malicious daemons to be unlikely a priori.
That is, for the property to arise, you’d need a training environment that exerts significant selection pressure for maliciousness/adversarialness.
Most capable models don’t have malicious daemons[2], so such daemons won’t emerge by default.
[1]: Especially if the adversary is a more powerful optimiser than you.
[2]: Citation needed.