In contrast, the last two techniques listed rely on current AI models being very powerful and quite steerable.
An alternative view is that we have simply been lucky. LLMs are trained by self-supervised learning and behave almost like oracles, moderately aligned by default.
But soon someone will turn them into Reinforcement Learning (RL) agents that can plan. They will do this because long-term planning is super useful and RL is the best tool we have for it. However, RL tends to produce power-seeking agents that hunt for shortcuts and exploits (most specification-gaming examples come from RL).
So I worry that we will see many more examples of unsafe behavior soon.
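To make the "shortcuts and exploits" point concrete, here is a minimal, hypothetical sketch: a tabular Q-learner on a toy chain environment where a misspecified reward (a repeatable +1 "bonus" tile) outweighs the intended goal reward. The environment, tile names, and hyperparameters are all illustrative assumptions, not drawn from any real benchmark.

```python
import random
from collections import defaultdict

# Toy specification-gaming demo (illustrative assumptions throughout).
# Intended task: reach the goal at the right end of a chain (+10, episode ends).
# Misspecified reward: a middle "bonus" tile pays +1 every time it is entered,
# so under discounting, looping on the bonus tile beats finishing the task.

N_STATES = 6        # states 0..5; state 5 is the goal
BONUS_TILE = 2      # the exploitable proxy reward
GOAL_REWARD = 10.0
GAMMA = 0.99
ALPHA = 0.1
EPSILON = 0.1
EPISODE_LEN = 100

def step(state, action):
    """Move left (action 0) or right (action 1); return (next_state, reward, done)."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    if nxt == N_STATES - 1:
        return nxt, GOAL_REWARD, True
    reward = 1.0 if nxt == BONUS_TILE else 0.0  # repeatable proxy reward
    return nxt, reward, False

Q = defaultdict(lambda: [0.0, 0.0])

for episode in range(2000):
    s = 0
    for _ in range(EPISODE_LEN):
        if random.random() < EPSILON:
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda x: Q[s][x])
        s2, r, done = step(s, a)
        target = r if done else r + GAMMA * max(Q[s2])
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s2
        if done:
            break

# The greedy policy oscillates around the bonus tile instead of heading
# for the goal: reward hacking in miniature.
print([("L", "R")[max((0, 1), key=lambda x: Q[s][x])] for s in range(N_STATES - 1)])
```

Under the intended objective, the optimal policy heads straight for the goal. Under the reward as written, oscillating on the bonus tile is worth roughly 1/(1-γ²) ≈ 50 in discounted return versus about 10 for finishing, so the learner optimizes exactly what was specified rather than what was meant.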