What happens if we automatically evaluate plans generated by superhuman AIs using current LLMs, and then launch only the plans that our current LLMs look at and say, "this looks good"?
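A minimal sketch of that gating loop, just to make the setup concrete. The evaluator here is a placeholder callable standing in for "ask a current LLM whether the plan looks good"; the prompt wording and the approval check are assumptions, not any real API.

```python
from typing import Callable

def gated_launch(plan: str,
                 evaluator: Callable[[str], str],
                 launch: Callable[[str], None]) -> bool:
    """Launch `plan` only if the evaluator LLM approves it."""
    verdict = evaluator(f"Does this plan look good?\n\n{plan}")
    if "looks good" in verdict.lower():
        launch(plan)      # the step where a misjudged plan becomes irreversible
        return True
    return False          # rejected: plan never runs

# Usage with stub functions (hypothetical; a real setup would call an actual LLM):
if __name__ == "__main__":
    approved = gated_launch(
        plan="Deploy the new grid-balancing policy worldwide.",
        evaluator=lambda prompt: "This looks good.",
        launch=lambda p: print(f"Launching: {p}"),
    )
    print("approved:", approved)
```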
The obvious failure mode is that the LLM is not powerful enough to predict the consequences of the plan. The obvious fix is to include a human-relevant description of the consequences along with the plan. The obvious failure modes then: a manipulated description of the consequences, or a plan optimized to jailbreak the LLM. The obvious fix: …
I won’t continue: shallow rebuttals are not that convincing, but deep ones are close to capability research, so I don’t expect to find interesting answers here.