What happens if we automatically evaluate plans generated by superhuman AIs using current LLMs, and then launch only the plans that our current LLMs look at and say, "this looks good"?
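A minimal sketch of that gating loop, just to make the setup concrete. The evaluator here is a placeholder callable standing in for "ask a current LLM whether the plan looks good"; the prompt wording and the approval check are assumptions, not any real API.

```python
from typing import Callable

def gated_launch(plan: str,
                 evaluator: Callable[[str], str],
                 launch: Callable[[str], None]) -> bool:
    """Launch `plan` only if the evaluator LLM approves it."""
    verdict = evaluator(f"Does this plan look good?\n\n{plan}")
    if "looks good" in verdict.lower():
        launch(plan)      # the step where a misjudged plan becomes irreversible
        return True
    return False          # rejected: plan never runs

# Usage with stub functions (hypothetical; a real setup would call an actual LLM):
if __name__ == "__main__":
    approved = gated_launch(
        plan="Deploy the new grid-balancing policy worldwide.",
        evaluator=lambda prompt: "This looks good.",
        launch=lambda p: print(f"Launching: {p}"),
    )
    print("approved:", approved)
```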
The obvious failure mode is that the LLM is not powerful enough to predict the consequences of the plan. The obvious fix is to include a human-relevant description of the consequences along with the plan. The obvious failure modes then: a manipulated description of the consequences, or a plan optimized to jailbreak the LLM. The obvious fix: …
I won’t continue: shallow rebuttals are not that convincing, but deep ones are close to capability research, so I don’t expect to find interesting answers here.