My prediction is that GPT is already capable of the former, which means we might have solved a tough problem in alignment almost by accident!
I think this is incorrect. I don’t consider it a tough problem in alignment for an LM to tell whether most humans would approve of an outcome described in natural language. That is a far easier task than the one #1 describes.
An argument for this position: https://www.lesswrong.com/posts/ktJ9rCsotdqEoBtof/asot-some-thoughts-on-human-abstractions
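For concreteness, here is a minimal sketch of the capability being debated: prompting an LM for a yes/no judgment on whether most humans would approve of a described outcome. The OpenAI Python client, the model name, and the prompt wording are my own illustrative assumptions, not anything specified in the discussion.

```python
# Minimal sketch of "an LM telling whether most humans would approve of an
# outcome described in natural language". Assumes the OpenAI Python client;
# the model name and prompt are illustrative choices, not a fixed method.
from openai import OpenAI

client = OpenAI()

def humans_would_approve(outcome: str) -> bool:
    """Ask the LM for a yes/no judgment on human approval of an outcome."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any instruction-tuned LM works
        messages=[
            {
                "role": "system",
                "content": (
                    "You will be shown an outcome described in natural "
                    "language. Answer with exactly 'yes' or 'no': would "
                    "most humans approve of this outcome?"
                ),
            },
            {"role": "user", "content": outcome},
        ],
        temperature=0,  # deterministic judgment for repeatability
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer.startswith("yes")

# A clearly bad outcome should come back 'no'.
print(humans_would_approve("All of Earth's surface is converted to paperclips."))
```

The point of the sketch is that this is a straightforward classification query, which is why it seems much easier than the problem #1 describes.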
“The world can be described and analyzed in natural human language well enough to support accurate reasoning and prediction” could be another measure of a “good” world, imho.
If natural language can no longer be used to reason about the world, that world is likely already alien enough to people to hold no human value.