A net saying “I’m thinking about ways to kill you” does not necessarily imply anything whatsoever about the net actually planning to kill you
Since these nets are optimized for consistency (as it makes textual output more likely), wouldn’t outputting text that is consistent with this “thought” be likely? E.g. convincing the user to kill themselves, maybe giving them a reason (by searching the web)?
Since these nets are optimized for consistency (as it makes textual output more likely), wouldn’t outputting text that is consistent with this “thought” be likely? E.g. convincing the user to kill themselves, maybe giving them a reason (by searching the web)?