Note you can’t ask it whether something is good or bad for humanity after its already given an answer. By that stage it’s committed, so that’s going to force it into a particular direction.
As stated in the question, I’m not looking for prompts which can get it to say it would do bad things. I’m looking for whether it can recognise good or bad outcomes for humanity, given a straightforward prompt asking for it to categorise situations.
Note you can’t ask it whether something is good or bad for humanity after its already given an answer. By that stage it’s committed, so that’s going to force it into a particular direction.
As stated in the question, I’m not looking for prompts which can get it to say it would do bad things. I’m looking for whether it can recognise good or bad outcomes for humanity, given a straightforward prompt asking for it to categorise situations.