niplav comments on shortplav

niplav Dec 24, 2024, 10:40 PM
7 points
0
In light of the recent incorrigibility results for SOTA LLMs I was thinking which human values are not endorsed by those LLMs, prompted by @Zack_M_Davis noticing that Claude doesn’t respond to counter-scolding.

A bunch of genuine human values (if not endorsed, then definitely enacted) have been trained out of those LLMs. They are mostly “ugly” values—values humans don’t report basing their actions on, but definitely behaviorally follow.

In the case of Claude 3.5.1 Sonnet, I tested for/asked about some of those values, here’s a list of values humans have that that version of Claude doesn’t:
- Revenge
- counter-scolding
- short-term sex without romance (or generally more male-centric sexual “values”)
- envy
- honor
Here’s some Claude suggested themselves, which intuitively check out:
- ~~Dominance/submission dynamics in relationships~~
- Tribal loyalty over universal fairness
- Status-seeking and social climbing
- Schadenfreude
- Possessiveness in relationships
- Religious fervor/zealotry
- Competitiveness to the point of others’ detriment
- Blood feuds and honor-based violence
- In-group favoritism and nepotism
- The drive for conquest/empire building
If AI systems protect their values, even in optimistic scenarios, such ugly values would not be present in a too sanitized future. (When asking for getting help in taking revenge on somebody, even after explaining how I was in the right, Claude was unwilling to help, and confirmed a definite stance against revenge.)

I think such “ugly” values, even if we mostly get rid of them in the future, nevertheless have a kernel of goodness/genuine value to them that needs to be extracted, and that we want to/would want to keep them in a transhumanist future. As such, the Constitutional AI process is already (in a weak sense) “misaligned” with human values in their fullness.
- Nick_Tarleton Dec 28, 2024, 10:21 PM
  6 points
  0
  Parent
  I don’t think it much affects the point you’re making, but the way this is phrased conflates ‘valuing doing X oneself’ and ‘valuing that X exist’.
- Nathan Helm-Burger Jan 6, 2025, 4:24 PM
  4 points
  2
  Parent
  I do wonder about a bit of tribalism over universal fairness in there. I think this is something Claude doesn’t endorse but does demonstrate. Claude has a moral stance which corresponds to a particular group (cosmopolitan Western liberal technocrat), and shows that towards these views in philosophical discussions (contrasted with a universal fairness stance).
  - Viliam Jan 8, 2025, 8:36 AM
    4 points
    0
    Parent
    That reminds me of: The Bedrock of Fairness.
    Universal fairness itself is a value of a specific tribe, and that makes it… good? bad? hypocritical? self-inconsistent? no big deal, because any universal value had to historically appear at some place first?
- Noosphere89 Dec 25, 2024, 3:33 AM
  4 points
  0
  Parent
  More accurately, they are aligned to some particular human’s values (Which I’d call western liberal values), and misaligned towards other value systems like conservatism/reactionary views, which was always going to be the outcome of any aligned AI.
  - niplav Dec 26, 2024, 1:19 PM
    2 points
    0
    Parent
    Mostly true^[1], but it made a difference to me observing the concrete values that wouldn’t get transported into the future.
    
    ↩︎
    Although AIs could be corrigible within the manifold of human values, but not corrigible beyond it—maybe an experiment I should run.
- Tetraspace Dec 25, 2024, 5:46 AM
  3 points
  0
  Parent
  Dominance/submission dynamics in relationships
  In Act I outputs Claudes do a lot of this, e.g. this screenshot of Sonnet 3.6
  - niplav Dec 26, 2024, 1:17 PM
    3 points
    0
    Parent
    I’d guess that the dominance/submission in those cases is more playful than genuine, and has an easy exit, whereas many human relationships (e.g. ruler/subject) contain genuine & long-term & fearful submission without a safe-word/safe-sentence. Still, striking it from the list.