Here’s a gdoc comment I made recently that might be of wider interest:
You know, I wonder if this standard model of final goals vs. instrumental goals has it almost exactly backwards. Would love to discuss sometime.
Maybe there’s no such thing as a final goal directly. We start with a concept of “goal” and then we say that the system has machinery/heuristics for generating new goals given a context (context may or may not contain goals ‘on the table’ already). For example, maybe the algorithm for Daniel is something like:
--If context is [safe surroundings]+[no goals]+[hunger], add the goal “get food.”
--If context is [safe surroundings]+[travel-related-goal]+[no other goals], Engage Route Planning Module.
-- … (many such things like this)
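To make the flavor concrete, here’s a toy sketch in Python. Everything in it is made up for illustration (the context fields, the rule conditions, the stand-in “Route Planning Module”); the point is just that goals come out of context-matching rules rather than flowing down from a fixed final goal:

```python
# Toy illustration only -- all names and rules are invented, not a claim
# about actual cognitive architecture.
from dataclasses import dataclass, field


@dataclass
class Context:
    safe_surroundings: bool = False
    hungry: bool = False
    goals: list[str] = field(default_factory=list)  # goals already 'on the table'


def plan_route(goal: str) -> str:
    """Stand-in for a whole Route Planning Module."""
    return f"subgoal: plan a route for ({goal})"


def generate_goals(ctx: Context) -> list[str]:
    """One pass of the kludge: each rule fires if its context pattern matches."""
    new_goals: list[str] = []

    # If context is [safe surroundings] + [no goals] + [hunger], add "get food".
    if ctx.safe_surroundings and not ctx.goals and ctx.hungry:
        new_goals.append("get food")

    # If context is [safe surroundings] + [travel-related goal] + [no other goals],
    # engage the Route Planning Module.
    travel = [g for g in ctx.goals if "travel" in g]
    if ctx.safe_surroundings and travel and len(travel) == len(ctx.goals):
        new_goals.extend(plan_route(g) for g in travel)

    # ... many, many more such rules, accreted over a lifetime.
    return new_goals
```

So, e.g., generate_goals(Context(safe_surroundings=True, hungry=True)) returns ["get food"], and at no point does anything consult a top-level utility function.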
It’s a huge messy kludge, but it’s gradually becoming more coherent as I get older and smarter and do more reflection.
What are final goals?
Well, a goal is final for me to the extent that it tends to appear in a wide range of circumstances, to the extent that it tends to appear unprompted by any other goals, to the extent that it tends to take priority over other goals, … some such list of things like that.
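If I had to cash that out, it would be something like a graded score over a goal’s track record, rather than a special slot in the architecture. Here’s a toy sketch in Python; the record format, the field names, and the equal weighting are all made up for illustration:

```python
# Toy sketch: "finality" as a matter of degree, scored from a goal's track record.
# The episode format and the equal weights are arbitrary choices, just to illustrate.

def finality(episodes: list[dict], num_contexts_observed: int) -> float:
    """episodes: one record per occasion this goal showed up, e.g.
    {"context": "commuting", "unprompted": False, "won_priority": True}.
    num_contexts_observed: how many distinct contexts the agent has seen overall."""
    if not episodes or num_contexts_observed == 0:
        return 0.0
    breadth = len({e["context"] for e in episodes}) / num_contexts_observed
    unprompted = sum(e["unprompted"] for e in episodes) / len(episodes)
    priority = sum(e["won_priority"] for e in episodes) / len(episodes)
    return (breadth + unprompted + priority) / 3  # higher = more "final"
```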
For a mind like this, my final goals can be super super unstable and finicky. Stuff like taking a philosophy class with a student I have a crush on, who endorses ideology X, can totally change my final goals, because I have some sort of [basic needs met, time to think about long-term life ambitions] context, and it so happens that I’ve learned (perhaps by experience, perhaps by imitation) to engage my philosophical reasoning module in that context, and also I’ve built my identity around being “Rational” in a way that makes me motivated to hook up my instrumental reasoning abilities to whatever my philosophical reasoning module shits out… meanwhile my philosophical reasoning module is basically just imitating patterns of thought I’ve seen high-status cool philosophers make (including this crush) and applying those patterns to whatever mental concepts and arguments are at hand.
It’s a fucking mess.
But I think it’s how minds work.
Relevant: my post on value systematization
Though I have a sneaking suspicion that this comment was originally made on a draft of that?
At this point I don’t remember! But I think not, I think it was a comment on one of Carlsmith’s drafts about powerseeking AI and deceptive alignment.
To follow up, this might have big implications for understanding AGI. First of all, it’s possible that we’ll build AGIs that aren’t like that and that do have final goals in the traditional sense, e.g. because they are a hybrid of neural nets and ordinary software, maybe involving explicit tree search, or because SGD is more powerful at coherentizing the neural net’s goals than whatever goes on in the brain. If so, then we’ll really be dealing with a completely different kind of being from humans, I think.
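For contrast with the kludge sketched earlier, here’s a deliberately crude toy sketch of that more traditional picture: one fixed utility function, wired in once, with explicit tree search in service of it. The names, signatures, and depth-limited search are placeholders I made up, not a claim about how such an AGI would actually be built:

```python
# Crude sketch of an agent with a final goal "in the traditional sense":
# one fixed utility function, never swapped out by context, with explicit
# tree search in service of it. Assumes at least one action is always available.
from typing import Callable, Iterable


def best_action(state,
                actions: Callable[[object], Iterable[object]],
                transition: Callable[[object, object], object],
                utility: Callable[[object], float],
                depth: int = 3):
    """Depth-limited tree search against a single fixed utility function."""

    def value(s, d: int) -> float:
        opts = list(actions(s))
        if d == 0 or not opts:
            return utility(s)
        return max(value(transition(s, a), d - 1) for a in opts)

    return max(actions(state), key=lambda a: value(transition(state, a), depth - 1))
```

The point of the contrast is just that here the goal lives in one fixed function, whereas in the earlier sketch “goals” are whatever the context-matching rules happen to emit.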
Secondly, well, I discussed this three years ago in my LessWrong post “What if memes are common in highly capable minds?”