I want to interrogate a little more the notion that gradient descent samples uniformly (or rather, in a way dominated by the initialization distribution) from the set of good parameters. Have you read the various things about grokking, like "Hypothesis: GD Prefers General Circuits"? The argument there seems to be that you might start out dominated by the initialization distribution, but various sorts of regularization will push you to sample solutions in a nonuniform way. Do you have a take on this?
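To gesture at what I mean, here's a toy numpy sketch (entirely my own construction — the overparameterized least-squares setup, sizes, and weight-decay value are made up, not anything from your post). Plain GD ends up wherever its initialization says; adding weight decay drags every run toward the same small-norm solution:

```python
# Toy illustration (my construction): on overparameterized least squares, plain
# gradient descent converges to the zero-loss solution nearest its initialization,
# while weight decay pulls every run toward the minimum-norm solution -- i.e.
# regularization samples "good parameters" nonuniformly, away from init-dominance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))   # 5 data points, 20 parameters: many exact fits exist
y = rng.normal(size=5)

def train(weight_decay, w0, lr=0.01, steps=20000):
    w = w0.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) + weight_decay * w
        w -= lr * grad
    return w

for wd in [0.0, 0.1]:
    finals = [train(wd, rng.normal(size=20)) for _ in range(3)]
    spread = max(np.linalg.norm(a - b) for a in finals for b in finals)
    print(f"weight_decay={wd}: spread across inits = {spread:.3f}, "
          f"norm of one solution = {np.linalg.norm(finals[0]):.3f}")
# Without weight decay the three runs land at different, init-dependent zero-loss
# solutions; with weight decay they collapse onto (roughly) the same small-norm one.
```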
For the power-seeking-because-of-entropy example, I want to second the audience questions. If you're getting your policy by sampling from all possible policies, the argument is great, but if you're getting your policy by sampling NN parameters that generate strings of 100 actions, then you just finished arguing that uniform-ish sampling over NN parameters gives simplicity-ish sampling over policies. What would a NN do if trained to play the example game? I would assume it would quickly learn to exactly alternate $ and Apple. That looks a little less like power-seeking and more like telling DeepDream to fill the image with dogs, except filling a string with buying three apples. I dunno, do you think it's still like power-seeking?
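Here's the kind of toy experiment I have in mind (made up by me — the two-action game, the tiny network, and the "switch count" measure are placeholders, not your example). Sampling random NN parameters gives very repetitive action strings, nothing like sampling action strings uniformly:

```python
# Rough toy of the "uniform over parameters vs. uniform over policies" point
# (my own construction). A tiny randomly-initialized MLP maps timestep -> action;
# because a random network is a fairly smooth function of its input, the action
# strings it produces are far more repetitive than uniformly sampled strings.
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = ["$", "apple"]   # hypothetical two-action game
T = 100                    # episode length

def nn_policy_string():
    """Sample MLP weights, then read off the greedy action at each timestep."""
    W1 = rng.normal(size=(1, 16)); b1 = rng.normal(size=16)
    W2 = rng.normal(size=(16, len(ACTIONS))); b2 = rng.normal(size=len(ACTIONS))
    t = np.linspace(-1, 1, T).reshape(-1, 1)          # timestep feature
    logits = np.tanh(t @ W1 + b1) @ W2 + b2
    return [ACTIONS[i] for i in logits.argmax(axis=1)]

def switches(s):
    return sum(a != b for a, b in zip(s, s[1:]))      # how non-constant the string is

nn_strings = [nn_policy_string() for _ in range(200)]
uniform_strings = [[ACTIONS[i] for i in rng.integers(0, 2, T)] for _ in range(200)]
print("mean action switches, NN-sampled:     ", np.mean([switches(s) for s in nn_strings]))
print("mean action switches, uniform-sampled:", np.mean([switches(s) for s in uniform_strings]))
# NN-sampled strings switch actions only a handful of times per 100 steps; uniform
# strings switch ~50 times, so uniform-over-parameters is not uniform-over-policies.
```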
I think you make a subtle error when throwing out a lot of "mere biology" genes as not generating human values. If we had different mere biology than we do, the values we develop would probably be different even if our brain-specific genes were the same! Like, I dunno, suppose you have some genes that build your thyroid. But you can't go "ho hum, the thyroid isn't the brain, let's throw those genes out as uninformative," because thyroid activity impacts your mood, which impacts your expressed values. Or I bet I'd have different values if my eyes saw in UV rather than visible light, or my skin had no sense of pain, or I went through adolescence in two days rather than five years. Basically I totally disagree with the notion that "if we share it with plants, an AI wouldn't need to know it."
Actually, I'm kinda not sure how relevant you think the size-of-human-preference-generators question is, since we don't want the AI to learn human preferences in gene format; we want it to learn human preferences in some (different, I think we agree) format that's better suited for things like making decisions or comparing between different humans.
Cool last section. If you can have 2 dimensions of things to be Pareto-optimal over tradeoffs between, why not N? It seems like there are behaviors that are irrational even for markets (is failing to make mutually beneficial trades between individuals an example? I'm having trouble thinking of something less inward-facing) that could still be "optimal" for decision-making procedures with N of 3 or 4.
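A tiny made-up example of the "more dimensions, more behaviors count as optimal" intuition (the options and scores are invented purely for illustration):

```python
# Illustration with invented numbers: an option that is strictly dominated when you
# track only two criteria can become Pareto-optimal once a third criterion is added,
# so the set of "optimal" behaviors only grows with the number of dimensions.
def pareto_optimal(options):
    """Return options not strictly dominated (>= everywhere, > somewhere) by another."""
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return [o for o in options if not any(dominates(other, o) for other in options if other != o)]

options_2d = {"A": (3, 3), "B": (2, 2), "C": (1, 4)}   # B is dominated by A
options_3d = {k: v + (extra,) for (k, v), extra in zip(options_2d.items(), (1, 5, 1))}

for name, opts in [("2 criteria", options_2d), ("3 criteria", options_3d)]:
    front = pareto_optimal(list(opts.values()))
    print(name, "->", sorted(k for k, v in opts.items() if v in front))
# With 2 criteria only A and C survive; give B a high score on a 3rd criterion and
# all three options are Pareto-optimal.
```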
Thanks a bunch!