I think any reasonable estimate would need to be based on a more detailed plan: what types of rewards (loss function) we are providing, and what type of inner alignment we want.
My intuition roughly aligns with Eliezer’s on this point: I doubt this will work.
When I imagine rewarding an agent for doing things humans like, as indicated by smiles, thanks, etc., I have a hard time imagining that this just generalizes to an agent that does what we want, even in very different circumstances, including when it can relatively easily gain sovereignty and do whatever it wants.
Others have a different intuition. Buried in a comment somewhere from Quintin Pope, he says something to the effect of "shard theory isn't a new theory of alignment; it's the hypothesis that we don't need one." I think he and other shard theory optimists consider it entirely plausible that rewarding stuff we like will develop inner representations and alignment that are adequate for our purposes.
While I share Eliezer's and others' pessimism about alignment through pure RL, I don't share his overall pessimism. You've seen my alternate proposals for directly setting desirable goals drawn from an agent's learned knowledge.