So the system needs to draw a distinction between merely imagining freedom and making a plan for action that is predicted to actually produce freedom. This seems like something a critic system can learn fairly easily. It’s known that the rodent dopamine system can learn blocking: for example, not predicting reward when a blue light comes on at the same time as an already reward-predictive red light.
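The blocking effect mentioned here falls out naturally from a shared prediction error, as in the Rescorla–Wagner model. A minimal sketch; the learning rate, trial counts, and stimulus names are illustrative assumptions, not taken from any specific experiment:

```python
# Rescorla-Wagner sketch of blocking: all stimuli present on a trial
# share a single prediction error, so a stimulus introduced after
# another already predicts the reward learns almost nothing.
def rw_update(weights, stimuli, reward, lr=0.3):
    prediction = sum(weights[s] for s in stimuli)
    error = reward - prediction
    for s in stimuli:
        weights[s] += lr * error

weights = {"red": 0.0, "blue": 0.0}

# Phase 1: the red light alone fully predicts the reward.
for _ in range(50):
    rw_update(weights, ["red"], reward=1.0)

# Phase 2: blue is added, but red already explains the reward,
# so the prediction error is ~0 and blue is "blocked".
for _ in range(50):
    rw_update(weights, ["red", "blue"], reward=1.0)

print(round(weights["red"], 2), round(weights["blue"], 2))  # → 1.0 0.0
```

The key design point is that the error term is computed over the summed prediction of all present stimuli, not per stimulus; that single shared error is what produces blocking.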
There are two separable problems here: A. can a critic learn new abstract values? B. how does the critic distinguish reality from imagination? I don’t see how blocking provides a realistic solution to either. Can you spell out what the blocker is and how it might solve these problems?
In general, these are both critical problems with the open-ended “super critic” hypothesis. How does Montague deal with them? So far, I don’t see any good solution except a strong grounding in basic survival-relevant values, as any crack in the system seems likely to spiral out of control, much like heroin.
I’m a fan of Tomasello’s idea that social and sharing motivations provide the underlying fixed value function that drives most of human open-ended behavior. And there is solid evidence that humans and chimps differ strongly in these basic motivations, so it seems plausible that they are “built in”. I’m curious to hear more about your doubts about that data.
In short, I strongly doubt that an open-ended critic is viable: it is just too easy to short-circuit (wirehead). The socially grounded critic also has strong potential for bad local minima: basically the “mutual admiration society” of self-reinforcing social currency. The result is cults of all forms, including the one represented by one of the current major political parties in the US. But inevitably these are self-terminating when they conflict strongly with more basic survival values.
I don’t think Montague dealt with that issue much if at all. But it’s been a long time since I read the book.
My biggest takeaway from Tomasello’s work was his observation that humans pay far more attention to other humans than monkeys do to other monkeys. Direct reward for social approval is one possible mechanism, but it could also be some other bias in the system. I think hardwired reward for social approval is probably a real mechanism. But it’s also possible that the correlation between other people’s approval and more direct rewards of food, water, and shelter plays a large role in making human approval and disapproval a conditioned stimulus (or a fully “substituted” stimulus). Either way, I don’t think that distinction is very relevant for guessing the scope of the critic’s associations.
“But inevitably these are self-terminating when they conflict strongly with more basic survival values.”
I completely agree. This is the basis of my explanation for how humans could attribute value to abstract representations and not wirehead. In short, a system smart enough to learn the positive values of several-steps-removed conditioned stimuli can also learn many indicators of when those abstractions won’t lead to reward. These may be cortical representations of planning-but-not-doing, or other cortical indicators of the difference between reality and imagination. The weaker nature of simulated representations may be enough to distinguish the two, and it should certainly be enough to ensure that real rewards and punishments always have a stronger influence, making imagination ultimately subject to the control of reality.
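One way to cash this out is a TD critic in which imagined transitions carry a down-weighted learning signal, so real outcomes always dominate. A toy sketch; the `real` flag, the 0.1 imagination weighting, and the state names are my illustrative assumptions, not a claim about the actual mechanism:

```python
# TD(0) critic where imagined experience is down-weighted relative to
# real experience, keeping imagination "under the control of reality".
def td_update(V, state, reward, next_state, real, lr=0.5, gamma=0.9,
              imag_weight=0.1):
    effective_lr = lr if real else lr * imag_weight
    error = reward + gamma * V[next_state] - V[state]
    V[state] += effective_lr * error

V = {"meat": 0.0, "terminal": 0.0}

# An afternoon of daydreaming: many imagined rewards, each weakly weighted.
for _ in range(100):
    td_update(V, "meat", reward=1.0, next_state="terminal", real=False)
daydream_value = V["meat"]  # has drifted up close to 1.0

# One real, failed hunt: a single strongly weighted negative outcome
# nearly erases the value accumulated from imagination.
td_update(V, "meat", reward=-1.0, next_state="terminal", real=True)
corrected_value = V["meat"]  # back down near zero
```

The asymmetric learning rate means imagination can still shape values between real experiences, but any conflict with reality is resolved quickly in reality’s favor.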
If you’ve spent the afternoon wireheading by daydreaming about how delicious that fresh meat is, you’ll be very hungry in the evening. Something has gone very wrong, in much the same way as if you chose to hunt for game where there is none. In both cases, the system is going to have to learn where the wrong decision was made and the wrong strategy was followed. If you’re out of a job and out of money because you’ve spent months arguing with strangers on the internet about your beloved concept of freedom and the type of political policy that will provide it, something has similarly gone wrong. You might downgrade your estimated value of those concepts and theories, and you might downgrade the value of arguing on the internet with strangers all day.
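In TD terms, the bad real outcome backs up through the chain of abstract states that led to it, downgrading each of them. A toy sketch with hypothetical state names and parameters:

```python
# Back up a final real outcome through a completed trajectory of
# abstract states, so each earlier state's value is downgraded too.
def backup_chain(V, trajectory, final_reward, lr=0.5, gamma=0.9):
    states = trajectory + ["terminal"]
    for i in reversed(range(len(trajectory))):
        reward = final_reward if i == len(trajectory) - 1 else 0.0
        error = reward + gamma * V[states[i + 1]] - V[states[i]]
        V[states[i]] += lr * error

# Start out valuing both the abstract concept and the strategy highly.
V = {"freedom_concept": 1.0, "argue_online": 1.0, "terminal": 0.0}

# Months of arguing end in a strongly negative real outcome
# (no job, no money); repeated episodes drive both values down.
for _ in range(10):
    backup_chain(V, ["freedom_concept", "argue_online"], final_reward=-1.0)
```

Because the update runs backwards over the trajectory, the proximate strategy (“argue online”) is downgraded first and hardest, and the more distal concept is downgraded through it, which matches the intuition in the paragraph above.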
The same problem arises with any use of value estimates to make prediction-based decisions. It could be that the dopamine system is not involved in these predictions. But given the data that dopamine spiking activity is ubiquitous[1] even when no physical or social rewards are present, it seems likely to me that the system works the same way in abstract domains as it is known to work in concrete ones.
I need to find this paper, but don’t have time right now. The finding was that rodents exploring a new home cage exhibit dopamine spiking activity roughly once a second on average. I have a clear memory of the claim, but didn’t evaluate the methods closely enough to be sure it was well supported. If I’m wrong about this, I’d change my mind about the system working this way.
This could be explained by curiosity as an innate reward signal, and that might well be part of the story. But you’d still need to explain why animals don’t die by exploring instead of finding food. The same core explanation works for both: imagination and curiosity are both constrained to be weaker signals than real physical rewards.
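That constraint is easy to state concretely: an intrinsic curiosity bonus that decays with familiarity and is capped well below the scale of real physical rewards. A toy sketch; the cap value and the count-based novelty measure are illustrative assumptions:

```python
from collections import defaultdict

visit_counts = defaultdict(int)

# Curiosity bonus: decays as a state becomes familiar, and is capped
# (at 0.2) well below the unit scale of real physical rewards, so
# exploration can never outcompete actually finding food.
def total_reward(state, extrinsic, curiosity_cap=0.2):
    visit_counts[state] += 1
    bonus = curiosity_cap / visit_counts[state]
    return extrinsic + bonus

r_first = total_reward("new_corner", extrinsic=0.0)   # 0.2: novelty alone
r_second = total_reward("new_corner", extrinsic=0.0)  # 0.1: already fading
r_food = total_reward("food_site", extrinsic=1.0)     # 1.2: food dominates
```

Under this scheme an animal explores when nothing better is on offer, but any state that delivers a real physical reward immediately outranks the most novel corner of the cage.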