I think real learning has some kind of ground-truth reward.
I’d actually represent this as “subsidizing” some traders. For example, humans have a social-status-detector which is hardwired to our reward systems. One way to implement this is just by taking a trader which is focused on social status and giving it a bunch of money. I think this is also realistic in the sense that our human hardcoded rewards can be seen as (fairly dumb) subagents.
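To make the subsidy idea concrete, here's a minimal sketch (all names and numbers are hypothetical, just illustrating the mechanism):

```python
# Minimal sketch of "subsidizing" a trader: all traders start with a
# unit endowment except the hardwired social-status detector, which
# is given a large one, so its preferences dominate any
# wealth-weighted vote over what gets rewarded.

class Trader:
    def __init__(self, name, wealth, values):
        self.name = name
        self.wealth = wealth
        self.values = values  # outcome -> how much this trader likes it

def reward(traders, outcome):
    """Wealth-weighted aggregate valuation of an outcome."""
    return sum(t.wealth * t.values.get(outcome, 0.0) for t in traders)

traders = [Trader(f"trader_{i}", 1.0, {"novelty": 1.0}) for i in range(10)]
traders.append(Trader("social_status_detector", 100.0, {"status": 1.0}))

print(reward(traders, "status"), reward(traders, "novelty"))  # 100.0 vs 10.0
```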
I think this will by default lead to wireheading (a trader becomes wealthy and then sets reward to be very easy for it to get and then keeps getting it), and you’ll need a modification of this framework which explains why that’s not the case.
I think this happens in humans—e.g. we fall into cults, we then look for evidence that the cult is correct, etc etc. So I don’t think this is actually a problem that should be ruled out—it’s more a question of how you tweak the parameters to make this as unlikely as possible. (One reason it can’t be ruled out: it’s always possible for an agent to end up in a belief state where it expects that exploration will be very severely punished, which drives the probability of exploration arbitrarily low.)
they notice that topic A and topic B are unrelated enough, so you can have the traders thinking about these topics be pretty much separate, and you don’t lose much, and you waste less compute
I’m assuming that traders can choose to ignore whichever inputs/topics they like, though. They don’t need to make trades on everything if they don’t want to.
I do feel like real implementations of these mechanisms will need to have pretty different, way-more-local structure to be efficient at all
Yeah, this is why I’m interested in understanding how sub-markets can be aggregated into markets, sub-auctions into auctions, sub-elections into elections, etc.
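For what it's worth, one toy way to see the aggregation working, at least for prices (interfaces hypothetical; a real implementation would also need to handle trading between levels):

```python
class SimpleTrader:
    def __init__(self, wealth, price_fn):
        self.wealth = wealth
        self.price_fn = price_fn

    def price(self, question):
        return self.price_fn(question)

class SubMarket:
    """A market that can itself act as a single trader in a larger
    market: its wealth is its members' total wealth, and its price
    is their wealth-weighted average price."""
    def __init__(self, members):
        self.members = members

    @property
    def wealth(self):
        return sum(m.wealth for m in self.members)

    def price(self, question):
        return sum(m.wealth * m.price(question) for m in self.members) / self.wealth

# Usage: a market whose two participants are themselves markets.
optimists = SubMarket([SimpleTrader(2.0, lambda q: 0.9)])
pessimists = SubMarket([SimpleTrader(1.0, lambda q: 0.2)])
top = SubMarket([optimists, pessimists])
print(top.price("will it rain?"))  # (2*0.9 + 1*0.2) / 3 ≈ 0.67
```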
I’d actually represent this as “subsidizing” some traders
Sounds good!
it’s more a question of how you tweak the parameters to make this as unlikely as possible
Absolutely, wireheading is a real phenomenon, so the question is how real agents can exist that mostly don’t fall prey to it. And I was asking for a story about how your model can be altered/expanded to make sense of that. My guess is it will have to do with strongly subsidizing some traders, and/or having a pretty weird prior over traders. Maybe even something like “dynamically changing the prior over traders”[1].
I’m assuming that traders can choose to ignore whichever inputs/topics they like, though. They don’t need to make trades on everything if they don’t want to.
Yep, that’s why I believe “in the limit your traders will already do this”. I just think it will be a dominant dynamic of efficient agents in the real world, so it’s better to represent it explicitly (as a more hierarchical structure, etc.), instead of having that computation scattered across all the independent traders. I also think that’s how real agents probably do it, computationally speaking.
Of course, pedantically, this will always be equivalent to having a static prior and changing your update rule. But some update rules are much more easily made sense of if you interpret them as changing the prior.
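To spell out that equivalence (notation mine): let $w_t(T)$ be the prior/wealth assigned to trader $T$ at step $t$, $u_t(T)$ the multiplicative update from its trading performance, and $g_t(T)$ the extra dynamic reweighting of the prior. Then

$$w_{t+1}(T) = g_t(T)\, u_t(T)\, w_t(T) = w_0(T) \prod_{s=0}^{t} g_s(T)\, u_s(T),$$

which is exactly the trajectory the static prior $w_0$ gives under the modified update rule $\tilde{u}_t(T) = g_t(T)\, u_t(T)$.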
Absolutely, wireheading is a real phenomenon, so the question is how real agents can exist that mostly don’t fall prey to it. And I was asking for a story about how your model can be altered/expanded to make sense of that.
Ah, I see. In that case I think I disagree that it happens “by default” in this model. A few dynamics which prevent it:
If the wealthy trader makes reward easier to get, then the price of actions will go up accordingly (because other traders will notice that they can get a lot of reward by winning actions). So in order for the wealthy trader to keep making money, they need to reward outcomes which only they can achieve, which seems a lot harder.
I don’t yet know how traders would best aggregate votes into a reward function, but it should be something which has diminishing marginal return to spending, i.e. you can’t just spend 100x as much to get 100x higher reward on your preferred outcome. (Maybe quadratic voting? See the sketch after this list.)
Other traders will still make money by predicting sensory observations. Now, perhaps the wealthy trader could avoid this by making observations as predictable as possible (e.g. going into a dark room where nothing happens—kinda like depression, maybe?) But this outcome would be assigned very low reward by most other traders, so it only works once a single trader already has a large proportion of the wealth.
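Here's a toy version of the quadratic-voting option, just to show the diminishing returns (interface hypothetical):

```python
import math

def reward_shares(spending):
    """spending: list of (trader, outcome, amount) triples.
    Quadratic voting: a trader's spend of `amount` on an outcome buys
    sqrt(amount) votes, so 100x the spending buys only 10x the votes.
    Returns each outcome's share of the reward."""
    votes = {}
    for _trader, outcome, amount in spending:
        votes[outcome] = votes.get(outcome, 0.0) + math.sqrt(amount)
    total = sum(votes.values())
    return {o: v / total for o, v in votes.items()}

# One wealthy trader spending 100 buys 10 votes; ten small traders
# spending 1 each buy 10 votes between them, so the shares tie.
spends = [("whale", "easy_task", 100.0)]
spends += [(f"t{i}", "hard_task", 1.0) for i in range(10)]
print(reward_shares(spends))  # {'easy_task': 0.5, 'hard_task': 0.5}
```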
Yep, that’s why I believe “in the limit your traders will already do this”. I just think it will be a dominant dynamic of efficient agents in the real world, so it’s better to represent it explicitly
IMO the best way to explicitly represent this is via a bias towards simpler traders, who will in general pay attention to fewer things.
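Concretely, that bias could just be the initial endowment, e.g. (using attention-breadth as a crude stand-in for trader complexity):

```python
def simplicity_prior(num_inputs_attended):
    """num_inputs_attended: dict trader name -> how many inputs it
    watches (a crude proxy for description length). Endowments are
    proportional to 2^-complexity, so narrow traders start rich."""
    weights = {name: 2.0 ** -k for name, k in num_inputs_attended.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

print(simplicity_prior({"narrow": 2, "medium": 5, "worldview": 20}))
# The narrow trader starts with 8x the medium one's wealth, and the
# opinions-on-everything trader starts with almost nothing.
```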
But actually I don’t think that this is a “dominant dynamic” because in fact we have a strong tendency to try to pull different ideas and beliefs together into a small set of worldviews. And so even if you start off with simple traders who pay attention to fewer things, you’ll end up with these big worldviews that have opinions on everything. (These are what I call frames here.)
they need to reward outcomes which only they can achieve,
Yep! But this doesn’t seem so hard to me to pull off, especially in the form of: “I pick some easy task (that I can do perfectly); of course others will also be able to do it perfectly, but since I already have most of the money, if I just keep investing my money in doing it I will reign forever.” You prevent this from happening through epsilon-exploration, or something equivalent like giving money randomly to other traders. These solutions feel bad, but I think they’re the only real solutions. Although I also think stuff about meta-learning (traders explicitly learning how they should learn, etc.) probably pragmatically helps make these failures less likely.
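The two escape hatches I have in mind, as a sketch (parameters made up):

```python
import random

def choose_action(market_pick, actions, epsilon=0.05, rng=random):
    """Epsilon-exploration: mostly do what the market bid for, but
    with probability epsilon take a uniformly random action, so no
    trader can lock in one easy task forever."""
    return rng.choice(actions) if rng.random() < epsilon else market_pick

def redistribute(wealth, tax=0.01):
    """The 'give money randomly to other traders' variant: a flat
    tax on everyone, paid back out as a uniform dividend, which
    slowly erodes any single trader's dominance."""
    pot = tax * sum(wealth.values())
    dividend = pot / len(wealth)
    return {name: w * (1 - tax) + dividend for name, w in wealth.items()}
```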
it should be something which has diminishing marginal return to spending
Yep, that should help (at the cost of making good new ideas slower to implement, but I’m happy to make that trade-off).
But actually I don’t think that this is a “dominant dynamic” because in fact we have a strong tendency to try to pull different ideas and beliefs together into a small set of worldviews
Yeah. To be clear, the dynamic I think is “dominant” is “learning to learn better”. Which I think is not equivalent to simplicity-weighted traders. It is instead equivalent to having some more hierarchical structure on traders.
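To gesture at the difference: a hierarchical structure would have meta-traders that learn which sub-traders to fund, rather than a fixed simplicity weighting. A toy reallocation rule (multiplicative weights; details made up):

```python
def reallocate(budget, recent_returns, lr=0.5):
    """A meta-trader's 'learning to learn' step: shift its (conserved)
    budget towards the sub-traders that have been earning, via a
    multiplicative-weights update."""
    scores = {n: budget[n] * (1.0 + lr * recent_returns[n]) for n in budget}
    scale = sum(budget.values()) / sum(scores.values())
    return {n: s * scale for n, s in scores.items()}

print(reallocate({"a": 1.0, "b": 1.0}, {"a": 0.4, "b": -0.2}))
# a ends with ~1.14 of the budget, b with ~0.86.
```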