Here is the best toy model I currently have for rational agents. Alas, it is super messy and hacky, but better than nothing. I’ll call it the BAVM model; the one-sentence summary is “internal traders concurrently bet on beliefs, auction actions, vote on values, and merge minds”. There’s little novel here, I’m just throwing together a bunch of ideas from other people (especially Scott Garrabrant and Abram Demski).
In more detail, the three main components are:
A prediction market
An action auction
A value election
You also have some set of traders, who can simultaneously trade on any combination of these three. Traders earn money in two ways:
Making accurate predictions about future sensory experiences on the market.
Taking actions which lead to reward or increase the agent’s expected future value.
They spend money in three ways:
Bidding to control the agent’s actions for the next N timesteps.
Voting on what actions get reward and what states are assigned value.
Running the computations required to figure out all these trades.
Values are therefore dominated by whichever traders earn money from predictions or actions, who will disproportionately vote for values that are formulated in the same ontologies they use for prediction/action, since that’s simpler than using different ontologies.
The last component is that it costs traders money to do computation. The way they can reduce this is by finding other traders who do similar computations as them, and then merging into a single trader. I am very interested in better understanding what a merging process like this might look like, though it seems pretty intractable in general because it will depend a lot on the internal details of the traders. (So perhaps a more principled approach here is to instead work top-down, figuring out what sub-markets or sub-auctions look like?)
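The loop described above can be sketched concretely. Below is a toy single-timestep implementation under assumptions of my own that the model doesn't pin down (log-scoring for the prediction market, a first-price auction for actions, a flat compute tax); all names are illustrative:

```python
import math
import random

class Trader:
    """A hypothetical internal trader with a single bankroll."""
    def __init__(self, name, wealth=100.0):
        self.name = name
        self.wealth = wealth

    def predict(self, observation_space):
        # A probability distribution over next sensory observations
        # (random here; a real trader would compute something).
        raw = [random.random() + 1e-6 for _ in observation_space]
        total = sum(raw)
        return {o: r / total for o, r in zip(observation_space, raw)}

    def action_bid(self):
        # What this trader pays to control the agent's next N actions.
        return 0.1 * self.wealth

    def value_vote_spend(self):
        # Spending on the value election (which states get assigned value).
        return 0.05 * self.wealth

def bavm_step(traders, observation_space, true_observation, compute_cost=1.0):
    # 1. Prediction market: log-score each trader's forecast, shifted so
    #    a uniform forecast breaks even (log(p * K) = 0 when p = 1/K).
    for t in traders:
        p = t.predict(observation_space)[true_observation]
        t.wealth += math.log(p * len(observation_space) + 1e-9)

    # 2. Action auction: the highest bidder buys control and pays its bid.
    winner = max(traders, key=lambda t: t.action_bid())
    winner.wealth -= winner.action_bid()

    # 3. Value election: votes cost money (aggregation not modeled here).
    for t in traders:
        t.wealth -= t.value_vote_spend()

    # 4. Computation costs: every trader pays to think; broke traders exit,
    #    which is what creates the pressure to merge with similar traders.
    for t in traders:
        t.wealth -= compute_cost
    survivors = [t for t in traders if t.wealth > 0]
    return winner, survivors
```

This is a sketch, not a proposal: in particular the fixed bid/vote fractions stand in for whatever strategic spending real traders would compute.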
I wonder if there’s a loopiness here which breaks the setup (the expectation, I’m guessing, is relative to the prediction market’s probabilities? Though it seems like the market is over sensory experiences while the values are over world states in general, so maybe I’m missing something). But it seems like if I take an action and move the market at the same time, I might be able to extract a bunch of extra money and acquire outsize control.
Bidding to control the agent’s actions for the next N timesteps
This seems wasteful relative to contributing to a pool that bids on action A (or short-term policy P). I guess coordination is hard if you’re just contributing to the pool, though, and it all connects to the merging process you describe.
I’ve been studying and thinking about the physical side of this phenomenon in neuroscience recently. There are groups of columns of neurons in the cortex that form temporary voting blocs, regarding whatever subject that particular Brodmann area focuses on. These alternating groups have to deal with physical limits on how many groups a region can stably divide into, which limits the number of active distinct hypotheses or ‘traders’ there can be in a given area at a given time. It’s unclear exactly what the max is, and it depends on the cortical region in question, but generally 6-9 is the approximate max (not coincidentally the number of distinct ‘chunks’ we can hold in active short-term memory). Also, there is a tendency for noise to push overly similar traders/hypotheses/firing-groups back into synchrony/agreement with each other, collapsing them down to a baseline of two competing hypotheses. These hypotheses/firing-groups/traders are pushed into existence, or pushed into merging, not just by their own ‘bids’ but also by the evidence coming in from other brain areas or senses. I don’t think that current-day neuroscience has all the details yet (although I certainly don’t have the full picture of all relevant papers in neuroscience!).
I think there are probably a lot of ways to build rational agents. The idea that general intelligence is hard in any absolute sense may be biased by wanting to believe we’re special, and, for AI workers, that our work is special and difficult.
I note that AI economies like this will often have explosively better credit assignment for information production than human economies can manage. Artificial agents can be copied or rolled back (erase memories), which makes it possible to reverse the receipt of information if an assessor concludes with a price that the seller considers too low for a deal. In human economies, that’s impossible: you can’t send a clone to value a piece of information and then delete the clone if you decide not to buy (that’s too expensive/illegal), nor can you wipe their memory of the information (we don’t know how to do that). So the very basic requirement for trade, assessment prior to purchase, is not possible in human economies, which means information doesn’t get priced accurately and has to be treated as a public good.
When implementing this (internal privacy) in a multi-agent architecture, though, make sure to take measures to prevent the formation of monopolies. Information is kind of an increasing-returns type of good, yeah? The more you have, the more you can do with it. The system could quickly stop being multi-agent, and at worst the monopoly could consolidate enough political power to manipulate the EV estimators and reward hack. In theory those economies shouldn’t interact, but it’s impossible to totally prevent: the EV estimators receive big sets of action proposals from the decisionmakers, and the decisionmakers will see which action proposal the EV estimators end up choosing.
Artificial agents can be copied or rolled back (erase memories), which makes it possible to reverse the receipt of information if an assessor concludes with a price that the seller considers too low for a deal.
Yepp, very good point. Am working on a short story about this right now.
My guess is that understanding merging is the key to most prediction-of-behavior issues (things that motivated and also foiled UDT, but not limited to known-in-advance preference setting). Two agents can coordinate if they are the same, or by reasoning about each other’s behavior, but in general they can be too complicated to clearly understand each other or themselves, can inadvertently diagonalize such attempts into impossibility, or even fail to be sufficiently aware of each other to start reasoning about each other specifically.
It might be useful to formulate smaller computations (contracts/adjudicators) that facilitate coordination between different agents by being shared between them, with the bigger agents acting as parts of environments for the contracts and setting up incentives for them, while the contracts can themselves engage in decision making within those environments. Contracts coordinate by being shared and acting with strategicness across relevant agents (they should be something like common knowledge), and it’s feasible for agents to find/construct some shared contracts as a result of them being much simpler than agents that host them. Learning of contracts doesn’t need to start with targeting coordination with other big agents, as active contracts screen off the other agents they facilitate coordination with.
Using contracts requires the big agents to make decisions about policies that affect the contracts updatelessly with respect to how the contracts end up behaving. That is, a contract should be able to know these policies, and the policies should describe responses to possible behaviors of a contract without themselves changing (once the contract computes more of its behavior), enabling the contract to do decision making in the environment of these policies. This corresponds to committing to abide by the contract. Assurance contracts (that start their tenure by checking that the commitments of all parties are actually in place) are especially important, allowing things like cooperation in PD.
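A toy version of the assurance-contract idea for the Prisoner’s Dilemma (the representation is my own, not from the comment): the contract is a small shared computation that begins its tenure by checking that both parties’ commitments are in place, and only then directs cooperation.

```python
def assurance_contract(commitments):
    """A shared contract for a two-player Prisoner's Dilemma. It first
    checks that both parties have committed to abide by it; only then does
    it direct mutual cooperation. Otherwise it falls back to the default
    (defect), so a committed party can never be exploited by a holdout.
    `commitments` maps player name -> "abide" (or anything else)."""
    if all(commitments.get(p) == "abide" for p in ("row", "col")):
        return {"row": "C", "col": "C"}
    return {"row": "D", "col": "D"}

# Both committed: the contract delivers cooperation.
assert assurance_contract({"row": "abide", "col": "abide"}) == {"row": "C", "col": "C"}
# One holdout: the contract never exposes the committed party.
assert assurance_contract({"row": "abide"}) == {"row": "D", "col": "D"}
```

The point of the check-then-act structure is exactly the updatelessness described above: each big agent’s policy responds to the contract’s possible outputs without changing once the contract runs.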
If traders can get access to the control panel for the external agent’s actions AND they profit from accurately predicting its observations, then wouldn’t the best strategy be “create as much chaos as possible that is only predictable to me, its creator”? So traders that value ONLY accurate predictions will get the advantage?
I like this picture! But I think real learning has some kind of ground-truth reward. So we should clearly separate between “this ground-truth reward that is chiseling the agent during training (and not after training)”, and “the internal shards of the agent negotiating and changing your exact objective (which can happen both during and after training)”. I’d call the latter “internal value allocation”, or something like that. It doesn’t neatly correspond to any ground truth, and is partly determined by internal noise in the agent. And indeed, eventually, when you “stop training” (or at least “get decoupled enough from reward”), it just evolves on its own, separate from any ground truth.
And maybe more importantly:
I think this will by default lead to wireheading (a trader becomes wealthy and then sets reward to be very easy for it to get and then keeps getting it), and you’ll need a modification of this framework which explains why that’s not the case.
My intuition is a process of the form “eventually, traders (or some kind of specialized meta-traders) change the learning process itself to make it more efficient”. For example, they notice that topic A and topic B are unrelated enough, so you can have the traders thinking about these topics be pretty much separate, and you don’t lose much, and you waste less compute. Probably these dynamics will already be “in the limit” applied by your traders, but it will be the dominant dynamic so it should be directly represented by the formalism.
Finally, this might come later, and not yet in the level of abstraction you’re using, but I do feel like real implementations of these mechanisms will need to have pretty different, way-more-local structure to be efficient at all. It’s conceivable to say “this is the ideal mechanism, and real agents are just hacky approximations to it, so we should study the ideal mechanism first”. But my intuition says, on the contrary, some of the physical constraints (like locality, or the architecture of nets) will strongly shape which kind of macroscopic mechanism you get, and these will present pretty different convergent behavior. This is related, but not exactly equivalent to, partial agency.
I think real learning has some kind of ground-truth reward.
I’d actually represent this as “subsidizing” some traders. For example, humans have a social-status-detector which is hardwired to our reward systems. One way to implement this is just by taking a trader which is focused on social status and giving it a bunch of money. I think this is also realistic in the sense that our human hardcoded rewards can be seen as (fairly dumb) subagents.
I think this will by default lead to wireheading (a trader becomes wealthy and then sets reward to be very easy for it to get and then keeps getting it), and you’ll need a modification of this framework which explains why that’s not the case.
I think this happens in humans—e.g. we fall into cults, we then look for evidence that the cult is correct, etc etc. So I don’t think this is actually a problem that should be ruled out—it’s more a question of how you tweak the parameters to make this as unlikely as possible. (One reason it can’t be ruled out: it’s always possible for an agent to end up in a belief state where it expects that exploration will be very severely punished, which drives the probability of exploration arbitrarily low.)
they notice that topic A and topic B are unrelated enough, so you can have the traders thinking about these topics be pretty much separate, and you don’t lose much, and you waste less compute
I’m assuming that traders can choose to ignore whichever inputs/topics they like, though. They don’t need to make trades on everything if they don’t want to.
I do feel like real implementations of these mechanisms will need to have pretty different, way-more-local structure to be efficient at all
Yeah, this is why I’m interested in understanding how sub-markets can be aggregated into markets, sub-auctions into auctions, sub-elections into elections, etc.
I’d actually represent this as “subsidizing” some traders
Sounds good!
it’s more a question of how you tweak the parameters to make this as unlikely as possible
Absolutely, wireheading is a real phenomenon, so the question is how can real agents exist that mostly don’t fall to it. And I was asking for a story about how your model can be altered/expanded to make sense of that. My guess is it will have to do with strongly subsidizing some traders, and/or having a pretty weird prior over traders. Maybe even something like “dynamically changing the prior over traders”[1].
I’m assuming that traders can choose to ignore whichever inputs/topics they like, though. They don’t need to make trades on everything if they don’t want to.
Yep, that’s why I believe “in the limit your traders will already do this”. I just think it will be a dominant dynamic of efficient agents in the real world, so it’s better to represent it explicitly (as a more hierarchical structure, etc.), instead of having that computation scattered across all the independent traders. I also think that’s how real agents probably do it, computationally speaking.
Of course, pedantically, this will always be equivalent to having a static prior and changing your update rule. But some update rules are much more easily made sense of if you interpret them as changing the prior.
Absolutely, wireheading is a real phenomenon, so the question is how can real agents exist that mostly don’t fall to it. And I was asking for a story about how your model can be altered/expanded to make sense of that.
Ah, I see. In that case I think I disagree that it happens “by default” in this model. A few dynamics which prevent it:
If the wealthy trader makes reward easier to get, then the price of actions will go up accordingly (because other traders will notice that they can get a lot of reward by winning actions). So in order for the wealthy trader to keep making money, they need to reward outcomes which only they can achieve, which seems a lot harder.
I don’t yet know how traders would best aggregate votes into a reward function, but it should be something which has diminishing marginal return to spending, i.e. you can’t just spend 100x as much to get 100x higher reward on your preferred outcome. (Maybe quadratic voting?)
Other traders will still make money by predicting sensory observations. Now, perhaps the wealthy trader could avoid this by making observations as predictable as possible (e.g. going into a dark room where nothing happens—kinda like depression, maybe?) But this outcome would be assigned very low reward by most other traders, so it only works once a single trader already has a large proportion of the wealth.
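The diminishing-returns idea above (maybe quadratic voting) can be sketched as follows; the aggregation rule is my guess at one natural reading, not something settled in the thread:

```python
import math

def aggregate_reward(ballots):
    """ballots: {trader: {outcome: spend}}. Under a quadratic-voting-style
    rule, buying v votes costs v^2, so a spend buys sqrt(spend) votes:
    spending 100x as much yields only 10x the influence. Votes for each
    outcome are summed across traders to form the reward function."""
    reward = {}
    for spends in ballots.values():
        for outcome, spend in spends.items():
            reward[outcome] = reward.get(outcome, 0.0) + math.sqrt(spend)
    return reward
```

Here a single wealthy trader spending 100 on its pet outcome is exactly balanced by ten traders spending 1 each on another, which is the anti-wireheading property being pointed at.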
Yep, that’s why I believe “in the limit your traders will already do this”. I just think it will be a dominant dynamic of efficient agents in the real world, so it’s better to represent it explicitly
IMO the best way to explicitly represent this is via a bias towards simpler traders, who will in general pay attention to fewer things.
But actually I don’t think that this is a “dominant dynamic” because in fact we have a strong tendency to try to pull different ideas and beliefs together into a small set of worldviews. And so even if you start off with simple traders who pay attention to fewer things, you’ll end up with these big worldviews that have opinions on everything. (These are what I call frames here.)
they need to reward outcomes which only they can achieve,
Yep! But this doesn’t seem that hard to me to happen, especially in the form of “I pick some easy task (that I can do perfectly), and of course others will also be able to do it perfectly, but since I already have most of the money, if I just keep investing my money in doing it I will reign forever”. You prevent this from happening through epsilon-exploration, or something equivalent like giving money randomly to other traders. These solutions feel bad, but I think they’re the only real solutions. Although I also think stuff about meta-learning (traders explicitly learning about how they should learn, etc.) probably pragmatically helps make these failures less likely.
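One crude way to implement the “giving money randomly to other traders” variant (function and parameter names are mine, purely illustrative):

```python
import random

def epsilon_redistribute(wealths, epsilon=0.02, rng=random):
    """Tax every trader a fraction epsilon of its wealth and hand the whole
    pot to one uniformly random trader. Since the dominant trader pays more
    than pot/n in tax but gains only pot/n in expectation, this steadily
    leaks wealth from the incumbent toward potential upstarts; total
    wealth is conserved."""
    taxed = {name: w * (1.0 - epsilon) for name, w in wealths.items()}
    pot = sum(wealths.values()) - sum(taxed.values())
    lucky = rng.choice(sorted(taxed))  # uniform over trader names
    taxed[lucky] += pot
    return taxed
```

This is the same flavor of fix as epsilon-exploration: a small, dumb, incumbent-blind subsidy that no wealthy trader can fully capture.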
it should be something which has diminishing marginal return to spending
Yep, that should help (also at the trade-off of making new good ideas slower to implement, but I’m happy to make that trade-off).
But actually I don’t think that this is a “dominant dynamic” because in fact we have a strong tendency to try to pull different ideas and beliefs together into a small set of worldviews
Yeah. To be clear, the dynamic I think is “dominant” is “learning to learn better”, which I think is not equivalent to simplicity-weighted traders. It is instead equivalent to having some more hierarchical structure on traders.
Can new traders be “spawned”?
Yepp, as in Logical Induction, new traders get spawned over time (in some kind of simplicity-weighted ordering).
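A minimal sketch of simplicity-weighted spawning; using description length in bits as the complexity measure is my stand-in, not something specified above:

```python
def spawn_budget(description_length_bits, total_subsidy=1.0):
    """Starting wealth for a newly spawned trader under a simplicity
    weighting: budget proportional to 2^-length, so simpler traders start
    richer, and the total subsidy stays bounded no matter how many traders
    spawn (sum of 2^-n over n >= 1 is 1)."""
    return total_subsidy * 2.0 ** (-description_length_bits)

budgets = [spawn_budget(n) for n in range(1, 8)]
# Simpler traders start with more money...
assert budgets == sorted(budgets, reverse=True)
# ...and the subsidy pool is never exhausted.
assert sum(budgets) < 1.0
```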
Nice!
For comparison, this “Great Map of the Mind” is basically the standard academic philosophy picture.