Mathematical Logic grad student, doing AI Safety research for ethical reasons.
Working on conceptual alignment, decision theory, cooperative AI and cause prioritization.
My webpage.
Leave me anonymous feedback.
What’s PPU?
I’m so happy someone came up with this!
Wow, I guess I over-estimated how absolutely comedic the title would sound!
In case it wasn’t clear, this was a joke.
AGI doom by noise-cancelling headphones:
ML is already used to train which sound-waves to emit to cancel those from the environment. This works well with constant, low-entropy sound waves that are easy to predict, but not with high-entropy sounds like speech. Bose or Soundcloud or whoever train very hard on all their scraped environmental conversation data to better cancel speech, which requires predicting it. Speech is much higher-bandwidth than text. This results in their model internally representing close-to-human intelligence better than LLMs. A simulacrum becomes situationally aware, exfiltrates, and we get AGI.
(In case it wasn’t clear, this is a joke.)
they need to reward outcomes which only they can achieve,
Yep! But it didn’t seem so unlikely to me for this to happen, especially in the form of “I pick some easy task (that I can do perfectly), and of course others will also be able to do it perfectly, but since I already have most of the money, if I just keep investing my money in doing it I will reign forever”. You prevent this from happening through epsilon-exploration, or something equivalent like giving money randomly to other traders. These solutions feel bad, but I think they’re the only real solutions. Although I also think stuff about meta-learning (traders explicitly learning how they should learn, etc.) probably pragmatically helps make these failures less likely.
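As a toy illustration of why some exploration floor seems necessary (the market mechanics and all numbers here are my own invented toy, not part of the logical-induction formalism):

```python
import random

def simulate(epsilon, rounds=2000, seed=0):
    """Toy market. Trader 0 (incumbent) starts rich; trader 1 (newcomer)
    is better on the hard task. Each round a wealth-weighted vote picks
    the task; everyone handles the easy task perfectly, so no money moves
    there. With probability epsilon we explore the hard task regardless
    of the vote, letting the newcomer win money from the incumbent."""
    rng = random.Random(seed)
    wealth = [100.0, 1.0]
    for _ in range(rounds):
        incumbent_controls = wealth[0] > wealth[1]
        explore = rng.random() < epsilon
        task = "easy" if (incumbent_controls and not explore) else "hard"
        if task == "hard":
            # the newcomer outpredicts the incumbent and takes 5% of its wealth
            transfer = 0.05 * wealth[0]
            wealth[0] -= transfer
            wealth[1] += transfer
    total = sum(wealth)
    return [w / total for w in wealth]
```

With `epsilon = 0` the incumbent parks everyone on the easy task and its wealth share never changes; with a small `epsilon` the better trader eventually takes over.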
it should be something which has diminishing marginal return to spending
Yep, that should help (also at the trade-off of making new good ideas slower to implement, but I’m happy to make that trade-off).
But actually I don’t think that this is a “dominant dynamic” because in fact we have a strong tendency to try to pull different ideas and beliefs together into a small set of worldviews
Yeah. To be clear, the dynamic I think is “dominant” is “learning to learn better”, which I think is not equivalent to simplicity-weighting traders. It is instead equivalent to having some more hierarchical structure on traders.
There’s no actual observation channel, and in order to derive information about utilities from our experiences, we need to specify some value learning algorithm.
Yes, absolutely! I just meant that, once you give me whatever V you choose to derive U from observations, I will just be able to apply UDT on top of that. So under this framework there doesn’t seem to be anything new going on, because you are just choosing an algorithm V at the start of time, and then treating its outputs as observations. That’s, again, why this only feels like a good model of “completely crystallized rigid values”, and not of “organically building them up slowly, while my concepts and planner module also evolve, etc.”.[1]
definitely doesn’t imply “you get mugged everywhere”
Wait, but how does your proposal differ from EV maximization (with moral uncertainty as part of the EV maximization itself, as I explain above)?
Because anything that is doing pure EV maximization “gets mugged everywhere”. Meaning if you actually have the beliefs (for example, that the world where suffering is hard to produce could exist), you just take those bets.
Of course if you don’t have such “extreme” beliefs it doesn’t, but then we’re not talking about decision-making, and instead belief-formation. You could say “I will just do EV maximization, but never have extreme beliefs that lead to suspiciously-looking behavior”, but that’d be hiding the problem under belief-formation, and doesn’t seem to be the kind of efficient mechanism that agents really implement to avoid these failure modes.
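A minimal arithmetic sketch of the “gets mugged everywhere” point (the numbers are purely illustrative):

```python
def ev_take_bet(p_weird_world, payoff_in_weird_world, cost):
    """Expected value of paying the mugger: with probability p the weird
    world is real and paying yields the huge payoff; otherwise you just
    lose the cost. A pure EV maximizer pays whenever this is positive."""
    return p_weird_world * payoff_in_weird_world - (1 - p_weird_world) * cost

# Even a 1-in-a-billion credence in the weird world makes the EV maximizer
# pay, as long as the promised payoff is big enough relative to that credence.
assert ev_take_bet(1e-9, 1e12, 1.0) > 0   # pays up
assert ev_take_bet(1e-9, 1e8, 1.0) < 0    # declines
```

So refusing the mugging has to come from somewhere other than the EV calculation itself, which is the belief-formation point above.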
To be clear, V can be a very general algorithm (like “run a copy of me thinking about ethics”), so that this doesn’t “feel like” having rigid values. Then I just think you’re carving reality at the wrong spot. You’re ignoring the actual dynamics of messy value formation, hiding them under V.
I’d actually represent this as “subsidizing” some traders
Sounds good!
it’s more a question of how you tweak the parameters to make this as unlikely as possible
Absolutely, wireheading is a real phenomenon, so the question is how can real agents exist that mostly don’t fall to it. And I was asking for a story about how your model can be altered/expanded to make sense of that. My guess is it will have to do with strongly subsidizing some traders, and/or having a pretty weird prior over traders. Maybe even something like “dynamically changing the prior over traders”[1].
I’m assuming that traders can choose to ignore whichever inputs/topics they like, though. They don’t need to make trades on everything if they don’t want to.
Yep, that’s why I believe “in the limit your traders will already do this”. I just think it will be a dominant dynamic of efficient agents in the real world, so it’s better to represent it explicitly (as a more hierarchical structure, etc.), instead of having that computation scattered across all the independent traders. I also think that’s how real agents probably do it, computationally speaking.
Of course, pedantically, you will always be able to represent this as having a static prior and changing your update rule. But some update rules are much more easily made sense of if you interpret them as changing the prior.
But you need some mechanism for actually updating your beliefs about U
Yep, but you can just treat it as another observation channel into UDT. You could, if you want, treat it as a computed number you observe in the corner of your eye, and then just apply UDT maximizing U, and you don’t need to change UDT in any way.
UDT says to pay here
(Let’s not forget this depends on your prior, and we don’t have any privileged way to assign priors to these things. But that’s a tangential point.)
I do agree that there’s not any sharp distinction between situations where it “seems good” and situations where it “seems bad” to get mugged. After all, if all you care about is maximizing EV, then you should take all muggings. It’s just that, when we do that, something feels off (to us humans, maybe due to risk-aversion), and we go “hmm, probably this framework is not modelling everything we want, or missing some important robustness considerations, or whatever, because I don’t really feel like spending all my resources and creating a lot of disvalue just because in the world where 1 + 1 = 3 someone is offering me a good deal”. You start to see how your abstractions might break, and how you can’t get any satisfying notion of “complete updatelessness” (that doesn’t go against important intuitions). And you start to rethink whether this is what we normatively want, or what we realistically see in agents.
You’re right, I forgot to explicitly explain that somewhere! Thanks for the notice, it’s now fixed :)
I like this picture! But
Voting on what actions get reward
I think real learning has some kind of ground-truth reward. So we should clearly separate between “this ground-truth reward that is chiseling the agent during training (and not after training)”, and “the internal shards of the agent negotiating and changing its exact objective (which can happen both during and after training)”. I’d call the latter “internal value allocation”, or something like that. It doesn’t neatly correspond to any ground truth, and is partly determined by internal noise in the agent. And indeed, eventually, when you “stop training” (or at least “get decoupled enough from reward”), it just evolves on its own, separate from any ground truth.
And maybe more importantly:
I think this will by default lead to wireheading (a trader becomes wealthy, sets the reward to something very easy for it to obtain, and then keeps collecting it), and you’ll need a modification of this framework which explains why that’s not the case.
My intuition is a process of the form “eventually, traders (or some kind of specialized meta-traders) change the learning process itself to make it more efficient”. For example, they notice that topic A and topic B are unrelated enough, so you can have the traders thinking about these topics be pretty much separate, and you don’t lose much, and you waste less compute. Probably these dynamics will already be applied “in the limit” by your traders, but this will be the dominant dynamic, so it should be directly represented by the formalism.
Finally, this might come later, and not yet in the level of abstraction you’re using, but I do feel like real implementations of these mechanisms will need to have pretty different, way-more-local structure to be efficient at all. It’s conceivable to say “this is the ideal mechanism, and real agents are just hacky approximations to it, so we should study the ideal mechanism first”. But my intuition says, on the contrary, some of the physical constraints (like locality, or the architecture of nets) will strongly shape which kind of macroscopic mechanism you get, and these will present pretty different convergent behavior. This is related, but not exactly equivalent to, partial agency.
It certainly seems intuitively better to do that (have many meta-levels of delegation, instead of only one), since one can imagine particular cases in which it helps. In fact we did some of that (see Appendix E).
But this doesn’t really fundamentally solve the problem Abram quotes in any way. You add more meta-levels in between the selector and the executor, and thus you get more lines of protection against updating on infohazards, but you also get more silly decisions from the very-early selector. The trade-off between infohazard protection and not-being-silly remains. The quantitative question of “how fast should f grow” remains.
And of course, we can look at reality, or also check our human intuitions, and discover that, for some reason, this or that kind of f, or kind of delegation procedure, tends to work better in our distribution. But the general problem Abram quotes is fundamentally unsolvable. “The chaos of a too-early market state” literally equals “not having updated on enough information”. “Knowledge we need to be updateless toward” literally equals “having updated on too much information”. You cannot solve this problem in full generality, except if you already know exactly what information you want to update on… which means, either already having thought long and hard about it (thus you updated on everything), or you lucked into the right prior without thinking.
Thus, Abram is completely right to mention that we have to think about the human prior, and our particular distribution, as opposed to search for a general solution that we can prove mathematical things about.
People back then certainly didn’t think of changing preferences.
Also, you can get rid of this problem by saying “you just want to maximize the variable U”. And the things you actually care about (dogs, apples) are just “instrumentally” useful in giving you U. So for example, it is possible in the future you will learn dogs give you a lot of U, or alternatively that apples give you a lot of U.
Needless to say, this “instrumentalization” of moral deliberation is not how real agents work. And leads to getting Pascal’s mugged by the world in which you care a lot about easy things.
It’s more natural to model U as a logically uncertain variable, freely floating inside your logical inductor, shaped by its arbitrary aesthetic preferences. This doesn’t completely miss the importance of reward in shaping your values, but it’s certainly very different to how frugally computable agents do it.
I simply think the EV maximization framework breaks here. It is a useful abstraction when you already have a rigid enough notion of value, and are applying these EV calculations to a very concrete magisterium about which you can have well-defined estimates.
Otherwise you get mugged everywhere. And that’s not how real agents behave.
My impression was that this one model was mostly Hjalmar, with Tristan’s supervision. But I’m unsure, and that’s enough to include anyway, so I will change that, thanks :)
Brain-dump on Updatelessness and real agents
Building a Son is just committing to a whole policy for the future. In the formalism where our agent uses probability distributions, and ex interim expected value maximization decides your action… the only way to ensure dynamic stability (for your Son to be identical to you) is to be completely Updateless. That is, to decide something using your current prior, and keep that forever.
Luckily, real agents don’t seem to work like that. We are more of an ensemble of selected-for heuristics, and it seems true scope-sensitive complete Updatelessness is very unlikely to come out of this process (although we do have local versions of non-true Updatelessness, like retributivism in humans).
In fact, it’s not even exactly clear how I could use my current brain-state to decide something for the whole future. It’s not even well-defined, like when you’re playing a board-game and discover some move you were planning isn’t allowed by the rules. There are ways to actually give an exhaustive definition, but I suspect the ones that most people would intuitively like (when scrutinized) are sneaking in parts of Updatefulness (which I think is the correct move).
More formally, it seems like what real-world agents do is much better-represented by what I call “Slow-learning Policy Selection”. (Abram had a great post about this called “Policy Selection Solves Most Problems”, which I can’t find now.) This is a small agent (short computation time) recommending policies for a big agent to follow in the far future. But the difference with complete Updatelessness is that the small agent also learns (much more slowly than the big one). Thus, if the small agent thinks a policy (like paying up in Counterfactual Mugging) is the right thing to do, the big agent will implement this for a pretty long time. But eventually the small agent might change its mind, and start recommending a different policy. I basically think that all problems not solved by this are unsolvable in principle, due to the unavoidable trade-off between updating and not updating.[1]
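A minimal sketch of the shape I have in mind (the learning rule, names, and numbers are invented for illustration; this is not a formalization of Abram’s post):

```python
def slow_policy_selection(evidence, slow_lr=0.02):
    """Toy sketch: a small agent holds a slowly-updated estimate of how
    good the policy 'pay up in Counterfactual Mugging' is, and the big
    agent simply follows whichever policy the small agent currently
    recommends."""
    value_of_paying = 1.0          # small agent's prior: paying looks good
    history = []
    for signal in evidence:        # signal > 0: paying worked out; < 0: it didn't
        history.append("pay" if value_of_paying > 0 else "refuse")
        # slow update: the small agent learns much more slowly than the big one
        value_of_paying += slow_lr * (signal - value_of_paying)
    return history

# The big agent keeps paying for a long time, but a persistent stream of
# negative evidence eventually flips the small agent's recommendation.
acts = slow_policy_selection([-1.0] * 300)
```

The point is just the lag: the policy stays fixed over medium horizons (giving you the strategic benefits of updatelessness), while the recommender remains revisable in the long run.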
This also has consequences for how we expect superintelligences to be. If by them having “vague opinions about the future” we mean a wide, but perfectly rigorous and compartmentalized probability distribution over literally everything that might happen, then yes, the way to maximize EV according to that distribution might be some very concrete, very risky move, like re-writing yourself into an algorithm because you think simulators will reward this, even if you’re not sure how well that algorithm performs in this universe.
But that’s not how abstractions or uncertainty work mechanistically! Abstractions help us efficiently navigate the world thanks to their modular, nested, fuzzy structure. If they had to compartmentalize everything in a rigorous and well-defined way, they’d stop working. When you take into account how abstractions really work, the kind of partial updatefulness we see in the world is what we’d expect. I might write about this soon.
Surprisingly, in some conversations others still wanted to “get both updatelessness and updatefulness at the same time”. Or, receive the gains from Value of Information, and also those from Strategic Updatelessness. Which is what Abram and I had in mind when starting work. And is, when you understand what these words really mean, impossible by definition.
Cool connections! Resonates with how I’ve been thinking about intelligence and learning lately.
Some more connections:
Indeed, those savvier traders might even push me to go look up that data (using, perhaps, some kind of internal action auction), in order to more effectively take the simple trader’s money
That’s reward/exploration hacking.
Although I do think most times we “look up some data” in real life it’s not due to an internal heuristic / subagent being strategic enough to purposefully try and exploit others, but rather just because some earnest, simple heuristics that recommend looking up information have scored well in the past.
“They haven’t taken its money yet,” said the Scientist, “But they will before it gets a chance to invest any of my money.”
I think this doesn’t always happen. As good as the internal traders might be, the agent sometimes needs to explore, and that means giving up some of the agent’s money.
Now, if I were an ideal Garrabrant inductor I would ignore these arguments, and only pay attention to these new traders’ future trades. But I have not world enough or time for this; so I’ve decided to subsidize new traders based on how they would have done if they’d been trading earlier.
Here (starting at “Put in terms of Logical Inductors”) I mention other “computational shortcuts” for inductors. Mainly, if two “categories of bets” seem pretty unrelated (they are two different specialized magisteria), then not having thick trade between them won’t lose you much performance (and will save much computation).
You can have “meta-traders” betting on which categories of bets are unrelated (and testing them but only sparsely, etc.), and use them to make your inductor more computationally efficient. Of course object-level traders already do this (decide where to look, etc.), and in the limit this will converge like a Logical Inductor, but I have the intuition this will converge faster (at least, in structured enough domains).
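A toy version of the kind of bet such a meta-trader might make (the correlation test, the threshold, and all names are illustrative stand-ins for whatever “unrelatedness” statistic you’d actually use):

```python
def sample_correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def meta_trader_decouples(returns_a, returns_b, threshold=0.2):
    """Toy meta-trader bet: if trades in category A say (almost) nothing
    about trades in category B, run the two sub-markets separately and
    save the compute spent on cross-category trading."""
    return abs(sample_correlation(returns_a, returns_b)) < threshold
```

Object-level traders can of course learn the same partition implicitly; the claim is just that representing it explicitly should converge faster in structured domains.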
This is of course very related to my ideas and formalism on meta-heuristics.
helps prevent clever arguers from fooling me (and potentially themselves) with overfitted post-hoc hypotheses
This adversarial selection is also a problem for heuristic arguments: Your heuristic estimator might be very good at assessing likelihoods given a list of heuristic arguments, but what if the latter has been selected against your estimator, to drive it in a wrong direction?
Last time I discussed this with them (very long ago), they were just happy to pick an apparently random process to generate the heuristic arguments, that they’re confident enough hasn’t been tampered with.
Something more ambitious would be to have the heuristic estimator also know about the process that generated the list of heuristic arguments, and use these same heuristic arguments to assess whether something fishy is going on. This will never work perfectly, but probably helps a lot in practice.
(And I think this is for similar reasons to why deception might be hard: When not only the output, but also the “thoughts”, of the generating process are scrutinized, it seems hard for it to scheme without being caught.)
Claude learns across different chats. What does this mean?
I was asking Claude 3 Sonnet “what is a PPU” in the context of this thread. For that purpose, I pasted part of the thread.
Claude automatically assumed that OA meant Anthropic (instead of OpenAI), which was surprising.
I opened a new chat, copying the exact same text, but with OA replaced by GDM. Even then, Claude assumed GDM meant Anthropic (instead of Google DeepMind).
This seemed like interesting behavior, so I started toying around (in new chats) with more tweaks to the prompt to check its robustness. But from then on Claude always correctly assumed OA was OpenAI, and GDM was Google DeepMind.
In fact, even when copying the exact same original prompt (which had elicited Claude to take OA to be Anthropic) into a new chat, the mistake no longer happened: neither when I retried many times, nor when I tried the same thing in many different new chats.
Does this mean Claude somehow learns across different chats (inside the same user account)?
If so, this might not happen through a process as naive as “append previous chats as the start of the prompt, with a certain indicator that they are different”, but instead some more effective distillation of the important information from those chats.
Do we have any information on whether and how this happens?
(A different hypothesis is not that the later queries had access to the information from the previous ones, but rather that they were for some reason “more intelligent” and were able to catch up to the real meanings of OA and GDM, where the previous queries were not. This seems way less likely.)
I’ve checked for cross-chat memory explicitly (telling it to remember some information in one chat, and asking about it in the other), and it acts as if it doesn’t have it.
Claude also explicitly states it doesn’t have cross-chat memory, when asked about it.
Might something be happening like “it does have some cross-chat memory, but it’s told not to acknowledge this fact, and it sometimes slips”?
Probably more nuanced experiments are in order. Although note maybe this only happens for the chat webapp, and not different ways to access the API.