this post was written by Tamsin Leake at Orthogonal.
thanks to Julia Persson and mesaoptimizer for their help putting it together.
no familiarity with the Evangelion anime is required to understand this post, and it pretty much doesn't contain any spoilers.
this post explains the justification for, and the math formalization of, the QACI plan for formal-goal alignment. you might also be interested in its companion post, formalizing the QACI alignment formal-goal, which just covers the math in a more straightforward, bottom-up manner.
1. agent foundations & anthropics
🟣 misato — hi ritsuko! so, how's this alignment stuff going?
🟡 ritsuko — well, i think i've got an idea, but you're not going to like it.
🟢 shinji — that's exciting! what is it?
🟡 ritsuko — so, you know how in the sequences and superintelligence, yudkowsky and bostrom talk about how hard it is to fully formalize something which leads to nice things when maximized by a utility function? so much so that it serves as an exercise to think about one's values and consistently realize how complex they are?
🟣 misato — ah, yes, the good old days when we believed this was the single obstacle to alignment.
🔴 asuka barges into the room and exclaims — hey, check this out! i found this fancy new theory on lesswrong about how "shards of value" emerge in neural networks!
🔴 asuka then walks away while muttering something about eiffel towers in rome and waluigi hyperstition…
🟡 ritsuko — indeed. these days, all these excited kids running around didn't learn about AI safety by thinking really hard about what agentic AIs would do — they got here by being spooked by large language models, and as a result they're thinking in all kinds of strange directions, like what it means for a language model to be aligned or how to locate natural abstractions for human values in neural networks.
🟢 shinji — of course that's what we're looking at! look around you, turns out that the shape of intelligence is RLHF'd language models, not agentic consequentialists! why are you still interested in those old ideas?
🟡 ritsuko — the problem, shinji, is that we can't observe agentic AI being published before alignment is solved. when someone figures out how to make AI consequentialistically pursue a coherent goal, whether by using current ML technology or by building a new kind of thing, we die shortly after they publish it.
🟣 misato — wait, isn't that anthropics? i'd rather stay away from that type of thinking, it seems too galaxybrained to reason about…
🟡 ritsuko — you can't really do that either — the "back to square one" interpretation of anthropics, where you don't update at all, is still an interpretation of anthropics. it's kind of like being the kind of person who, when observing having survived quantum russian roulette 20 times in a row, assumes that the gun is broken rather than saying "i guess i might have low quantum amplitude now", and fails to realize that the gun can still kill them — which is bad when all of our hopes and dreams rest on those assumptions. the only vaguely anthropics-ignoring perspective one can take about this is to ignore empirical evidence and stick to inside-view, gears-level prediction of how convergent agentic AI tech is.
🟣 misato — …is it?
🟡 ritsuko — of course it is! on inside view, all the usual MIRI arguments hold just fine. it just so happens that if you keep running a world forwards, and select only for worlds that we haven't died in, then you'll start observing stranger and stranger non-consequentialist AI. you'll start observing the kind of tech we get when we just dumbly scale up bruteforce-ish methods like machine learning, with somehow nobody publishing insights as to how to make those systems agentic or consequentialistic.
🟢 shinji — that's kind of frightening!
🟡 ritsuko — well, it's where we are. we already thought we were small in space; now we know that we're also small in probabilityspace. the important part is that it doesn't particularly change what we should do — we should still try to save the world, in the most straightforward fashion possible.
🟣 misato — so all the excited kids running around saying we have to figure out how to align language models or whatever…
🟡 ritsuko — they're chasing a chimera. impressive LLMs are not what we observe because they're what powerful AI looks like — they're what we observe because they're what powerful AI doesn't look like. they're there because that's as impressive as you can get short of something that kills everyone.
🟣 misato — i'm not sure most timelines are dead yet, though.
🟡 ritsuko — we don't know if "most" timelines are alive or dead from agentic AI, but we know that however many are dead, we couldn't have known about them. if every AI winter was actually a bunch of timelines dying, we wouldn't know.
🟣 misato — you know, this doesn't necessarily seem so bad. considering that confused alignment people are what caused the appearance of the three organizations trying to kill everyone as fast as possible, maybe it's better that alignment research seems distracted with things that aren't as relevant, rather than figuring out agentic AI.
🟡 ritsuko — you can say that again! there's already enough capability hazards being carelessly published everywhere as it is, including on lesswrong. if people were looking in the direction of the kind of consequentialist AI that actually determines the future, this could cause a lot of damage. good thing there's a few very careful people here and there, studying the right thing, but being very careful not to publish any insights. but this is indeed the kind of AI we need to figure out if we are to save the world.
🟢 shinji — whatever kind of anthropic shenanigans are at play here, they sure seem to be saving our skin! maybe we'll be fine because of quantum immortality or something?
🟣 misato — that's not how things work, shinji. quantum immortality explains how you got here, but doesn't help you save the future.
🟢 shinji sighs, with a defeated look on his face — …so we're back to the good old MIRI alignment: we have to perfectly specify human values as a utility function and figure out how to align AI to it? this seems impossible!
🟡 ritsuko — well, that's where things get interesting! now that we're talking about coherent agents whose actions we can reason about, agents whose instrumentally convergent goals such as goal-content integrity would be beneficial if they were aligned, agents who won't mysteriously turn bad eventually because they're not yet coherent agents, we can actually get to work putting something together.
🟣 misato — …and that's what you've been doing?
🟡 ritsuko — well, that's kind of what agent foundations had been about all along, and what got rediscovered elsewhere as "formal-goal alignment": designing an aligned coherent goal and figuring out how to make an AI that is aligned to maximizing it.
2. embedded agency & untractability
🟢 shinji — so what's your idea? i sure could use some hope right now, though i have no idea what an aligned utility function would even look like. i'm not even sure what kind of type signature it would have!
🟡 ritsuko smirks — so, the first important thing to realize is that the challenge of designing an AI that emits output which saves the world can be formulated like this: design an AI trying to solve a mathematical problem, and make the mathematical problem be analogous enough to "what kind of output would save the world" that the AI, by solving it, happens to also save our world.
🟢 shinji — but what does that actually look like?
🟣 misato — maybe it looks like "what output should you emit, which would cause your predicted sequence of stimuli to look like a nice world?"
🟡 ritsuko — what do you think would actually happen if an AI were to succeed at this?
🟣 misato — oh, i guess it would hack its stimuli input, huh. is there even a way around this problem?
🟡 ritsuko — what you're facing is a facet of the problem of embedded agency. you must make an AI which thinks about the world which contains it, not just about a system that it feels like it is interacting with.
🟡 ritsuko — the answer — as in PreDCA — is to model the world from the top down, and ask: "look into this giant universe. you're in there somewhere. which action should the you-in-there-somewhere take, for this world to have the most expected utility?"
🟢 shinji — expected utility? by what utility function?
🟡 ritsuko — we're coming to it, shinji. there are three components to this: the formal-goal-maximizing AI, the formal goal, and the glue in-between. embedded agency and decision theory are parts of this glue, and they're core to how we think about the whole problem.
🟣 misato — and this top-down view works? how the hell would it compute the whole universe? isn't that uncomputable?
🟡 ritsuko — how the hell do you expect AI would do expected utility maximization at all? by making reasonable guesses. i can't compute the whole universe from the big-bang up to you right now, but if you give me a bunch of math which i'd understand to say "in worlds being computed forwards starting at some simple initial state and eventually leading to this room right now with shinji, misato, and ritsuko in it, what is shinji more likely to be thinking about: his dad, or the pope's uncle?", then i can still make a reasonable guess.
🟡 ritsuko — on the one hand, the question is immensely computationally expensive — it asks to compute the entire history of the universe up to this shinji! but on the other hand, it is talking about a world which we inhabit, and about which we have the ability to make reasonable guesses. if we build an AI that is smarter than us, you can bet it'll be able to make guesses at least as well as this.
🟣 misato — i'm not convinced. after all, we relied on humans to make this guess! of course you can guess about shinji, you're a human like him. why would the AI be able to make those guesses, being the alien thing that it is?
🟡 ritsuko — i mean, one of its options is to ask humans around. it's not like it has to do everything by itself on its single computer, here — we're talking about the kind of AI that agentically saves the world, and has access to all kinds of computational resources, including humans if needed. i don't think it'll actually need to rely on human compute a lot, but the fact that it can serves as a kind of existence proof for its ability to produce reasonable solutions to these problems. not optimal solutions, but reasonable solutions — eventually, solutions that will be much better than any human or collection of humans could come up with short of getting help from aligned superintelligence.
🟢 shinji — but what if the worlds that are actually described by such math are not in fact this world, but strange alien worlds that look nothing like ours?
🟡 ritsuko — yes, this is also part of the problem. but let's not keep moving the goalpost here. there are two problems: make the formal problem point to the right thing (the right shinji in the right world), and make an AI that is good at finding solutions to that problem. both seem like problems we can solve with some confidence; but we can't just keep switching back and forth between the two.
🟡 ritsuko — if you have to solve two problems A and B, then you have to solve A assuming B is solved, and then solve B assuming A is solved. then, you've got a pair of solutions which work with one another. here, we're solving the problem of whether an AI would be able to solve this problem, assuming the problem points to the right thing; later we'll talk about how to make the problem point to the right thing assuming we have an AI that can solve it.
🟢 shinji — are there any actual implementation ideas for how to build such a problem-solving AI? it sure sounds difficult to me!
🟣 misato, carefully peeking into the next room — hold on. i'm not actually quite sure who's listening — it is known that capabilities people like to lurk around here.
🟤 kaji can be seen standing against a wall, whistling, pretending not to hear anything.
🟡 ritsuko — right. one thing i will reiterate is that we should not observe a published solution to "how to get powerful problem-solving AI" before the world is saved. this is in the class of problems where we die shortly after a solution is found and published, so our lack of observation of such a solution is not much evidence for its difficulty.
3. one-shot AI
🟡 ritsuko — anyways, to come back to embedded agency.
🟣 misato — ah, i had a question. the AI returns a first action which it believes would overall steer the world in a direction that maximizes its expected utility. and then what? how does it get its observation, update its model, and take the next action?
🟡 ritsuko — well, there are a variety of clever schemes to do this, but an easy one is to just not.
🟣 misato — what?
🟡 ritsuko — to just not do anything after the first action. i think the simplest thing to build is what i call a "one-shot AI", which halts after returning an action. and then we just run the action.
🟢 shinji — "run the action"?
🟡 ritsuko — sure. we can decide in advance that the action will be a linux command to be executed, for example. the scheme does not really matter, so long as the AI gets an output channel which offers some pretty easy bits of steering over the world.
🟣 misato — hold on, hold on. a single action? what do you intend for the AI to do, output a really good pivotal act and then hope things get better?
🟡 ritsuko — have a little more imagination! our AI — let's call it AI₁ — will almost certainly return a single action that builds and then launches another, better AI, which we'll call AI₂. a powerful AI can absolutely do this, especially if it has the ability to read its own source-code for inspiration, but probably even without that.
🟡 ritsuko — …and because it's solving the problem "what action would maximize utility when inserted into this world", it will understand that AI₂ needs to have embedded agency and the various other aspects that are instrumental to it — goal-content integrity, robustly delegating RSI, and so on.
🟢 shinji — "RSI"? what's that?
🟣 misato sighs — you know, it keeps surprising me how many youths don't know about the acronym RSI, which stands for Recursive Self-Improvement. it's pretty indicative of how little they're thinking about it.
🟢 shinji — i mean, of course! recursive self-improvement is an obsolete old MIRI idea that doesn't apply to the AIs we have today.
🟣 misato — right, kids like you got into alignment by being spooked by chatbots. (what silly things do they even teach you in class these days?)
🟣 misato — you have to realize that the generation before you — ritsuko's generation and mine — didn't have the empirical evidence that AI was gonna be impressive. we started from something like the empty string, or at least from coherent arguments: we had to actually build a gears-level, inside-view understanding of what AI would be like, and what it would be capable of.
🟣 misato — to me, one of the core arguments that sold me on the importance of AI and alignment was recursive self-improvement — the idea that AI being better than humans at designing AI would be a very special, very critical point in time, downstream of which AI would be able to beat humans at everything.
🟢 shinji — but this turned out irrelevant, because AI is getting better than humans without RSI—
🟡 ritsuko — again, false. we can only observe AI getting better than humans at intellectual tasks without RSI, because when RSI is discovered and published, we die very shortly thereafter. you have a sort of consistent survivorship bias, where you keep thinking of a whole class of things as irrelevant because they don't seem impactful, when in reality they're the most impactful; they're so impactful that when they happen you die and are unable to observe them.
4. action scoring
🟣 misato — so, i think i have a vague idea of what you're saying, now. top-down view of the universe, which is untractable but that's fine apparently, thanks to some mysterious capabilities; one-shot AI to get around various embedded agency difficulties. what's the actual utility function to align to, now? i'm really curious. i imagine a utility function assigns a value between 0 and 1 to any, uh, entire world? world-history? multiverse?
🟡 ritsuko — it assigns a value between 0 and 1 to any distribution of worlds, which is general enough to cover all three of those cases. but let's not get there yet; remember how the thing we're doing is untractable, and we're relying on an AI that can make guesses about it anyways? we're gonna rely on that fact a whole lot more.
🟣 misato — oh boy.
🟡 ritsuko — so, first: we're not passing a utility function. we're passing a math expression describing an "action-scoring function" — that is to say, a function attributing scores to actions rather than to distributions over worlds. we'll make the program deterministic and make it ignore all input, such that the AI has no ability to steer its result — its true result is fully predetermined, and the AI has no ability to hijack that true result.
🟣 misato — wait, "hijack it"? aren't we assuming an inner-aligned AI, here?
🟡 ritsuko — i don't like this term, "inner-aligned"; just like "AGI", people use it to mean too many different and unclear things. we're assuming an AI which does its best to pick an answer to a math problem. that's it.
🟡 ritsuko — we don't make an AI which tries to not be harmful with regards to its side-channels, such as hardware attacks — except for its output, it needs to be strongly boxed, such that it can't destroy our world by exploiting software or hardware vulnerabilities. similarly, we don't make an AI which tries to output a solution we like; it tries to output a solution which the math would score high. narrowing what we want the AI to do greatly helps us build the right thing, but it does add constraints to our work.
🟡 ritsuko starts scribbling on a piece of paper on her desk — let's write down some actual math here. let's call Ω the set of world-states, ΔΩ the set of distributions over world-states, and let A be the set of actions.
🟢 shinji — what are the types of all of those?
🟡 ritsuko — let's not worry about that, for now. all we need to assume for the moment is that those sets are countable. we could define both Ω and A — define them both as the set of finite bitstrings — and this would functionally capture all we need. as for distributions over world-states: we'll define ΔX for any countable set X, and we'll call "mass" the number which a distribution associates to any element.
🟣 misato — woah, woah, hold on, i haven't looked at math in a while. what do all those squiggles mean?
🟡 ritsuko — ΔX is defined as the set of functions f : X → [0;1], which take an x ∈ X and return a number between 0 and 1, such that if you take the f(x) of all x's in X and add those up, you get a number not greater than 1: Σ_{x ∈ X} f(x) ≤ 1. note that i use a notation of sums where the variables being iterated over are above the Σ and the constraints that must hold are below it — so this sum adds up all of the f(x) for each x such that x ∈ X.
🟣 misato — um, sure. i mean, i'm not quite sure what this represents yet, but i guess i get it.
🟡 ritsuko — the set of distributions ΔX over X is basically like saying "for any finite amount of mass at most 1, what are some ways to distribute that mass among some or all of the x's?" each of those ways is a distribution; each of those ways is an f in ΔX.
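as a toy illustration of this ΔX condition (a hypothetical sketch, not part of the post's math — it uses a finite dictionary of masses, where the definition allows any countable set):

```python
from fractions import Fraction

def is_distribution(masses):
    """Check that an assignment of masses is a valid element of ΔX:
    every mass lies in [0, 1], and the total mass is at most 1."""
    total = Fraction(0)
    for mass in masses.values():
        if not (0 <= mass <= 1):
            return False
        total += Fraction(mass)
    return total <= 1

# a distribution need not use up all of its mass:
print(is_distribution({"world_a": Fraction(1, 2), "world_b": Fraction(1, 4)}))  # True
# ...but its total mass may not exceed 1:
print(is_distribution({"world_a": Fraction(3, 4), "world_b": Fraction(1, 2)}))  # False
```

note that a valid distribution may sum to less than 1, matching the "not greater than 1" condition rather than the usual "exactly 1" of probability distributions.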
🟡 ritsuko — anyways. the AI will take as input an untractable math expression of type A → [0;1], and return a single action a ∈ A. note that we're in math here, so "is of type" and "is in set" are really the same thing; we'll use ∈ to denote both set membership and type membership, because they're the same concept. for example, A → [0;1] is the set of all functions taking as input an a ∈ A and returning an element of [0;1] — a real number between 0 and 1.
🟢 shinji — hold on, a real number?
🟡 ritsuko — well, a real number, but we're passing to the AI a discrete piece of math which will only ever describe countable sets, so we'll only ever describe countably many of those real numbers. infinitely many, but countably infinitely many.
🟣 misato — so the AI has type (A → [0;1]) → A, and we pass it an action-scoring function of type A → [0;1] to get an action. checks out. where do utility functions come in?
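this type signature can be sketched in code (a hypothetical toy, where a brute-force argmax over a small finite candidate list stands in for the untractable search over all of A):

```python
from typing import Callable

Action = str  # toy stand-in for A, the set of finite bitstrings

def one_shot_ai(score: Callable[[Action], float],
                candidates: list[Action]) -> Action:
    """Toy stand-in for an AI of type (A -> [0;1]) -> A: return the
    candidate action that the scoring function rates highest. The AI
    described in the post would instead make educated guesses about
    an untractable scoring expression."""
    return max(candidates, key=score)

# usage, with a toy scoring function that prefers longer bitstrings:
best = one_shot_ai(lambda a: min(len(a) / 10, 1.0), ["01", "0110", "011010"])
print(best)  # 011010
```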
🟡 ritsuko — they don't need to come in at all, actually! we'll be defining a piece of math which describes the world for the purpose of pointing at the humans who will decide on a scoring function, but the scoring function will only be over actions the AI should take.
🟡 ritsuko — the AI doesn't need to know that its math points to the world it's in; and in fact, conceptually, it isn't told this at all. in a fundamental, conceptual manner, it is not being told to care about the world it's in — if it did, it would take over our world and kill everyone in it to acquire as much compute as possible, and plausibly along the way drop an anvil on its own head, because it doesn't have embedded agency with regards to the world around itself.
🟡 ritsuko — we will just very carefully box it such that its only meaningful output into our world, the only bits of steering it can predictably use, are those of the action it outputs. and we will also have very carefully designed it such that the only thing it ultimately cares about is that its output have as high an expected score as possible — it will care about this intrinsically, and about nothing else intrinsically, such that doing this will be more important than hijacking our world through that output.
🟡 ritsuko — this meaning of "inner-alignment" is still hard to accomplish, but it is much better defined, much narrower, and thus hopefully much easier to accomplish than the "full" embedded-from-the-start alignment which very slow, very careful corrigibility-based AI alignment would result in.
5. early math & realityfluid
🟣 misato — so what does that scoring function actually look like?
🟡 ritsuko — you know what, i hadn't started mathematizing my alignment idea yet; this might be a good occasion to get started on that!
🟡 ritsuko wheels in a whiteboard — so, what i expect is that the order in which we're gonna go over the math is going to be the opposite order to that of the final math report on QACI. here, we'll explore things from the top down, filling in details as we go — whereas the report will go from the bottom up, fully defining constructs and then using them.
🟡 ritsuko — this is roughly what we'll be doing here: go over all hypotheses the AI could have within some set of hypotheses, called Ξ; measure their probability — how likely they are to correspond to our world — and how good the actions are in them. this is the general shape of expected scoring for actions.
🟢 shinji — wait, the set of hypotheses is called Ξ, not H? that's a bit confusing.
🟡 ritsuko — this is pretty standard in math, shinji. the reason to call the set of hypotheses Ξ is because, as explained before, sets are also types, and so a hypothesis ξ will be of type Ξ.
🟣 misato — what's in a hypothesis, exactly?
🟡 ritsuko — the set of all relevant beliefs about things. or rather, the set of all relevant beliefs except for logical facts. logical uncertainty will be a thing on the AI's side, not in the math — this math lives in the realm of "platonic perfect true math", and the AI will have beliefs about what its various parts tend to result in as one kind of logical belief, just like it'll have beliefs about other logical facts.
🟣 misato — so, a mathematical object representing empirical beliefs?
🟡 ritsuko — i would rather put it as a pair of: beliefs about what's real ("realityfluid" beliefs), and beliefs about where, in the set of real things, the AI is ("indexical" beliefs). but this can be simplified by allocating realityfluid across all mathematical/computational worlds (this is equivalent to assuming the tegmark level 4 multiverse is real, and can be done by assuming the cosmos to be a "universal complete" program running all computations), and then all beliefs are indexical. these two possibilities work out to pretty much the same math, anyways.
🟢 shinji — what the hell is "realityfluid"???
🟡 ritsuko — it's a very long story, i'm afraid.
🟣 misato — think of it as a measure of how some constant amount of "matteringness"/"realness" — typically 1 unit of it — is distributed across possibilities. even though it kinda mechanistically works like probability mass, it's "in the other direction": it represents what's actually real, rather than representing what we believe.
🟢 shinji — why would it sum to 1? what if there's an infinite amount of stuff out there?
🟣 misato — your realityfluid still needs to sum up to some constant. if you allocate an infinite amount of matteringness, things break and don't make sense.
🟡 ritsuko — indeed. this is why the most straightforward way to allocate realityfluid is to just imagine that the set of all that exists is a universal program whose computation is cut into time-steps each doing a constant amount of work, and then allocate some diminishing quantities of realityfluid to each time step.
🟣 misato — like saying that compute step number n has realityfluid 2^-n?
🟡 ritsuko — that would indeed normalize, but it diminishes exponentially fast. this makes world-states exponentially unlikely in the amount of compute they exist after; and there are philosophical reasons to say that exponential unlikeliness is what should count as non-existing.
🟢 shinji — what the hell are you talking about??
🟡 ritsuko hands shinji a paper called "Why Philosophers Should Care About Computational Complexity" — look, this is a whole other tangent, but basically, polynomial amounts of computation correspond to "doing something", whereas exponential amounts of computation correspond to "magically obtaining something out of the ether", and this sort of ramifies naturally across the rest of computational complexity applied to metaphysics and philosophy.
🟡 ritsuko — so instead, we can say that computation step number n has realityfluid 1/(n·(n+1)). this only diminishes quadratically, which is satisfactory.
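as an illustrative numeric check (taking 1/(n·(n+1)) as a hypothetical choice of quadratically-diminishing allocation): it telescopes, since 1/(n·(n+1)) = 1/n - 1/(n+1), so its total mass converges to exactly 1, just as 2^-n does, while vanishing only quadratically fast:

```python
# partial sums of the two candidate realityfluid allocations:
# 2^-n normalizes but shrinks exponentially fast, while
# 1/(n*(n+1)) also normalizes (its partial sum telescopes to
# 1 - 1/(N+1)) while shrinking only quadratically.
N = 10**6
exp_total = sum(2.0**-n for n in range(1, 60))
quad_total = sum(1.0 / (n * (n + 1)) for n in range(1, N + 1))
print(round(exp_total, 9))  # 1.0
print(round(quad_total, 4))  # 1.0
```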
🟡 ritsuko — oh, and for the same reason, the universal program needs to be quantum — for example, a quantum equivalent of the classical universal program, implemented on something like a quantum turing machine. otherwise, unless BQP=BPP, quantum multiverses like ours might be exponentially expensive to compute, which would be strange.
🟢 shinji — why 1/(n·(n+1))? why not 6/(π²·n²) or 90/(π⁴·n⁴)?
🟡 ritsuko — those do indeed all normalize — but we pick 1/(n·(n+1)) because at some point you just have to pick something, and it is a natural, occam/solomonoff-simple choice which works. look, just—
🟢 shinji — and why are we assuming the universe is made of discrete computation anyways? isn't stuff made of real numbers?
🟡 ritsuko sighs — look, this is what the church-turing-deutsch principle is about. for any universe made up of real numbers, you can approximate it thusly:
compute 1 step of it with every number truncated to its first binary digit of precision
compute 2 steps of it with every number truncated to its first 2 binary digits of precision
and so on, recomputing from the start each time: 1 time step with 1 bit of precision, then 2 time steps with 2 bits of precision, then 3 with 3. for any piece of branch-spacetime which is only finitely far away from the start of its universe, there exists a threshold at which it starts being computed in a way that is indistinguishable from the version with real numbers.
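this scheme can be sketched with a toy dynamical system (the logistic map, a hypothetical stand-in for a "universe made of real numbers"):

```python
def truncate(x: float, bits: int) -> float:
    """Truncate x to `bits` binary digits of fractional precision."""
    scale = 2 ** bits
    return int(x * scale) / scale

def step(x: float) -> float:
    """One step of a toy 'real-numbered' dynamics (the logistic map)."""
    return 3.7 * x * (1.0 - x)

def stage(n: int, x0: float = 0.5) -> list[float]:
    """The n-th stage of the scheme: recompute n steps from the start,
    with every intermediate value truncated to n bits of precision."""
    xs = [truncate(x0, n)]
    for _ in range(n):
        xs.append(truncate(step(xs[-1]), n))
    return xs

# any fixed time step is eventually computed at ever-higher precision:
for n in (4, 16, 52):
    print(n, stage(n)[:3])
```

each stage throws away its predecessor and recomputes from the start, which is what lets a single discrete program approximate every finite prefix of the real-numbered history arbitrarily well.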
🟢 shinji — but they're only an approximation of us! they're not the real thing!
🟡 ritsuko sighs — you don't know that. you could be the approximation, and you would be unable to tell. and so, we can work without uncountable sets of real numbers, since they're unnecessary to explain observations, and thus an unnecessary assumption to hold about reality.
🟢 shinji, frustrated — i guess. it still seems pretty contrived to me.
🟡 ritsuko — what else are you going to do? you're expressing things in math, which is made of discrete expressions and will only ever express countable quantities of stuff. there is no uncountableness to grab at and use.
🟣 misato — actually, can't we introduce turing jumps/halting oracles into this universal program? i heard that this lets us actually compute real numbers.
🟡 ritsuko — there's kind-of-a-sense in which that's true. we could say that the universal program has access to a first-degree halting oracle, or a 20th-degree; or maybe it runs for 1 step with a 1st-degree halting oracle, then 2 steps with a 2nd-degree halting oracle, then 3 with 3, and so on.
🟡 ritsuko — your program is now capable, at any time step, of computing an infinite amount of stuff. let's say one of those steps happens to run an entire universe of stuff, including a copy of us. how do you sub-allocate realityfluid? how much do we expect to be in there? you could allocate sub-compute-steps — with a 1st-degree halting oracle executing at step n, you allocate diminishing quantities of realityfluid to each of the infinite sub-steps in the call to the halting-oracle. you're just doing discrete realityfluid allocation again, except now some of the realityfluid in your universe is allocated to people who have obtained results from a halting oracle.
🟡 ritsuko — this works, but what does it get you? assuming halting oracles is kind of a very strange thing to do, and regular computation with no halting oracles is already sufficient to explain this universe. so we don't. but sure, we could.
🟢 shinji ruminates, unsure where to go from there.
🟣 misato interrupts — hey, do we really need to cover this? let's say you found out that this whole view of things is wrong. could you fix your math then, to whatever is the correct thing?
🟡 ritsuko waves around — what?? what do you mean, if it's wrong?? i'm not rejecting the premise that i might be wrong here, but, like, my answer here depends a lot on in what way i'm wrong and on what the better / more likely correct thing is. so, i don't know how to answer that question.
🟣 misato snaps shinji back to attention — that's fair enough, i guess. well, let's get back on track.
6. precursor assistance
🟡 ritsuko — so, one insight i got for my alignment idea came from PreDCA, which stands for Precursor Detection, Classification, and Assistance. it consists of mathematizations for:
the AI locating itself within possibilities
locating the high-agenticness thing which had lots of causation-bits onto the AI — call it the "Precursor". this is supposed to find the human user who built/launched the AI. (Detection)
a bunch of criteria to ensure that the precursor is the intended human user and not something else (Classification)
extrapolating that precursor's utility function, and maximizing it (Assistance)
🟣 misato — what the hell kind of math would accomplish that?
🟡 ritsuko — well, it's not entirely clear to me. some of it is explained; other parts seem to be expected to just work out naturally. in any case, this isn't so important — the "Learning Theoretic Agenda" into which PreDCA fits is not fundamentally similar to mine, and i do not expect it to be the kind of thing that saves us in time. as far as i predict, that agenda already purchased most of the dignity points it will ever cash out by the time alignment is solved: namely, when it inspired my own ideas.
🟢 shinji — and your agenda saves us in time?
🟡 ritsuko — a lot more likely so, yes! for one, i am not trying to build an entire theory of intelligence and machine learning, and i'm not trying to develop an elegant new form of bayesianism whose model of the world has concerning philosophical ramifications which, while admittedly possibly only temporary, make me concerned about the coherency of the whole edifice. what i am trying to do is hack together the minimum viable world-saving machine, about which we'd have enough confidence that launching it is better expected value than not launching it.
🟡 ritsuko — anyways, the important thing is that that idea made me think "hey, what else could we do to make even more sure that the selected precursor is the human user we want, and not something else like a nearby fly or the process of evolution?" and then i started to think of some clever schemes for locating the AI in a top-down view of the world, without having to decode physics ourselves, but rather by somehow pointing to the user "through" physics.
🟣 misato – what does that mean, exactly?
🟡 ritsuko – well, remember how PreDCA points to the user from-the-top-down? the way it tries to locate the user is by looking for patterns, in the giant computation of the universe, which satisfy certain criteria. this fits into the general notion of generalized computation interpretability, which is fundamentally needed in order to care about the world: you want to detect not just simulated moral patients, but arbitrarily complexly simulated moral patients. so you need this anyways, and it is what "looking inside the world to find stuff, no matter how it's encoded" looks like.
🟣 misato – and what sort of patterns are we looking for? what are the types here?
🟡 ritsuko – as far as i understand, PreDCA looks for programs, or computations, which take some input and return a policy. my own idea is to locate something less abstract, about which we can actually have information-theoretic guarantees: bitstrings.
🟣 misato – …just raw bitstrings?
🟡 ritsuko – that's right. the idea here is kinda like doing an incantation, except the incantation we're locating is a very large piece of data which is unlikely to be replicated outside of this world. imagine generating a very large (several gigabytes) file, and then asking the AI "look for pieces of information, in the set of all computations, which look like that pattern." we call "blobs" such bitstrings, serving as anchors to find our world, and our location within it, in the set of possible world-states and locations-within-them.
7. blob location
🟡 ritsuko – for example, let's say the universe is a conway's game of life. then, the AI could have a set of hypotheses as programs which take as input the entire state of the conway's game of life grid at some instant, and return a bitstring which must be equal to the blob.
🟡 ritsuko – first, we define Ω (uppercase omega, a set of lowercase omegas ω) as the set of "world-states" – states of the grid, defined as the set of cell positions whose cell is alive: Ω ≔ {ω ∈ 𝒫(ℤ²) | #ω ∈ ℕ}.
🟢 shinji – what's ℤ² and 𝒫?
🟡 ritsuko – ℤ² is the set of pairs whose elements are both a member of ℤ, the set of relative integers. so ℤ² is the set of pairs of relative integers – that is, grid coordinates. then, 𝒫(ℤ²) is the set of subsets of ℤ². finally, #ω is the size of set ω – requiring that #ω ∈ ℕ is akin to requiring that ω is a finite set, rather than infinite. let's also define:
𝔹 ≔ {0, 1} as the set of booleans
𝔹* as the set of finite bitstrings
𝔹ⁿ as the set of bitstrings of length n
#b as the length of bitstring b
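to make the types concrete, here's a tiny python sketch of these definitions (the names `blob_at`, the particular grid coordinates, and the 3-cell world-state are all invented for this example): a world-state is a finite set of live-cell coordinates, and a blob is just a bitstring.

```python
# a world-state ω is a finite set of live-cell coordinates (members of Z²),
# represented here as a python set of (x, y) tuples; a blob is a bitstring,
# represented as a str of "0"/"1" characters.

# a tiny world-state: three live cells
omega = {(0, 0), (1, 0), (2, 1)}

def blob_at(omega, cells):
    """read a blob out of a world-state by checking, for each listed
    coordinate, whether its cell is alive (1) or dead (0)."""
    return "".join("1" if c in omega else "0" for c in cells)

# reading a 4-bit blob off four chosen grid coordinates
b = blob_at(omega, [(0, 0), (1, 0), (1, 1), (2, 1)])
print(b)  # -> "1101"
```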
🟡 ritsuko – what do you think "locate blob b in world-state ω" could look like, mathematically?
🟣 misato – let's see – i can use the set of bitstrings of the same length as b, which is 𝔹^#b. let's build a set of functions from Ω to 𝔹^#b…
🟢 shinji – wait, Ω → 𝔹^#b is the set of functions from Ω to 𝔹^#b. but we were talking about programs from Ω to 𝔹^#b. is there a difference?
🟡 ritsuko – this is a very good remark, shinji! indeed, we need to do a bit more work; for now we'll just posit that for any sets X and Y, X ⇒ Y is the set of always-halting, always-succeeding programs taking as input an X and returning a Y.
🟣 misato – let's see – what about {f ∈ Ω ⇒ 𝔹^#b | f(ω) = b}?
🟡 ritsuko – you're starting to get there – this is indeed the set of programs which return b when taking ω as input. however, it's merely a set – it's not very useful as is. what we'd really want is a distribution over such functions. not only would this give a weight to different functions, but summing over the entire distribution could also give us some measure of "how easy it is to find b in ω". remember the definition of distributions, ΔX?
🟢 shinji – oh, i remember! it's the set of functions in X → [0; 1] which sum up to at most one over all of X.
🟡 ritsuko – indeed! so, we're gonna posit what i'll call kolmogorov simplicity, K, which is like kolmogorov complexity except that it's a distribution, never returns 0 nor 1 for a single element, and importantly it returns something like the inverse of complexity. it gives some amount of "mass" to every element in some (countable) set X.
🟣 misato – oh, i know then! the distribution, for each x ∈ X, must return something like 2^(−complexity(x)).
🟡 ritsuko – that's right! we can start to define blob location as the function that takes as input a pair of a world-state ω and a blob b of length n, and returns a distribution over programs f that "find" b in ω. plus, since functions are weighed by their kolmogorov simplicity, for complex b's they're "encouraged" to find the bits of complexity of b in ω, rather than those bits of complexity being contained in f itself.
🟡 ritsuko – note also that this distribution returns, for any function f, either 0 or K(f), which entails that for any given ω, the sum over all f's adds up to less than one – that sum represents in a sense "how hard it is to find b in ω" or "the probability that b is somewhere in ω".
🟡 ritsuko – as for the notation: blob location returns a distribution, which is itself a function – so we first apply it to the pair (ω, b), and then we sample the resulting distribution on f.
🟢 shinji – "the sum represents"? what do you mean by "represents"?
🟡 ritsuko – well, it's the concept which i'm trying to find a "true name" for, here. "how much is the blob b located in world-state ω? well, as much as the sum of the kolmogorov simplicity of every program that returns b when taking ω as input".
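as a toy illustration of that sum (a sketch only: the tiny hand-picked `programs` list stands in for the set of all always-halting programs, and the made-up "program lengths" stand in for kolmogorov complexity, weighted as 2^(−length)):

```python
from fractions import Fraction

# stand-ins for programs f : world-state -> bitstring, each tagged with an
# invented "program length" used for the 2^(-length) simplicity weight.
programs = [
    (3, lambda omega: "10"),                                   # constant; ignores the world
    (5, lambda omega: "1" + ("0" if len(omega) % 2 else "1")), # actually reads the world-state
    (4, lambda omega: "00"),                                   # finds the wrong thing
]

def location_mass(omega, b):
    """sum the simplicity 2^(-length) of every program that returns b
    when run on omega -- 'how much is b located in omega?'"""
    return sum(Fraction(1, 2 ** length)
               for length, f in programs if f(omega) == b)

omega = {(0, 0), (1, 0), (2, 1)}   # a world-state with 3 live cells
print(location_mass(omega, "10"))  # constant and world-reading programs both fire -> 5/32
```

the mass of a blob is higher when several simple programs can extract it, which is the "how easy is it to find b in ω" intuition from the dialogue.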
🟣 misato – and then what? i feel like my understanding of how this ties into anything is still pretty loose.
🟡 ritsuko – so, we're actually gonna get two things out of blob location: we're gonna get how much ω contains b (as the sum of K(f) for all f's), but we're also gonna get a way to obtain another world-state that is like ω, except that b is replaced with something else.
🟢 shinji – how are we gonna get that??
🟡 ritsuko – here's my idea: we're gonna make f return not just b but rather a pair (b, τ) – the blob together with a "free bitstring" τ (tau) which f can use to store "everything in the world-state except b". and we'll also sample programs g which "put the world-state back together" given the same free bitstring, and a possibly different counterfactual blob than b.
🟣 misato – so, blob location is defined as something like…
🟢 shinji stares at the math for a while – actually, shouldn't the statement be more general? you don't just want to work on b, you want to work on any other blob of the same length.
🟡 ritsuko – that's correct shinji! let's call the original blob b the "factual blob", let's call other blobs of the same length we could insert in its stead "counterfactual blobs" and write them as b′ – we can establish that ′ (prime) will denote counterfactual things in general.
🟣 misato – so it's more like…
🟣 misato – …what should g(τ, b′) equal, exactly?
🟡 ritsuko – we don't know what it should equal, but we do know something about what it equals: f should work on that counterfactual world-state and find the same counterfactual blob again.
🟡 ritsuko – actually, let's make blob location be merely a distribution over functions that produce counterfactual world-states from counterfactual blobs – let's call those "counterfactual insertion functions" and denote them γ, and their set Γ (gamma) – and we'll encapsulate f, g and τ away from the rest of the math:
🟢 shinji – isn't that a bit circular?
🟡 ritsuko – well, yes and no. it leaves a lot of degrees of freedom to f and g, perhaps too much. let's say we had some similarity function over world-states – let's not worry about how it works. then blob location could weigh each "blob location" by how similar the counterfactual world-states are, when sampled over all counterfactual blobs.
🟣 misato – maybe we should also constrain the programs by how long they take to run?
🟡 ritsuko – ah yes, good idea. let's say that for f ∈ X ⇒ Y and x ∈ X, we can measure how long it takes to run program f on input x, in some number of steps each doing a constant amount of work – such as steps of compute in a turing machine.
🟡 ritsuko – (i've also replaced one expression with a shorter one, since they're equal anyways)
🟣 misato – where does the first sum end, exactly?
🟡 ritsuko – it applies to the whole– oh, you know what, i can achieve the same effect by flattening the whole thing into a single sum, and renaming one of the bound variables to avoid confusion.
🟢 shinji – are we still operating in conway's game of life here?
🟡 ritsuko – oh yeah, now might be a good time to start generalizing. we'll carry around not just world-states ω, but also initial world-states α (alpha). those are gonna determine the start of universes – distributions of world-states being computed-over-time – and we'll use them when we're computing world-states forwards or comparing the age of world-states. the similarity function probably needs this, for example, so we'll need to pass α to blob location, which changes its type accordingly:
8. constrained mass notation
🟢 shinji – i notice that you're multiplying together your "kolmogorov simplicities", and now dividing by a sum of how long the programs take to run. what's going on here exactly?
🟡 ritsuko – well, each of those numbers is a "confidence amount" – a scalar between 0 and 1 that says "how much does this iteration of the sum capture the thing we want", like a probability. multiplication is like the logical operator "and", except for confidence ratios, you know.
🟢 shinji – ah, i see. so these sums do something kinda like "expected value" in probability?
🟡 ritsuko – something kinda like that. actually, this notation is starting to get unwieldy. i'm noticing a bunch of this pattern:
🟣 misato – so, if you want to use the standard probability theory notations, you need random variables which–
🟡 ritsuko – ugh, i don't like random variables, because the place at which they get substituted for the sampled value is ambiguous. here, i'll define my own notation:
🟡 ritsuko – it will stand for "constrained mass", and it's basically syntactic sugar for sums: sampling x from a distribution D means "sum over args(D) (where args returns the set of arguments over which a function is defined), and then multiply each iteration of the sum by D(x)". now, we just have to define uniform distributions over finite sets as…
🟢 shinji – x ↦ 1 ⁄ #X, for finite set X?
🟡 ritsuko – that's it! and now, blob location is much more easily written down:
🟢 shinji – huh. you know, i'm pretty skeptical of you inventing your own probability notations, but this is much more readable, once you know what you're looking at.
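the desugaring ritsuko describes can be sketched in LaTeX like this (the symbols 𝔼, D, φ, args and U here are stand-ins chosen for this sketch, not necessarily the notation of the original formulas):

```latex
% the constrained-mass sugar, expanded into an ordinary weighted sum:
\mathbb{E}_{x \sim D}\left[\,\phi(x)\,\right]
  \;=\; \sum_{x \,\in\, \operatorname{args}(D)} D(x) \cdot \phi(x)

% and the uniform distribution over a finite set X:
U(X) \;\coloneqq\; x \mapsto \frac{1}{\#X}
```

note that because D may fail to normalize (its masses can sum to less than one), this "expectation" inherits that property: it measures constrained mass, not a true average.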
🟣 misato – so, are we done here? is this blob location?
🟡 ritsuko – well, i expect that some things are gonna come up later that will make us want to change this definition. but right now, the only improvement i can think of is to replace the two separate simplicities K(f) and K(g) with a joint K((f, g)).
🟣 misato – huh, what's the difference?
🟡 ritsuko – well, now we're sampling both from kolmogorov simplicity at the same time, which means that if there is some large piece of information that they both use, they won't be penalized for using it twice but only once – a tuple containing two elements which have a lot of information in common only has that information counted once by K.
🟣 misato – and we want that?
🟡 ritsuko – yes! there are some cases where we'd want two mathematical objects to have a lot of information in common, and other places where we'd want them not to. here, it is clearly the former: we want the program that "deconstructs" the world-state into blob and everything-else, and the function that "reconstructs" a new world-state from a counterfactual blob and the same everything-else, to be able to share information as to how they do that.
9. what now?
🟢 shinji – so we've put together a true name for "piece of data in the universe which can be replaced with counterfactuals". that's pretty nifty, i guess, but what do we do with it?
🟡 ritsuko – now, this is where the core of my idea comes in: in the physical world, we're gonna create a random, unique-enough blob on someone's computer. then we're going to, still in the physical world, read its contents right after generating it. if it looks like a counterfactual (i.e. if it doesn't look like randomness) we'll create another blob of data, which can be recognized as an answer.
🟢 shinji – what does that entail, exactly?
🟡 ritsuko – we'll have created a piece of the real, physical world, which lets us use blob location to get the true name, in pure math, of "what answer would that human person have produced to this counterfactual question?"
🟣 misato – hold on – we already have this. the AI can already have an interface where it asks a human user something, and waits for our answer. and the problem with that is that, obviously, the AI hijacks us or its interface to get whatever answer makes its job easiest.
🟡 ritsuko – aha, but this is different! we can point at a counterfactual question-and-answer chunk-of-time (call it "question-answer counterfactual interval", or "QACI") which is before the AI's launch, in time. we can mathematically define it as being in the past of the AI, by identifying the AI with some other blob which we'll also locate, and demanding that the blob identifying the AI be causally after the user's answer.
🟣 misato – huh.
🟡 ritsuko – that's another idea i got from PreDCA – making the AI pursue the values of a static version of its user in its past, rather than of its user-over-time.
🟢 shinji – but we don't want the AI to lock in our values, we want the AI to satisfy our values-as-they-evolve-over-time, don't we?
🟣 misato – well, shinji, there's multiple ways to phrase your mistake, here. one is that, actually, you do – but if you're someone reasonable, then the values you endorse are some metaethical system which is able to reflect and learn about what's good, and to let people and philosophy determine what can be pursued.
🟣 misato – but you do have values you want to lock in. your meta-values, your metaethics – you don't want those to be able to change arbitrarily. for example, you probably don't want to be able to become someone who wants everyone to maximally suffer. those endorsed, top-level metaethics meta-values are something you do want to lock in.
🟡 ritsuko – put it another way: if you're reasonable, then when the AI asks you what you want inside the question-answer counterfactual interval, you won't answer "i want everyone to be forced to watch the most popular TV show in 2023". you'll answer something more like "i want everyone to be able to reflect on their own values and choose what values and choices they endorse, and how, and for the field of philosophy to be able to continue in these ways in order to figure out how to resolve conflicts", or something like that.
🟣 misato – wait, if the AI is asking the user counterfactual questions, won't it ask the user whatever counterfactual question brainhacks the user into responding with whatever answer makes its job easiest? it can just hijack the QACI.
🟡 ritsuko – aha, but we don't have to have the AI formulate the questions! we could do something like: make the initial question some static question like "please produce an action that saves the world", then the user thinks about it for a bit and returns an answer, and that answer is fed back into another QACI to the user. this loops until one of the user-instances responds with an answer which starts with a special string like "okay, i'm done for sure:", followed by a bunch of text which the AI will interpret as a piece of math describing a scoring over actions, and it'll try to output an action which maximizes that scoring.
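the loop ritsuko describes can be sketched like this (everything here – the `ask_user` stub, its note-counting behavior, and the iteration cap – is invented for illustration; only the sentinel string comes from the dialogue):

```python
DONE = "okay, i'm done for sure:"

def ask_user(question):
    """stub for one question-answer counterfactual interval: the
    counterfactual user reads the question and returns an answer blob."""
    n = int(question.rsplit(" ", 1)[-1]) if question[-1].isdigit() else 0
    if n < 2:
        # not done yet: pass a note to the next instance of themselves
        return f"notes so far: {n + 1}"
    return DONE + " maximize(flourishing)"

def qaci_loop(initial_question, max_steps=100):
    """feed each answer back in as the next question, until some user
    instance signals it's done; the final payload is the scoring text."""
    q = initial_question
    for _ in range(max_steps):
        a = ask_user(q)
        if a.startswith(DONE):
            return a[len(DONE):].strip()
        q = a
    raise RuntimeError("no final answer produced")

print(qaci_loop("please produce an action that saves the world"))
```

the real scheme runs this loop inside pure math over located blobs, not as an interactive program, but the control flow is the same.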
🟢 shinji – so it's kinda like coherent extrapolated volition but for actions?
🟡 ritsuko – sure, i think of it as an implementation of CEV. it allows its user to run a long-reflection process. actually, that long-reflection process even has the ability to use a mathematical oracle.
🟣 misato – how does that work?
10. blob signing & closeness in time
🟡 ritsuko – so, let's define QACI as a function, and this'll clarify what's going on. we start from an initial random factual question blob. QACI takes as parameters a blob location for the question – which, remember, comes in the form of a counterfactual insertion function γ you can use to produce counterfactual world-states with counterfactual blobs! – and a counterfactual question blob, and returns a distribution of possible answers. it's defined as:
🟡 ritsuko – we're, for now, just positing that there is a function (remember that α defines a hypothesis for the initial state, and mechanics, of our universe) which, given a world-state, returns a distribution of world-states that are in its future. so this piece of math samples possible future world-states of the counterfactual world-state where the question was replaced with its counterfactual, and possible locations of possible answers in those world-states.
🟣 misato – a distribution of possible answers? what does that mean, exactly?
🟡 ritsuko – here, the fact that blob location doesn't necessarily sum to 1 – we say that it doesn't normalize – means that QACI's output, summed up over all answers, can be less than 1. in fact, this sum will indicate "how hard is it to find the answer in futures of the counterfactual world-state?" – and QACI uses that as the distribution of answers.
🟣 misato – hmmm. wait, this just finds whichever-answers-are-the-easiest-to-find. what guarantees that what we find looks like an answer at all?
🟡 ritsuko – this is a good point. maybe we should define a signing function which, to any input "payload" of a certain length, associates a blob which is actually highly complex, because it embeds a lot of bits of complexity. for example, maybe the signature of a payload concatenates the payload together with a long cryptographic hash of it and of some piece of information highly entangled with our world-state.
🟢 shinji – we're not signing the counterfactual question, only the answer payload?
🟡 ritsuko – that's right. signatures matter for blobs we're finding; once we've found them, we don't need to sign the counterfactuals we insert in their stead.
🟣 misato – so, it seems to me like how the future-sampling function works here is pretty critical. for example, if it puts a bunch of mass on world-states where some AI is launched, whether ours or another, then that AI will try to fill its future lightcone with answers that would match various counterfactual questions – so that our AI would find those answers instead of ours – and make those answers be something that maximizes their utility function rather than ours.
🟡 ritsuko – this is true! indeed, how we sample future world-states is pretty critical. how about this: first, we'll pass the future distribution into blob location:
🟡 ritsuko – …and inside blob location, whose type changes accordingly, for any γ we'll only sample world-states which have the highest mass in that distribution:
🟡 ritsuko – the intent here is that for any way-to-find-the-blob, we only sample the closest matching world-states in time – which does rely on the future distribution giving higher mass to world-states that are closer in time. and hopefully, the result is that we pick enough instances of the signed answer blobs located shortly after the question blobs in time, that they're mostly dominated by the human user answering them, rather than by AIs appearing later.
🟣 misato – can you disentangle the line where you sample the world-state?
🟡 ritsuko – sure! so, we write an anonymous function – a distribution is a function, after all! – taking a world-state as parameter, and returning its mass. so this is going to be a distribution that is just like the original, except it's only defined for a subset of world-states – those of maximum mass.
🟡 ritsuko – in this case, that subset is defined as such: first, take the set of elements with nonzero mass. then, apply the distribution to all of them, and only keep the elements which have the most mass (there can be multiple, if multiple elements share the same maximum!).
🟡 ritsuko – oh, and i guess one of the variables is redundant now, i'll erase it. remember that this syntax means "sum over the body for all values of the variables for which these constraints hold…", which means we can totally have the value of one variable be bound inside the definition of another like this – it'll just have exactly one value for any pair of the others.
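a minimal sketch of that restriction step (representing distributions as plain dicts, which is an invention of this example):

```python
def restrict_to_max(dist):
    """keep only the elements of a distribution that carry nonzero mass
    and are tied for the maximum -- the 'closest matching world-states'."""
    support = {x: m for x, m in dist.items() if m > 0}
    if not support:
        return {}
    top = max(support.values())
    return {x: m for x, m in support.items() if m == top}

# four candidate world-states; the earliest-in-time ones get the most mass
future = {"w_soon": 0.25, "w_soon2": 0.25, "w_late": 0.05, "w_never": 0.0}
print(restrict_to_max(future))  # -> {'w_soon': 0.25, 'w_soon2': 0.25}
```

note that ties are kept rather than broken arbitrarily, matching ritsuko's remark that multiple elements can share the same maximum mass.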
11. QACI graph
🟢 shinji – why is QACI returning a distribution over answers, rather than picking the single element with the most mass in the distribution?
🟡 ritsuko – that's a good question! in theory, it could do that, but we do want the user to be able to go to the next possible counterfactual answer if the first one isn't satisfactory, and to the one after that if that's still not helpful, and so on. for example: in the piece of math which will interpret the user's final result as a math expression, we want to ignore answers which don't parse or evaluate as proper math of the intended type.
🟢 shinji – so the AI is asking the counterfactual past-user-in-time to come up with a good action-scoring function in… however long a question-answer counterfactual interval is.
🟡 ritsuko – let's say about a week.
🟢 shinji – and this helps… how, again?
🟡 ritsuko – well. first, let's posit an evaluation function, which tries to parse and evaluate a bitstring representing a piece of math (in some pre-established formal language) and returns either:
what it evaluates to, if it is a member of the intended type
the empty set, if it isn't a member of the intended type, or fails to parse or evaluate
🟡 ritsuko – we then take the highest-mass element of the answer distribution for which evaluation returns a value rather than the empty set. we'll also assume for convenience a function which converts any mathematical object into a counterfactual blob. this isn't really allowed, but it's just for the sake of example here.
🟣 misato – okay…
🟡 ritsuko – so, let's say the first call asks the initial question. the user can return any expression as their action-scoring function – they can return a function taking an action and returning some utility measure over it, but they can also return an expression which calls QACI recursively and evaluates to a member of the set of action-scoring functions. they get to call themselves recursively, and make progress in a sort of time-loop where they pass each other notes.
🟣 misato – right, this is the long-reflection process you mentioned. and what about the part where they get a mathematical oracle?
🟡 ritsuko – so, the user can return things like a scoring function whose own definition evaluates arbitrary pieces of math – that's their mathematical oracle.
🟣 misato – huh. that's nifty.
🟢 shinji – what if some weird memetic selection effects happen, or what if in one of the QACI intervals the user randomly gets hit by a truck, and then the whole scheme fails?
🟡 ritsuko – so, the user can set up giant acyclic graphs of calls to themselves, providing a lot of redundancy. that way, if any single node fails to return a coherent output, the next nodes can notice this and keep working with their peers' outputs.
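one toy way to picture that redundancy (the node names, the failure model, and majority-voting are all made up for this sketch; the real scheme composes QACI calls inside math):

```python
from collections import Counter

def run_graph(layers, run_node):
    """run an acyclic graph of QACI calls layer by layer: each node sees
    the outputs of the previous layer and may itself fail (return None);
    the final layer's surviving outputs are combined by majority vote."""
    prev_outputs = []
    for layer in layers:
        prev_outputs = [out for node in layer
                        if (out := run_node(node, prev_outputs)) is not None]
    return Counter(prev_outputs).most_common(1)[0][0]

def run_node(name, peer_outputs):
    if name == "hit_by_truck":
        return None                      # this user-instance failed
    return "plan_v2" if peer_outputs else "plan_v1"

result = run_graph([["a1", "a2", "hit_by_truck"], ["b1", "b2"]], run_node)
print(result)  # -> "plan_v2": the failed node was simply outvoted
```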
🟡 ritsuko – a small graph of QACI calls can be accomplished with something like:
🟡 ritsuko – and you can imagine how this could be extended to a larger structure such as:
12. fictional agents
🟢 shinji – wait a minute; what if the QACI user goes crazy from realizing that they're fictional, inside an AI's hypothesis, rather than in the physical world-state? especially if they have access to a goddamn math oracle.
🟡 ritsuko – ah. uuh. well, don't panic shinji, but, we are fictional. right now.
🟢 shinji starts flailing around.
🟣 misato – what the hell are you talking about?
🟡 ritsuko – misato, come on. we're inside a crazy high-tech compound designed to build and maintain giant robots to fight alien creatures, and we're discussing the entirely unrelated topic of AI alignment. oh, and, uh, there's this:
🔵 rei appears in front of them and waves at the camera.
🟣 misato – …so, why are we talking about alignment, instead of, i don't know, trying to escape?
🟡 ritsuko – misato, this is fiction, you don't escape. instead, you reason about what agents are writing your fiction. based on priors, we should vaguely guess that they're like us, and that the conversation we're having is useful for them to write a fiction about. my guess is, we're a fiction written by someone who's trying to communicate ideas about AI alignment and, for some godforsaken reason i cannot imagine, decided that we should be the characters having this conversation.
🟡 ritsuko – but don't be mistaken: if we are indeed fictional, then the future of this world we're observing barely matters. our parent world likely holds a lot more realityfluid. so we shou–
🟢 shinji, shaking – here comes this "realityfluid" bullshit again!
🟡 ritsuko – anyways, the coherent thing for fictional characters to do is to try to steer their parent world, because it's generally "realer". and generally, this is achieved by cooperating with whatever their parent world seems to be doing, so we should probably just continue to discuss AI alignment.
🟡 ritsuko – and, in a similar manner, there are people who grok these facts sufficiently well that we would expect them to take the correct actions, even upon realizing that they are the fictional consideration of a superintelligence. given the explanation i just gave, i would claim that i could do it myself, for one.
🔵 rei magically erases everyone's memory of this section of the dialogue, and then disappears, out of view of everyone.
13. top-level QACI call
🟢 shinji – so we have QACI, the function which can call itself recursively. what's the top-level, terminal call to it which yields the action-scoring function?
🟡 ritsuko – ah, i think it'd look like:
🟡 ritsuko – where the initial counterfactual blob is some static string, such as the plaintext "please return a good scoring function over actions" encoded in ASCII, and then padded with zeros to be of the size needed for a blob. so, from a question location, this returns a distribution of action-scoring functions.
🟣 misato – so like, the counterfactual user inside the call should be able to return math that calls more QACI, but where do they get the question location and counterfactual question to pass to it?
🟢 shinji – couldn't they just return that whole piece of math themselves?
🟡 ritsuko – ah, that's not gonna work – the chances of erroneous blob locations might accumulate too much if each call does a new question-location sampling; we want something more reliable. an easy solution is to evaluate the text not into a scoring function directly, but into a function over those values, and to pass them to it, so that the user can return a function which receives them and uses them to call QACI.
🟡 ritsuko – actually, while we're at it, we can pass it a whole lot more things it might need…
🟢 shinji – what's going on with the equality check here?
🟡 ritsuko – oh, this is just a trick of how we implement distributions – when measuring the mass of any specific result, we try to evaluate the answer payload into a function, and we only count the location when the result is equal to that function with the useful parameters passed to it.
🟣 misato – and what's wrapped around all of this? where do the remaining variables come from?
🟡 ritsuko – so… remember this?
🟡 ritsuko – this is where we start actually plugging in our various parts. we'll assume some distribution over initial world-states α, and sample question locations in futures of those initial world-states – which will serve, for now, as our hypotheses.
🟡 ritsuko – the actual AI we use will take such a scoring and return an action, so we can just call it, and execute its action guess.
🟣 misato – and… that's it?
🟡 ritsuko – well, no. i mean, the whole fundamental structure is here, but there's still a bunch of work we should do if we want to increase the chances that this produces the outcomes we want.
14. location prior
🟡 ritsuko – so, right now each call penalizes blob locations for being too kolmogorov-complex. we could take advantage of this by encouraging our two different blob locations – the question location and the answer location – to share bits of information, rather than coming up with their own, possibly different bits of information. this increases the chances that the question is located "in a similar way" to the answer.
🟣 misato – what does this mean, concretely?
🟡 ritsuko – well, for example, they could have the same bits of information for how to find bits of memory on a computer's memory on earth, encoded in our physics, and then the two locations would only differ in what computer, what memory range, and what time they find their blobs in.
🟡 ritsuko – for this, we'll define a set of "location priors" being sampled as part of the hypotheses the AI samples over – let's call it Ξ (xi). we might as well posit a simple type for it, like finite bitstrings.
🟡 ritsuko – we'll also define a kolmogorov simplicity measure which can use another piece of information, as, let's see…
🟡 ritsuko – there we go; measuring the simplicity of the pair of the prior and the element favors information being shared between them.
🟣 misato – wait, this fails to normalize now, doesn't it? because not all pairs are sampled – only pairs whose first element is the prior.
🟡 ritsuko – ah, you're right! we can simply normalize this distribution to solve that issue.
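normalizing a finite sub-distribution is a one-step fix (the dict representation is, again, just for this sketch):

```python
def normalize(dist):
    """rescale a sub-distribution so its masses sum to exactly 1,
    leaving the ratios between elements untouched."""
    total = sum(dist.values())
    if total == 0:
        raise ValueError("cannot normalize an empty distribution")
    return {x: m / total for x, m in dist.items()}

# a distribution that sums to 0.5, because only some pairs were sampled
sub = {"pair_a": 0.3, "pair_b": 0.2}
print(normalize(sub))  # -> {'pair_a': 0.6, 'pair_b': 0.4}
```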
🟡 ritsuko – and we'll simply add the prior to the hypotheses, then pass it around to all blob locations:
🟡 ritsuko – finally, we'll use it in blob location as the simplicity measure to sample from:
15. adjusting scores
🟡 ritsuko – here's an issue: currently, we're weighing hypotheses by how hard it is to find both the question and the answer.
🟡 ritsuko – do you think that's wrong?
🟣 misato – i think we should first ask how hard it is to find questions, and then normalize the distribution of answers, so that harder-to-find answers don't penalize hypotheses. the reasoning behind this is that we want QACI graphs to be able to do a lot of complicated things, and that we hope question location is already sufficient to select what we want.
🟡 ritsuko – ah, that makes sense, yeah! thankfully, we can just normalize right around the call to QACI, before applying it to the scoring:
🟢 shinji – what happens if we don't get the blob locations we want, exactly?
🟡 ritsuko – well, it depends. there are two kinds of "blob mislocations": "naive" and "adversarial" ones. naive mislocations are hopefully not a huge deal; considering that we're doing average scoring over all scoring functions weighed by mass, hopefully the "signal" from our aligned scoring functions beats out the "noise" from locations that select the wrong thing at a random place, like "boltzmann blobs".
🟡 ritsuko – adversarial mislocations, however, are tougher. i expect that they mostly result from unfriendly alien superintelligences, as well as earth-born AIs – both unaligned ones and ones that might result from QACI. against those, i hope that inside QACI we come up with some good decision theory that lets us not worry about them.
🟣 misato – actually, didn't someone recently publish some work on a threat-resistant utility-bargaining function, called "Rose"?
🟡 ritsuko – oh, nice! well in that case, if Rose turns a distribution of scoring functions into a single scoring, then we can simply wrap it around all of this:
🟡 ritsuko – note that we're putting the whole thing inside an anonymous function, and assigning the final scoring to the result of applying Rose to that distribution.
16. observations
🟢 shinji – you know, i feel like there ought to be some better ways to select hypotheses that look like our world.
🟡 ritsuko – hmmm. you know, i do feel like if we had some "observation" bitstring μ (mu) which strongly identifies our world, like a whole dump of wikipedia or something, that might help – we could locate it like any other blob. but how do we tie that into the existing set of variables serving as a sampling?
🟣 misato – we could look for the question in futures of the observation world-state– how do we get that world-state again?
🟡 ritsuko – oh, if you've got the observation's location, you can reconstitute the factual observation world-state by inserting the factual observation back with its counterfactual insertion function.
🟣 misato – in that case, we can just do:
🟡 ritsuko – oh, neat! actually, couldn't we generate two blobs and sandwich the question blob between the two?
🟣 misato – let's see here, the second observation can be…
🟣 misato – how do i sample the question's location from both the future of the first observation and the past of the second?
🟡 ritsuko – well, i'm not sure we want to do that. remember that blob location tries to find the very first matching world-state for any γ. instead, how about this:
🟡 ritsuko – it's a bit hacky, but we can simply demand that "the second observation's world-state be in the future of the question's world-state more than the question's world-state is in the future of the first observation's world-state".
🟣 misato – huh. i guess that's… one way to do it.
🟢 shinji – could we encourage the blob location prior to use the bits of information from the observations? something like…
🟡 ritsuko – nope. because then, the location programs can simply return the observations as constants, rather than finding them in the world, which defeats the entire purpose.
🟣 misato – …so, what's in those observations, exactly?
🟡 ritsuko – well, the second observation is mostly just going to be the first with "more, newer content". but the core of it could be a whole lot of stuff. a dump of wikipedia, a callable interface to some LLM, whatever else would let it identify our world.
🟢 shinji – can't we just, like, plug the AI into the internet and let it gain data that way or something?
🟡 ritsuko – so there's, like, obvious security concerns here. but, assuming those were magically fixed, i can see a way to do that: the observation could be a function or mapping rather than a bitstring, and while the AI would observe it as a constant, it could be lazily evaluated. it could even be a fully memoized function – such that the AI can't observe any mutable state – but it would still point to the world. in essence, this would make the AI point to the entire internet as its observation, though of course it would in practice be unable to obtain all of it. but it could navigate it just as if it were a mathematical object.
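a sketch of the "fully memoized function" idea (the `fetch_page` stub and its contents are invented; the point is only that repeated queries can never reveal mutable state):

```python
import functools

def make_observation(fetch):
    """wrap a fetcher so the AI-side view is immutable: the first query
    for a key fixes its value forever, so no mutable state is observable."""
    @functools.lru_cache(maxsize=None)
    def observe(key):
        return fetch(key)
    return observe

calls = []
def fetch_page(url):            # stand-in for 'the internet'
    calls.append(url)
    return f"contents of {url} at snapshot time"

obs = make_observation(fetch_page)
a = obs("wikipedia/Main_Page")
b = obs("wikipedia/Main_Page")  # second query: cached, the world isn't re-read
print(a == b, len(calls))       # -> True 1
```

from the AI's side, `obs` behaves like a (lazily revealed) mathematical constant, even though its values originally came from a live, mutable world.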
š£ misato — interesting. though of course, the security concerns make this probably unviable.
š” ritsuko — hahah. yeah. oh, and we probably want to pass inside :
17. where next
š£ misato — so, is that it then? are we done?
š” ritsuko — hardly! i expect that there's a lot more work to be done. but this is a solid foundation and a direction to explore. it's kind of the only thing that feels like a path to saving the world.
š¢ shinji — you know, the math can seem intimidating at first, but actually it's not that complicated. one can figure out this math, especially if they get to ask questions in real time to the person who invented it.
š” ritsuko — for sure! it should be noted that i'm not particularly qualified at this. my education isn't in math at all — i never really did math seriously before QACI. the only reason i'm making the QACI math is that so far barely anyone else will. but i've seen at least one other person try to learn about it and come to understand it somewhat well.
š¢ shinji — what are some directions you think are worth exploring, for people who want to help improve QACI?
š” ritsuko — oh boy. well, here are some:
find things that are broken about the current math, and ideally help fix them too.
think about utility function bargaining more — notably, perhaps scores are regularized, such as by weighing ratings that are more "extreme" (further away from ) as less probable. alternatively, maybe scoring functions have a finite amount of "votestuff" that they get to distribute amongst all options the way a normalized distribution does, or maybe we implement something kinda like quadratic voting?
think about how to make a lazily evaluated observation viable. i'm not sure about this, but it feels like the kind of direction that might help avoid unaligned alien AIs capturing our locations by bruteforcing blob generation using many-worlds.
generally figure out more ways to ensure that the blob locations match the world-states we want — both by improving and , and by finding more clever ways to use them — you saw how easy it was to add two blob locations for the two observations .
think about turning this scheme into a continuous rather than one-shot AI. (possibly exfohazardous, do not publish)
related to that, think about ways to make the AI aligned not just with regards to its guess, but also with regards to its side-effects, so as to avoid it wanting to exploit its way out. (possibly exfohazardous, do not publish)
alternatively, think about how to box the AI so that the output with regards to which it is aligned is its only meaningful source of world-steering.
one thing we didn't get into much is what could actually be behind , , and . you can read more about those here, but i don't have super strong confidence in the way they're currently put together. in particular, it would be great if someone who groks physics a lot more than me thought about whether many-worlds gives unaligned alien superintelligences the ability to forge any blob or observation we could put together in a way that would capture our AI's blob location.
maybe there are some ways to avoid this by tying the question world-state to the AI's action world-state? maybe implementing embedded agency helps with this? note that blob location can totally locate the AI's action, and use that to produce counterfactual action world-states. maybe that is useful. (possibly exfohazardous, do not publish)
think about and the function (see the full math post) and how to either implement it or achieve a similar effect otherwise. for example, maybe instead of relying on an expensive hash, we can formally define that need to be "consequentialist agents trying to locate the blob in the way we want", rather than any program that works.
think about how to make counterfactual QACI intervals resistant to someone launching unaligned superintelligence within them.
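the bargaining ideas in the list above (regularizing "extreme" ratings, a finite budget of "votestuff", quadratic voting) can be sketched as a toy aggregator. everything here is an illustrative assumption, not the actual QACI scoring math:

```python
import math

def normalize_votestuff(scores):
    """Rescale one scorer's ratings so its squared 'votestuff' sums to 1.

    This caps how much any single scoring function can swing the outcome,
    in the spirit of quadratic voting. Purely illustrative.
    """
    norm = math.sqrt(sum(s * s for s in scores))
    if norm == 0:
        return [0.0 for _ in scores]
    return [s / norm for s in scores]

def aggregate(all_scores):
    """Sum each option's score across scorers, after per-scorer normalization."""
    normalized = [normalize_votestuff(s) for s in all_scores]
    return [sum(col) for col in zip(*normalized)]

# One extreme scorer cannot dominate two mild ones:
voters = [
    [1000.0, 0.0],  # "extreme" rating, squashed by normalization to [1, 0]
    [0.0, 1.0],
    [0.0, 1.0],
]
totals = aggregate(voters)
assert totals[1] > totals[0]  # option 1 wins despite the extreme vote
```

the design choice being illustrated: because each scorer's influence is bounded by its normalized budget, inflating a rating a thousandfold buys no extra steering power.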
š£ misato — ack, i didn't really think of that last one. yeah, that sounds bad.
š” ritsuko — yup. in general, i could also do with people who could help with inner-alignment-to-a-formal-goal, but that's a lot more hazardous to work on. hence why we have not talked about it. but there is work to be done on that front, and people who think they have insights should probably contact us privately and definitely not publish them. interpretability people are doing enough damage to the world as it is.
š¢ shinji — well, things don't look great, but i'm glad this plan is around! i guess it's something.
š” ritsuko — i know right? that's how i feel as well. lol.
š£ misato — lmao, even.
Probably not directly relevant to most of the post, but I think:
Is probably false.
It might be the case that humans are reliably not capable of inventing catastrophic AGI without a certain large minimum amount of compute, experimentation, and researcher thinking time which we have not yet reached. A superintelligence (or smarter humans) could probably get much further much faster, but that's irrelevant in any worlds where higher-intelligence beings don't already exist.
With hindsight and an inside-view look at past trends, you can retro-dict what the past of most timelines in our neighborhood probably looks like, and conclude that most of them have probably not yet destroyed themselves.
It may be that going forward this trend does not continue: I do think most timelines including our own are heading for doom in the near future, and it may be that the history of the surviving ones will be full of increasingly implausible development paths and miraculous coincidences. But I think the past is still easily explained without any weird coincidences if you take a gears-level look at the way SoTA AI systems actually work and how they were developed.
Another potential downside of this approach: it places a lot of constraints on the AI itself, which means it probably has to be strongly superintelligent to start working at all.
I think an important desideratum of any alignment plan is that your AI system starts working gradually, with a "capabilities dial" that you (and the aligned system itself) turn up just enough to save the world, and not more.
Intuitively, I feel like an aligned AGI should look kind of like a friendly superhero, whose superpower is weak superintelligence, superhuman ethics, and a morality which is as close as possible to the coherence-weighted + extrapolated average morality of all currently existing humans (probably not literally; I'm just trying to gesture at a general thing of averaging over collective extrapolated volition / morality / etc.).
Brought into existence, that superhero would then consider two broad classes of strategies:
Solve a bunch of hard alignment problems: embedded agency, stable self-improvement, etc. and then, having solved those, build a successor system to do the actual work.
Directly do some things with biotech / nanotech / computer security / etc. at its current intelligence level to end the acute risk period. Solve remaining problems at its leisure, or just leave them to the humans.
From my own not-even-weakly superhuman vantage point, (2) seems like a much easier and less fraught strategy than (1). If I were a bit smarter, I'd try saving the world without AI or enhancing myself any further than I absolutely needed to.
Faced with the problem that the boxed AI in the QACI scheme is facing... :shrug:. I guess I'd try some self-enhancement followed by solving problems in (1), and then try writing code for a system that does (2) reliably. But it feels like I'd need to be a LOT smarter to even begin making progress.
Provably safely building the first "friendly superhero" might require solving some hard math and philosophical problems, for which QACI might be relevant or at least in the right general neighborhood. But that doesn't mean that the resulting system itself should be doing hard math or exotic philosophy. Here, I think the intuition of more optimistic AI researchers is actually right: an aligned human-ish level AI looks closer to something that is just really friendly and nice and helpful, and also super-smart.
(I haven't seen any plans for building such a system that don't seem totally doomed, but the goal itself still seems much less fraught than targeting strong superintelligence on the first try.)
Provably false, IMO. What makes such AI deadly isn't its consequentialism, but its capability. Any such AI that:
isn't smart enough to consistently and successfully deceive most humans, and
isn't smart enough to improve itself
is containable and ultimately not an existential threat, just like a human consequentialist wouldn't be. We even have an example of this: someone rigged together ChaosGPT, an AutoGPT agent with the explicit goal of destroying humanity, and all it can do is mumble to itself about nuclear weapons. You could argue it's not pursuing its goal coherently enough, but that's exactly the point: it's too dumb. Self-improvement is the truly dangerous threshold. Unfortunately, that's not a very high one (probably somewhere at the upper end of competent human engineers and scientists).
Yes, this is exactly the reason why you shouldn't update on "anthropic evidence" and base your assumptions on it. The example with quantum russian roulette is a bit of a loaded one (pun intended), but here is the general case:
You have a model of reality, and you gather some evidence which seems to contradict this model. Now you can either update your model, or double down on it, claiming that all the evidence is a bunch of outliers.
Updating on anthropics in such a situation is refusing to update your model when it contradicts the evidence. It's adopting an anti-laplacian prior while reasoning about life or death (survival or extinction) situations: going a bit insane specifically in the circumstances with the highest stakes possible.
Still don't buy this "realityfluid" business. Certainly not in the "Born measure is a measure of realness" sense. It's not necessary to conclude that some number is realness just because otherwise your epistemology doesn't work so well. It's not a law that you must find it surprising to find yourself in a branch with high measure, when all branches are equally real. Them all being equally real doesn't contradict observations; it just means the policy of expecting stuff to happen according to Born probabilities gives you high Born measure of knowing about Born measure, not real knowledge about reality.
that's fair, but if "amount of how much this matters" / "amount of how much this is real" is not "amount of how much you expect to observe things", then how could we possibly determine what it is? (see also this)
I think that expecting to observe things according to branch counting instead of Born probabilities is a valid choice. Anything bad that happens if you do it, happens only if you already care about Born measure.
But if the question is "how do you use observations to determine what's real", then: indirectly, by using observations to figure out that QM is true? Not sure if even this makes sense without some preference for high measure, but maybe it does. Maybe by only excluding the possibility of your branch not existing, once you observe it? And valuing the measure of you indirectly knowing about the realness of everything is not incoherent either ¯\_(ツ)_/¯. I'm more in the "advocating for people to figure it out in more detail" stage than having any answers ^^.
I disagree. It's only rational if you already value having high Born measure. Otherwise, what bad thing happens if you expect to observe every quantum outcome with equal probability? It's not that you would be wrong. It's just that the Born measure of you in the state of being wrong will be high. But no one forces you to care about that. And other valuable things, like consciousness, work fine with arbitrarily low measure.
Yeah, but why can't you use uniform density? Or, I don't know, I'm bad at math, maybe something else analogous to branch counting in the discrete case. And you would need to somehow define "you" and other parts of your preferences in terms of continuous space anyway; there is no reason this definition has to involve Born measure.
I'm not against distributions in general. I'm just saying that conditional on MWI there is no uncertainty about quantum outcomes: they all happen.
But that's not what the (interpretation of the) equations say(s). The equations say that all sequences of 0s and 1s exist and you will observe all of them.
They only concord with long-run empirical frequencies in regions of configuration space with high Born measure. They don't concord with, for example, average frequencies across all observers of the experiment.
The point is there is no (quantum-related) uncertainty about the moon being destroyed: it will be destroyed and also will be saved. My actions then should depend on how I count/weight moons across configuration space. And that choice of weights depends on arbitrary preferences. I may as well stop caring about the moon after two days.
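the quantitative gap the thread is arguing over can be made concrete with a toy repeated measurement. everything below is an illustrative example I've added, not something from the post: a two-outcome measurement with Born weights 1/3 and 2/3, repeated 10 times, comparing the Born-weighted average frequency of outcome 1 against the naive branch-counting (uniform) average:

```python
import math
from itertools import product

# Toy qubit: amplitude sqrt(1/3) for outcome 0, sqrt(2/3) for outcome 1.
born = {0: 1 / 3, 1: 2 / 3}  # Born weights |amplitude|^2
n = 10                       # number of repeated measurements

freq_born = 0.0     # average frequency of 1s, weighting branches by Born measure
freq_uniform = 0.0  # average frequency of 1s, counting every branch equally

# Enumerate all 2**n branches of the repeated measurement.
for branch in product([0, 1], repeat=n):
    weight = math.prod(born[b] for b in branch)  # Born weight of this branch
    ones_fraction = sum(branch) / n
    freq_born += weight * ones_fraction
    freq_uniform += ones_fraction / 2 ** n

print(round(freq_born, 3))     # 0.667 -- matches the Born probability 2/3
print(round(freq_uniform, 3))  # 0.5   -- branch counting says half, regardless of amplitudes
```

this is the sense in which the equations "only concord with long-run empirical frequencies in regions with high Born measure": the branch-counting average is 1/2 no matter what the amplitudes are.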
Does the one-shot AI necessarily aim to maximize some function (like the probability of saving the world, or the expected "savedness" of the world or whatever), or can we also imagine a satisficing version of the one-shot AI which "just tries to save the world" with a decent probability, and doesn't aim to do any more, i.e., does not try to maximize that probability or the quality of that saved world etc.?
I'm asking this because
I suspect that we otherwise might still make a mistake in specifying the optimization target and incentivize the one-shot AI to do something that "optimally" saves the world in some way we did not foresee and don't like.
I try to figure out whether your plan would be hindered by switching from an optimization paradigm to a satisficing paradigm right now in order to buy time for your plan to be put into practice :-)
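the maximizer/satisficer distinction being asked about can be sketched in a few lines. this is a hypothetical illustration of the commenter's question, not anything defined in the post; `score` stands in for something like P(world is saved), and the threshold for "a decent probability":

```python
import random

def maximize(options, score):
    """Pick the single highest-scoring action (the behavior the commenter worries about)."""
    return max(options, key=score)

def satisfice(options, score, threshold, rng=None):
    """Pick *any* action whose score clears the threshold, not the best one."""
    rng = rng or random.Random(0)
    good_enough = [o for o in options if score(o) >= threshold]
    return rng.choice(good_enough) if good_enough else None

probs = {"plan_a": 0.55, "plan_b": 0.93, "plan_c": 0.99}
options = list(probs)

assert maximize(options, probs.get) == "plan_c"            # always the extreme optimum
assert probs[satisfice(options, probs.get, 0.9)] >= 0.9    # merely "good enough"
```

the intended contrast: the satisficer's choice among above-threshold plans carries no optimization pressure toward the "optimal" save-the-world outcome, which is exactly the property the question is probing.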
Maybe I am getting hung up on the particular wording, but are we assuming our agent has arbitrary computation power when we say they have a top-down model of the universe? Is this a fair assumption to make, or does this arbitrarily constrain our available actions?
where did you get to in the post? i believe this is addressed afterwards.
Is it?
It seems like this assumption is used later on.
For example, I am a little confused by the reality fluid section, but if it's just the probability an output is real, I feel like we can't just arbitrarily decide it's 1/n^2 (justifying it by Occam's razor doesn't seem very mathematical, and this is counterintuitive to real life). This seems to give our program arbitrary amounts of precision.
Furthermore, associating polynomial computational complexity with this measure of realness and NP with unrealness also seems very odd to me. There are many simple P programs that are incomputable, and NP outputs can correspond with realness. I'm not sure if I'm just wholly misunderstanding this section, but the justification for all this is just odd; we are assuming that because reality exists, it must be computable, essentially?
Intuitively, simulating the universe with a quantum computer seems very hard as well. Don't see why it would be strange for it to be hard. I am not qualified to evaluate that claim, but it seems extraordinary enough to require someone with the background to chime in.
Furthermore, don't really see how you can practically get an Oracle with Turing jumps.
I'm not sure how important this math is for the rest of the section, but it seems like we use this oracle to answer questions.