an Evangelion dialogue explaining the QACI alignment plan

this post was written by Tamsin Leake at Orthogonal.
thanks to Julia Persson and mesaoptimizer for their help putting it together.

no familiarity with the Evangelion anime is required to understand this post, and it pretty much doesn’t contain any spoilers.

this post explains the justification for, and the math formalization of, the QACI plan for formal-goal alignment. you might also be interested in its companion post, formalizing the QACI alignment formal-goal, which just covers the math in a more straightforward, bottom-up manner.

1. agent foundations & anthropics

🟣 misato – hi ritsuko! so, how’s this alignment stuff going?

🟡 ritsuko – well, i think i’ve got an idea, but you’re not going to like it.

🟢 shinji – that’s exciting! what is it?

🟡 ritsuko – so, you know how in the sequences and superintelligence, yudkowsky and bostrom talk about how hard it is to fully formalize something which leads to nice things when maximized by a utility function? so much so that it serves as an exercise to think about one’s values and consistently realize how complex they are?

🟡 ritsuko – ah, yes, the good old days when we believed this was the single obstacle to alignment.

🔴 asuka barges into the room and exclaims – hey, check this out! i found this fancy new theory on lesswrong about how “shards of value” emerge in neural networks!

🔴 asuka then walks away while muttering something about eiffel towers in rome and waluigi hyperstition…

🟡 ritsuko – indeed. these days, all these excited kids running around didn’t learn about AI safety by thinking really hard about what agentic AIs would do – they got here by being spooked by large language models, and as a result they’re thinking in all kinds of strange directions, like what it means for a language model to be aligned or how to locate natural abstractions for human values in neural networks.

🟢 shinji – of course that’s what we’re looking at! look around you, turns out that the shape of intelligence is RLHF’d language models, not agentic consequentialists! why are you still interested in those old ideas?

🟡 ritsuko – the problem, shinji, is that we can’t observe agentic AI being published before alignment is solved. when someone figures out how to make AI consequentialistically pursue a coherent goal, whether by using current ML technology or by building a new kind of thing, we die shortly after they publish it.

🟣 misato – wait, isn’t that anthropics? i’d rather stay away from that type of thinking, it seems too galaxybrained to reason about…

🟡 ritsuko – you can’t really do that either – the “back to square one” interpretation of anthropics, where you don’t update at all, is still an interpretation of anthropics. it’s kind of like being the kind of person who, when observing having survived quantum russian roulette 20 times in a row, assumes that the gun is broken rather than saying “i guess i might have low quantum amplitude now” and fails to realize that the gun can still kill them – which is bad when all of our hopes and dreams rest on those assumptions. the only vaguely anthropics-ignoring perspective one can take about this is to ignore empirical evidence and stick to inside view, gears-level prediction of how convergent agentic AI tech is.

🟣 misato – …is it?

🟡 ritsuko – of course it is! on inside view, all the usual MIRI arguments hold just fine. it just so happens that if you keep running a world forwards, and select only for worlds that we haven’t died in, then you’ll start observing stranger and stranger non-consequentialist AI. you’ll start observing the kind of tech we get when we just dumbly scale up bruteforce-ish methods like machine learning, and you’ll observe somehow nobody publishing insights as to how to make those systems agentic or consequentialistic.

🟢 shinji – that’s kind of frightening!

🟡 ritsuko – well, it’s where we are. we already thought we were small in space, now we know that we’re also small in probabilityspace. the important part is that it doesn’t particularly change what we should do – we should still try to save the world, in the most straightforward fashion possible.

🟣 misato – so all the excited kids running around saying we have to figure out how to align language models or whatever…

🟡 ritsuko – they’re chasing a chimera. impressive LLMs are not what we observe because they’re what powerful AI looks like – they’re what we observe because they’re what powerful AI doesn’t look like. they’re there because that’s as impressive as you can get short of something that kills everyone.

🟣 misato – i’m not sure most timelines are dead yet, though.

🟡 ritsuko – we don’t know if “most” timelines are alive or dead from agentic AI, but we know that however many are dead, we couldn’t have known about them. if every AI winter was actually a bunch of timelines dying, we wouldn’t know.

🟣 misato – you know, this doesn’t necessarily seem so bad. considering that confused alignment people are what caused the appearance of the three organizations trying to kill everyone as fast as possible, maybe it’s better that alignment research seems distracted with things that aren’t as relevant, rather than figuring out agentic AI.

🟡 ritsuko – you can say that alright! there are already enough capability hazards being carelessly published everywhere as it is, including on lesswrong. if people were looking in the direction of the kind of consequentialist AI that actually determines the future, this could cause a lot of damage. good thing there are a few very careful people here and there, studying the right thing, but being very careful by not publishing any insights. but this is indeed the kind of AI we need to figure out if we are to save the world.

🟢 shinji – whatever kind of anthropic shenanigans are at play here, they sure seem to be saving our skin! maybe we’ll be fine because of quantum immortality or something?

🟣 misato – that’s not how things work shinji. quantum immortality explains how you got here, but doesn’t help you save the future.

🟢 shinji sighs, with a defeated look on his face – …so we’re back to the good old MIRI alignment, we have to perfectly specify human values as a utility function and figure out how to align AI to it? this seems impossible!

🟡 ritsuko – well, that’s where things get interesting! now that we’re talking about coherent agents whose actions we can reason about, agents whose instrumentally convergent goals such as goal-content integrity would be beneficial if they were aligned, agents who won’t mysteriously turn bad eventually because they’re not yet coherent agents, we can actually get to work putting something together.

🟣 misato – …and that’s what you’ve been doing?

🟡 ritsuko – well, that’s kind of what agent foundations had been about all along, and what got rediscovered elsewhere as “formal-goal alignment”: designing an aligned coherent goal and figuring out how to make an AI that is aligned to maximizing it.

2. embedded agency & untractability

🟢 shinji – so what’s your idea? i sure could use some hope right now, though i have no idea what an aligned utility function would even look like. i’m not even sure what kind of type signature it would have!

🟡 ritsuko smirks – so, the first important thing to realize is that the challenge of designing an AI that emits output which saves the world can be formulated like this: design an AI trying to solve a mathematical problem, and make the mathematical problem be analogous enough to “what kind of output would save the world” that the AI, by solving it, happens to also save our world.

🟢 shinji – but what does that actually look like?

🟣 misato – maybe it looks like “what output should you emit, which would cause your predicted sequence of stimuli to look like a nice world?”

🟡 ritsuko – what do you think actually happens if an AI were to succeed at this?

🟣 misato – oh, i guess it would hack its stimuli input, huh. is there even a way around this problem?

🟡 ritsuko – what you’re facing is a facet of the problem of embedded agency. you must make an AI which thinks about the world which contains it, not just about a system that it feels like it is interacting with.

🟡 ritsuko – the answer – as in PreDCA – is to model the world from the top-down, and ask: “look into this giant universe. you’re in there somewhere. which action should the you-in-there-somewhere take, for this world to have the most expected utility?”

🟢 shinji – expected utility? by what utility function?

🟡 ritsuko – we’re coming to it, shinji. there are three components to this: the formal-goal-maximizing AI, the formal-goal, and the glue in-between. embedded agency and decision theory are parts of this glue, and they’re core to how we think about the whole problem.

🟣 misato – and this top-down view works? how the hell would it compute the whole universe? isn’t that uncomputable?

🟡 ritsuko – how the hell do you expect AI would have done expected utility maximization at all? by making reasonable guesses. i can’t compute the whole universe from the big-bang up to you right now, but if you give me a bunch of math which i’d understand to say “in worlds being computed forwards starting at some simple initial state and eventually leading to this room right now with shinji, misato, ritsuko in it, what is shinji more likely to be thinking about: his dad, or the pope’s uncle?”, then i can still make a reasonable guess.

🟡 ritsuko – on the one hand, the question is immensely computationally expensive – it asks to compute the entire history of the universe up to this shinji! but on the other hand, it is talking about a world which we inhabit, and about which we have the ability to make reasonable guesses. if we build an AI that is smarter than us, you can bet it’ll be able to make guesses at least as well as this.

🟣 misato – i’m not convinced. after all, we relied on humans to make this guess! of course you can guess about shinji, you’re a human like him. why would the AI be able to make those guesses, being the alien thing that it is?

🟡 ritsuko – i mean, one of its options is to ask humans around. it’s not like it has to do everything by itself on its single computer, here – we’re talking about the kind of AI that agentically saves the world, and has access to all kinds of computational resources, including humans if needed. i don’t think it’ll actually need to rely on human compute a lot, but the fact that it can serves as a kind of existence proof for its ability to produce reasonable solutions to these problems. not optimal solutions, but reasonable solutions – eventually, solutions that will be much better than any human or collection of humans could come up with short of getting help from aligned superintelligence.

🟢 shinji – but what if the worlds that are actually described by such math are not in fact this world, but strange alien worlds that look nothing like ours?

🟡 ritsuko – yes, this is also part of the problem. but let’s not keep moving the goalpost here. there are two problems: make the formal problem point to the right thing (the right shinji in the right world), and make an AI that is good at finding solutions to that problem. both seem like we can solve them with some confidence; but we can’t just keep switching back and forth between the two.

🟡 ritsuko – if you have to solve two problems A and B, then you have to solve A assuming B is solved, and then solve B assuming A is solved. then, you’ve got a pair of solutions which work with one another. here, we’re solving the problem of whether an AI would be able to solve this problem, assuming the problem points to the right thing; later we’ll talk about how to make the problem point to the right thing assuming we have an AI that can solve it.

🟢 shinji – are there any actual implementation ideas for how to build such a problem-solving AI? it sure sounds difficult to me!

🟣 misato, carefully peeking into the next room – hold on. i’m not actually quite sure who’s listening – it is known that capabilities people like to lurk around here.

🟤 kaji can be seen standing against a wall, whistling, pretending not to hear anything.

🟡 ritsuko – right. one thing i will reiterate, is that we should not observe a published solution to “how to get powerful problem-solving AI” before the world is saved. this is in the class of problems where we die shortly after a solution is found and published, so our lack of observing such a solution is not much evidence for its difficulty.

3. one-shot AI

🟡 ritsuko – anyways, to come back to embedded agency.

🟣 misato – ah, i had a question. the AI returns a first action which it believes would overall steer the world in a direction that maximizes its expected utility. and then what? how does it get its observation, update its model, and take the next action?

🟡 ritsuko – well, there are a variety of clever schemes to do this, but an easy one is to just not.

🟣 misato – what?

🟡 ritsuko – to just not do anything after the first action. i think the simplest thing to build is what i call a “one-shot AI”, which halts after returning an action. and then we just run the action.

🟢 shinji – “run the action?”

🟡 ritsuko – sure. we can decide in advance that the action will be a linux command to be executed, for example. the scheme does not really matter, so long as the AI gets an output channel which has pretty easy bits of steering the world.

🟣 misato – hold on, hold on. a single action? what do you intend for the AI to do, output a really good pivotal act and then hope things get better?

🟡 ritsuko – have a little more imagination! our AI – let’s call it AI₀ – will almost certainly return a single action that builds and then launches another, better AI, which we’ll call AI₁. a powerful AI can absolutely do this, especially if it has the ability to read its own source-code for inspiration, but probably even without that.

🟡 ritsuko – …and because it’s solving the problem “what action would maximize utility when inserted into this world”, it will understand that AI₁ needs to have embedded agency and the various other aspects that are instrumental to it – goal-content integrity, robustly delegating RSI, and so on.

🟢 shinji – “RSI”? what’s that?

🟣 misato sighs – you know, it keeps surprising me how many youths don’t know about the acronym RSI, which stands for Recursive Self-Improvement. it’s pretty indicative of how little they’re thinking about it.

🟢 shinji – i mean, of course! recursive self-improvement is an obsolete old MIRI idea that doesn’t apply to the AIs we have today.

🟣 misato – right, kids like you got into alignment by being spooked by chatbots. (what silly things do they even teach you in class these days?)

🟣 misato – you have to realize that the generation before you, the generation of ritsuko and i, didn’t have the empirical evidence that AI was gonna be impressive. we started on something like the empty string, or at least coherent arguments where we had to actually build a gears-level inside-view understanding of what AI would be like, and what it would be capable of.

🟣 misato – to me, one of the core arguments that sold me on the importance of AI and alignment was recursive self-improvement – the idea that AI being better than humans at designing AI would be a very special, very critical point in time, downstream of which AI would be able to beat humans at everything.

🟢 shinji – but this turned out irrelevant, because AI is getting better than humans without RSI–

🟡 ritsuko – again, false. we can only observe AI getting better than humans at intellectual tasks without RSI, because when RSI is discovered and published, we die very shortly thereafter. you have a sort of consistent survivorship bias, where you keep thinking of a whole class of things as irrelevant because they don’t seem impactful, when in reality they’re the most impactful; they’re so impactful that when they happen you die and are unable to observe them.

4. action scoring

🟣 misato – so, i think i have a vague idea of what you’re saying, now. top-down view of the universe, which is untractable but that’s fine apparently, thanks to some mysterious capabilities; one-shot AI to get around various embedded agency difficulties. what’s the actual utility function to align to, now? i’m really curious. i imagine a utility function assigns a value between 0 and 1 to any, uh, entire world? world-history? multiverse?

🟡 ritsuko – it assigns a value between 0 and 1 to any distribution of worlds, which is general enough to cover all three of those cases. but let’s not get there yet; remember how the thing we’re doing is untractable, and we’re relying on an AI that can make guesses about it anyways? we’re gonna rely on that fact a whole lot more.

🟣 misato – oh boy.

🟡 ritsuko – so, first: we’re not passing a utility function. we’re passing a math expression describing an “action-scoring function” – that is to say, a function attributing scores to actions rather than to distributions over worlds. we’ll make the program deterministic and make it ignore all input, such that the AI has no ability to steer its result – its true result is fully predetermined, and the AI has no ability to hijack that true result.

🟣 misato – wait, “hijack it”? aren’t we assuming an inner-aligned AI, here?

🟡 ritsuko – i don’t like this term, “inner-aligned”; just like “AGI”, people use it to mean too many different and unclear things. we’re assuming an AI which does its best to pick an answer to a math problem. that’s it.

🟡 ritsuko – we don’t make an AI which tries to not be harmful with regards to its side-channels, such as hardware attacks – except for its output, it needs to be strongly boxed, such that it can’t destroy our world by manipulating software or hardware vulnerabilities. similarly, we don’t make an AI which tries to output a solution we like, it tries to output a solution which the math would score high. narrowing what we want the AI to do greatly helps us build the right thing, but it does add constraints to our work.

🟡 ritsuko starts scribbling on a piece of paper on her desk – let’s write down some actual math here. let’s call $\Omega$ the set of world-states, $\Delta\Omega$ distributions over world-states, and $A$ the set of actions.

🟢 shinji – what are the types of all of those?

🟡 ritsuko – let’s not worry about that, for now. all we need to assume for the moment is that those sets are countable. we could define both $\Omega := \mathbb{B}^*$ and $A := \mathbb{B}^*$ – define them both as the set of finite bitstrings – and this would functionally capture all we need. as for distributions over world-states $\Delta\Omega$, we’ll define $\Delta X := \{ f \mid f \in X \to [0, 1],\ \sum^{x}_{x \in X} f(x) \leq 1 \}$ for any countable set $X$, and we’ll call “mass” the number which a distribution associates to any element.

🟣 misato – woah, woah, hold on, i haven’t looked at math in a while. what do all those squiggles mean?

🟡 ritsuko – $\Delta X$ is defined as the set of functions $f$, which take an $X$ and return a number between $0$ and $1$, such that if you take the $f$ of all $x$’s in $X$ and add those up, you get a number not greater than $1$. note that i use a notation of sums $\sum$ where the variables being iterated over are above the $\sum$ and the constraints that must hold are below it – so this sum adds up all of the $f(x)$ for each $x$ such that $x \in X$.

🟣 misato – um, sure. i mean, i’m not quite sure what this represents yet, but i guess i get it.

🟡 ritsuko – the set of distributions over $X$ is basically like saying “for any finite amounts of mass less than 1, what are some ways to distribute that mass among some or all of the $X$’s?” each of those ways is a distribution; each of those ways is an $f$ in $\Delta X$.
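to make the definition of distributions concrete, here is a minimal python sketch. it assumes a distribution is stored as a dict with finite support (any element not listed has mass 0), and the name `is_distribution` is made up for this illustration rather than taken from the QACI math.

```python
from fractions import Fraction


def is_distribution(f):
    """Return True if `f` (a dict from elements to mass) is a sub-normalized distribution.

    Assumption for this sketch: elements missing from the dict have mass 0.
    """
    masses = list(f.values())
    # every individual mass must lie in [0, 1]
    if any(m < 0 or m > 1 for m in masses):
        return False
    # the total mass must not exceed 1, but it is allowed to be less than 1
    return sum(masses) <= 1


# world-states modeled as finite bitstrings
half_mass = {"0": Fraction(1, 4), "01": Fraction(1, 4)}
too_much = {"0": Fraction(3, 4), "1": Fraction(1, 2)}

print(is_distribution(half_mass))  # total mass 1/2, which is allowed
print(is_distribution(too_much))   # total mass 5/4, which is rejected
```

note that the total is allowed to fall short of 1, which is what “not greater than 1” permits.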

🟡 ritsuko – anyways. the AI will take as input an untractable math expression of type $A \to [0, 1]$, and return a single $A$. note that we’re in math here, so “is of type” and “is in set” are really the same thing; we’ll use $\in$ to denote both set membership and type membership, because they’re the same concept. for example, $A \to [0, 1]$ is the set of all functions taking as input an $A$ and returning a $[0, 1]$ – returning a real number between $0$ and $1$.

🟢 shinji – hold on, a real number?

🟡 ritsuko – well, a real number, but we’re passing to the AI a discrete piece of math which will only ever describe countable sets, so we’ll only ever describe countably many of those real numbers. infinitely many, but countably infinitely many.

🟣 misato – so the AI has type $(A \to [0, 1]) \to A$, and we pass it an action-scoring function of type $A \to [0, 1]$ to get an action. checks out. where do utility functions come in?
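as a rough illustration of that type signature, here is a toy python sketch. everything in it is an assumption made for the example: `one_shot_ai` simply takes the best of an explicit list of candidates, whereas the AI in the dialogue would have to make guesses about an untractable expression, and `toy_score` is a hypothetical stand-in for the real scoring math.

```python
from typing import Callable, Iterable

Action = str  # assumption: actions are finite bitstrings, written as text
ScoringFunction = Callable[[Action], float]  # stands in for the type A -> [0, 1]


def one_shot_ai(score: ScoringFunction, candidates: Iterable[Action]) -> Action:
    """Return the highest-scoring candidate action, then halt.

    The explicit `candidates` argument is a simplification for this sketch:
    it replaces whatever search or guessing a real system would perform.
    """
    return max(candidates, key=score)


def toy_score(action: Action) -> float:
    # hypothetical scoring function: the fraction of bits that are 1.
    # it is deterministic and ignores everything except the action itself.
    return action.count("1") / len(action) if action else 0.0


best = one_shot_ai(toy_score, ["0", "01", "110", "1000"])
print(best)
```

the point of the sketch is only the shape: a scoring function goes in, a single action comes out, and nothing happens afterwards.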

🟡 ritsuko – they don’t need to come in at all, actually! we’ll be defining a piece of math which describes the world for the purpose of pointing at the humans who will decide on a scoring function, but the scoring function will only be over actions the AI should take.

🟡 ritsuko – the AI doesn’t need to know that its math points to the world it’s in; and in fact, conceptually, it isn’t told this at all. on a fundamental, conceptual level, it is not being told to care about the world it’s in – if it could, it would take over our world and kill everyone in it to acquire as much compute as possible, and plausibly along the way drop an anvil on its own head because it doesn’t have embedded agency with regards to the world around itself.

🟡 ritsuko – we will just very carefully box it such that its only meaningful output into our world, the only bits of steering it can predictably use, are those of the action it outputs. and we will also have very carefully designed it such that the only thing it ultimately cares about is that that output have as high of an expected scoring as possible – it will care about this intrinsically, and nothing else intrinsically, such that doing that will be more important than hijacking our world through that output.

🟡 ritsuko – this meaning of “inner-alignment” is still hard to accomplish, but it is much better defined, much narrower, and thus hopefully much easier to accomplish than the “full” embedded-from-the-start alignments which very slow, very careful corrigibility-based AI alignment would result in.

5. early math & realityfluid

🟣 misato – so what does that scoring function actually look like?

🟡 ritsuko – you know what, i hadn’t started mathematizing my alignment idea yet; this might be a good occasion to get started on that!

🟡 ritsuko wheels in a whiteboard – so, what i expect is that the order in which we’re gonna go over the math is going to be the opposite order to that of the final math report on QACI. here, we’ll explore things from the top-down, filling in details as we go – whereas the report will go from the bottom-up, fully defining constructs and then using them.

🟡 ritsuko – this is roughly what we’ll be doing here. go over all hypotheses $h$ the AI could have within some set of hypotheses, called $\mathrm{Hypothesis}$; measure their $\mathrm{Prior}$ probability, the likelihood that they correspond to our world, and how good the actions are in them. this is the general shape of expected scoring for actions.
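the whiteboard formula itself did not survive in this copy of the text. as a hedged sketch of the shape being described, and not a quotation of the QACI math, it could look like the following, where the function names $\mathrm{Prior}$, $\mathrm{LooksLikeThisWorld}$ and $\mathrm{HowGood}$ are labels assumed for this illustration:

```latex
\mathrm{Score}(a) := \sum_{h \in \mathrm{Hypothesis}}
    \mathrm{Prior}(h) \cdot \mathrm{LooksLikeThisWorld}(h) \cdot \mathrm{HowGood}(a, h)
```

each term weighs how good action $a$ is under hypothesis $h$ by how probable $h$ is a priori and by how well $h$ matches our world.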

🟢 shinji – wait, the set of hypotheses is called $\mathrm{Hypothesis}$, not $\mathrm{Hypotheses}$? that’s a bit confusing.

🟡 ritsuko – this is pretty standard in math, shinji. the reason to call the set of hypotheses $\mathrm{Hypothesis}$ is because, as explained before, sets are also types, and so $\mathrm{Prior}$ will be of type $\mathrm{Hypothesis} \to [0, 1]$ rather than $\mathrm{Hypotheses} \to [0, 1]$.

🟣 misato – what’s in a $\mathrm{Hypothesis}$, exactly?

🟡 ritsuko – the set of all relevant beliefs about things. or rather, the set of all relevant beliefs except for logical facts. logical uncertainty will be a thing on the AI’s side, not in the math – this math lives in the realm of “platonic perfect true math”, and the AI will have beliefs about what its various parts tend to result in as one kind of logical belief, just like it’ll have beliefs about other logical facts.

🟣 misato – so, a mathematical object representing empirical beliefs?

🟡 ritsuko – i would rather put it as a pair of: beliefs about what’s real (“realityfluid” beliefs); and beliefs about where, in the set of real things, the AI is (“indexical” beliefs). but this can be simplified by allocating realityfluid across all mathematical/computational worlds (this is equivalent to assuming the tegmark level 4 multiverse is real, and can be done by assuming the cosmos to be a “universal complete” program running all computations) and then all beliefs are indexical. these two possibilities work out to pretty much the same math, anyways.

🟢 shinji – what the hell is “realityfluid”???

🟡 ritsuko – it’s a very long story, i’m afraid.

🟣 misato – think of it as a measure of how some constant amount of “matteringness”/“realness” – typically 1 unit of it – is distributed across possibilities. even though it kinda mechanistically works like probability mass, it’s “in the other direction”: it represents what’s actually real, rather than representing what we believe.

🟢 shinji – why would it sum to 1? what if there’s an infinite amount of stuff out there?

🟣 misato – your realityfluid still needs to sum up to some constant. if you allocate an infinite amount of matteringness, things break and don’t make sense.

🟡 ritsuko – indeed. this is why the most straightforward way to allocate realityfluid is to just imagine that the set of all that exists is a universal program whose computation is cut into time-steps each doing a constant amount of work, and then allocate some diminishing quantities of realityfluid to each time step.

🟣 misato – like saying that compute step number $n \geq 1$ has $\frac{1}{2^n}$ realityfluid?

🟡 ritsuko – that would indeed normalize, but it diminishes exponentially fast. this makes world-states exponentially unlikely in the amount of compute they exist after; and there are philosophical reasons to say that exponential unlikelyness is what should count as non-existing.

🟢 shinji – what the hell are you talking about??

🟡 ritsuko hands shinji a paper called “Why Philosophers Should Care About Computational Complexity” – look, this is a whole other tangent, but basically, polynomial amounts of computation correspond to “doing something”, whereas exponential amounts of computation correspond to “magically obtaining something out of the ether”, and this sort-of ramifies naturally across the rest of computational complexity applied to metaphysics and philosophy.

🟡 ritsuko – so instead, we can say that computation step number $n \geq 1$ has $\frac{1}{n^2}$ realityfluid. this only diminishes quadratically, which is satisfactory.
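to get a feel for the difference between the two allocations, here is a small python sketch comparing them numerically. the function names are made up for this illustration. one fact it relies on: the sum of $\frac{1}{n^2}$ over all $n \geq 1$ converges to $\frac{\pi^2}{6}$ rather than to 1, so “normalize” here means “adds up to a finite constant”, and dividing by that constant gives shares of one unit.

```python
import math


def exponential_weight(n):
    # realityfluid of step n under the 1 / 2^n allocation
    return 0.5 ** n


def quadratic_weight(n):
    # realityfluid of step n under the 1 / n^2 allocation
    return 1 / n ** 2


def partial_sum(weight, steps):
    # total realityfluid given to steps 1..steps
    return sum(weight(n) for n in range(1, steps + 1))


# both partial sums stay bounded as the number of steps grows
for steps in (10, 100, 1000):
    print(steps, partial_sum(exponential_weight, steps), partial_sum(quadratic_weight, steps))

# share of the total that step 100 receives under each allocation
basel_limit = math.pi ** 2 / 6  # limit of the 1 / n^2 series
print(exponential_weight(100))              # the 1 / 2^n series already sums to 1
print(quadratic_weight(100) / basel_limit)  # rescaled so the total is 1
```

under the exponential allocation, step 100 gets a vanishingly small share; under the quadratic one, its share is small but not negligible.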

🟡 ritsuko – oh, and for the same reason, the universal program needs to be quantum (for example, it needs to be an equivalent of the classical universal program but for quantum computation, implemented on something like a quantum turing machine). otherwise, unless BQP=BPP, quantum multiverses like ours might be exponentially expensive to compute, which would be strange.

🟢 shinji – why $\frac{1}{n^2}$? why not $\frac{1}{n^k}$ for some other exponent $k > 1$?

🟡 ritsuko – those do indeed all normalize – but we pick $\frac{1}{n^2}$ because at some point you just have to pick something, and $2$ is a natural, occam/solomonoff-simple number which works. look, just–

🟢 shinji – and why are we assuming the universe is made of discrete computation anyways? isn’t stuff made of real numbers?

🟡 ritsuko sighs – look, this is what the church-turing-deutsch principle is about. for any universe made up of real numbers, you can approximate it thusly:

  • compute 1 step of it with every number truncated to its first 1 binary digit of precision

  • compute 1 step of it with every number truncated to its first 2 binary digits of precision

in other words, run it for 1 time step with 1 bit of precision, then 2 time steps with 2 bits of precision, then 3 with 3, and so on. for any piece of branch-spacetime which is only finitely far away from the start of its universe, there exists a threshold at which it starts being computed in a way that is indistinguishable from the version with real numbers.
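the schedule above can be sketched in python. this is a toy illustration under loud assumptions: the “universe” is a single number evolving by the made-up rule `update`, exact rational arithmetic stands in for real numbers, and all function names are hypothetical.

```python
from fractions import Fraction
from math import floor


def truncate(x, bits):
    # keep only the first `bits` binary digits after the point (x is non-negative here)
    scale = 2 ** bits
    return Fraction(floor(x * scale), scale)


def update(x):
    # hypothetical toy "physics": one real-valued quantity evolving over time
    return x / 3 + Fraction(1, 5)


def run(initial, steps, bits):
    # simulate `steps` time steps, truncating every number to `bits` bits
    x = truncate(initial, bits)
    for _ in range(steps):
        x = truncate(update(x), bits)
    return x


def exact(initial, steps):
    # the same evolution without any truncation
    x = initial
    for _ in range(steps):
        x = update(x)
    return x


start = Fraction(1, 3)
# round k runs k time steps at k bits of precision; the last column is the
# error at the fixed time step 3 when computed with k bits
for k in range(1, 13):
    error_at_step_3 = abs(run(start, 3, k) - exact(start, 3))
    print(k, run(start, k, k), float(error_at_step_3))
```

for this particular contracting rule, the error at any fixed time step shrinks as the precision grows, which is the sense in which later rounds become hard to tell apart from the real-valued version.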

🟢 shinji – but they’re only an approximation of us! they’re not the real thing!

🟡 ritsuko sighs – you don’t know that. you could be the approximation, and you would be unable to tell. and so, we can work without uncountable sets of real numbers, since they’re unnecessary to explain observations, and thus an unnecessary assumption to hold about reality.

🟢 shinji, frustrated – i guess. it still seems pretty contrived to me.

🟡 ritsuko – what else are you going to do? you’re expressing things in math, which is made of discrete expressions and will only ever express countable quantities of stuff. there is no uncountableness to grab at and use.

🟣 misato – actually, can’t we introduce turing jumps/halting oracles into this universal program? i heard that this lets us actually compute real numbers.

🟡 ritsuko – there’s kind-of-a-sense in which that’s true. we could say that the universal program has access to a first-degree halting oracle, or a 20th-degree; or maybe it runs for 1 step with a 1st degree halting oracle, then 2 steps with a 2nd degree halting oracle, then 3 with 3, and so on.

🟡 ritsuko – your program is now capable, at any time step, of computing an infinite amount of stuff. let’s say one of those steps happens to run an entire universe of stuff, including a copy of us. how do you sub-allocate realityfluid? how much do we expect to be in there? you could allocate sub-compute-steps – with a 1st degree halting oracle executing at step $n \geq 1$, you allocate $\frac{1}{n^2 m^2}$ realityfluid to each of the infinite sub-steps $m \geq 1$ in the call to the halting-oracle. you’re just doing discrete realityfluid allocation again, except now some of the realityfluid in your universe is allocated to people who have obtained results from a halting oracle.

🟡 ritsuko – this works, but what does it get you? assuming halting oracles is kind of a very strange thing to do, and regular computation with no halting oracles is already sufficient to explain this universe. so we don’t. but sure, we could.

🟢 shinji ruminates, unsure where to go from there.

🟣 misato interrupts – hey, do we really need to cover this? let’s say you found out that this whole view of things is wrong. could you fix your math then, to whatever is the correct thing?

🟡 ritsuko waves around – what?? what do you mean if it’s wrong?? i’m not rejecting the premise that i might be wrong here, but like, my answer here depends a lot on in what way i’m wrong and what is the better / more likely correct thing. so, i don’t know how to answer that question.

🟣 misato snaps shinji back to attention – that’s fair enough, i guess. well, let’s get back on track.

6. precursor assistance

🟡 ritsuko – so, one insight i got for my alignment idea came from PreDCA, which stands for Precursor Detection, Classification, and Assistance. it consists of mathematizations for:

  • the AI locating itself within possibilities

  • locating the high-agenticness-thing which had lots of causation-bits onto itself – call it the “Precursor”. this is supposed to find the human user who built/launched the AI. (Detection)

  • bunch of criteria to ensure that the precursor is the intended human user and not something else (Classification)

  • extrapolating that precursor’s utility function, and maximizing it (Assistance)

🟣 misato – what the hell kind of math would accomplish that?

🟡 ritsuko – well, it’s not entirely clear to me. some of it is explained, other parts seem like they’re expected to just work naturally. in any case, this isn’t so important – the “Learning Theoretic Agenda” into which PreDCA fits is not fundamentally similar to mine, and i do not expect it to be the kind of thing that saves us in time. as far as i predict, that agenda has purchased most of the dignity points it will have cashed out when alignment is solved, when it inspired my own ideas.

🟢 shinji — and your agenda saves us in time?

🟡 ritsuko — a lot more likely so, yes! for one, i am not trying to build an entire theory of intelligence and machine learning, and i’m not trying to develop an elegant new form of bayesianism whose model of the world has concerning philosophical ramifications which, while admittedly possibly only temporary, make me concerned about the coherency of the whole edifice. what i am trying to do, is hack together the minimum viable world-saving machine about which we’d have enough confidence that launching it is better expected value than not launching it.

🟡 ritsuko — anyways, the important thing is that that idea made me think “hey, what else could we do to make even more sure the selected precursor is the human user we want, and not something else like a nearby fly or the process of evolution?” and then i started to think of some clever schemes for locating the AI in a top-down view of the world, without having to decode physics ourselves, but rather by somehow pointing to the user “through” physics.

🟣 misato — what does that mean, exactly?

🟡 ritsuko — well, remember how PreDCA points to the user from-the-top-down? the way it tries to locate the user is by looking for patterns, in the giant computation of the universe, which satisfy these criteria. this fits in the general notion of generalized computation interpretability, which is fundamentally needed to care about the world because you want to detect not just simulated moral patients, but arbitrarily complexly simulated moral patients. so, you need this anyways, and it is what “looking inside the world to find stuff, no matter how it’s encoded” looks like.

🟣 misato — and what sort of patterns are we looking for? what are the types here?

🟡 ritsuko — as far as i understand, PreDCA looks for programs, or computations, which take some input and return a policy. my own idea is to locate something less abstract, about which we can actually have information-theoretic guarantees: bitstrings.

🟣 misato — …just raw bitstrings?

🟡 ritsuko — that’s right. the idea here is kinda like doing an incantation, except the incantation we’re locating is a very large piece of data which is unlikely to be replicated outside of this world. imagine generating a very large (several gigabytes) file, and then asking the AI “look for pieces of information, in the set of all computations, which look like that pattern.” we call such bitstrings “blobs”: they serve as anchors to find our world and our location within it, in the set of possible world-states and locations-within-them.

7. blob location

🟡 ritsuko — for example, let’s say the universe is a conway’s game of life. then, the AI could have a set of hypotheses as programs which take as input the entire state of the conway’s game of life grid at any instant, and return a bitstring which must be equal to the blob.

🟡 ritsuko — first, we define $\Omega := \{\omega \mid \omega \in \mathcal{P}(\mathbb{Z}^2), \#\omega \in \mathbb{N}\}$ (uppercase omega, a set of lowercase omega) as the set of “world-states” — states of the grid, defined as the set of cell positions whose cell is alive.

🟢 shinji — what’s $\mathcal{P}(\mathbb{Z}^2)$ and $\#\omega$?

🟡 ritsuko — $\mathbb{Z}^2$ is the set of pairs whose elements are both a member of $\mathbb{Z}$, the set of relative integers. so $\mathbb{Z}^2$ is the set of pairs of relative integers — that is, grid coordinates. then, $\mathcal{P}(\mathbb{Z}^2)$ is the set of subsets of $\mathbb{Z}^2$. finally, $\#\omega$ is the size of set $\omega$ — requiring that $\#\omega \in \mathbb{N}$ is akin to requiring that $\omega$ is a finite set, rather than infinite. let’s also define:

  • $\mathbb{B}$ as the set of booleans

  • $\mathbb{B}^*$ as the set of finite bitstrings

  • $\mathbb{B}^n$ is the set of bitstrings of length $n$

  • $|b|$ is the length of bitstring $b$
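a minimal python sketch of these definitions, under assumptions that are not in the dialogue: a world-state is stored as a frozenset of live-cell coordinates, a bitstring as a tuple of booleans, and the names `step` and `read_row` are made up for illustration. `read_row` is one toy example of a program that takes a world-state and returns a bitstring.

```python
from itertools import product

# Assumption: a world-state is the finite set of live-cell coordinates (x, y),
# and a bitstring is a tuple of bools.


def step(world):
    """Advance a Game of Life world-state by one generation."""
    counts = {}
    for (x, y) in world:
        for dx, dy in product((-1, 0, 1), repeat=2):
            if (dx, dy) != (0, 0):
                cell = (x + dx, y + dy)
                counts[cell] = counts.get(cell, 0) + 1
    return frozenset(
        cell for cell, n in counts.items()
        if n == 3 or (n == 2 and cell in world)
    )


def read_row(world, x0, y0, n):
    """Toy locator program: read n cells of row y0, starting at x0."""
    return tuple((x0 + i, y0) in world for i in range(n))


blinker = frozenset({(0, 1), (1, 1), (2, 1)})
```

here `read_row(blinker, 0, 1, 3)` reads the three live cells as a 3-bit string; whether a given blob is "in" a world-state then becomes a question about which such programs return it.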

🟡 ritsuko — what do you think “locate blob in world-state ” could look like, mathematically?

🟣 misato — let’s see — i can use the set of bitstrings of same length as , which is . let’s build a set of

🟢 shinji — wait, is the set of functions from to . but we were talking about programs from to . is there a difference?

🟡 ritsuko — this is a very good remark, shinji! indeed, we need to do a bit more work; for now we’ll just posit that for any sets , is the set of always-halting, always-succeeding programs taking as input an and returning a .

🟣 misato — let’s see — what about ?

🟡 ritsuko — you’re starting to get there — this is indeed the set of programs which return when taking as input. however, it’s merely a set — it’s not very useful as is. what we’d really want is a distribution over such functions. not only would this give a weight to different functions, but summing over the entire distribution could also give us some measure of “how easy it is to find in ”. remember the definition of distributions, ?

🟢 shinji — oh, i remember! it’s the set of functions in which sum up to at most one over all of .

🟡 ritsuko — indeed! so, we’re gonna posit what i’ll call kolmogorov simplicity, , which is like kolmogorov complexity except that it’s a distribution, never returns 0 nor 1 for a single element, and importantly it returns something like the inverse of complexity. it gives some amount of “mass” to every element in some (countable) set .

🟣 misato — oh, i know then! the distribution, for each , must return

🟡 ritsuko — that’s right! we can start to define as the function that takes as input a pair of world-state and blob of length , and returns a distribution over programs that “find” in . plus, since functions are weighed by their kolmogorov simplicity, for complex ’s they’re “encouraged” to find the bits of complexity of in , rather than those bits of complexity being contained in itself.

🟡 ritsuko — note also that this distribution over returns, for any function , either or , which entails that for any given , the sum of for all ’s sums up to less than one — that sum represents in a sense “how hard it is to find in ” or “the probability that is somewhere in ”.

🟡 ritsuko — the notation here, is because returns a distribution , which is itself a function — so we apply to , and then we sample the resulting distribution on .

🟢 shinji — “the sum represents”? what do you mean by “represents”?

🟡 ritsuko — well, it’s the concept which i’m trying to find a “true name” for, here. “how much is the blob located in world-state ? well, as much as the sum of the kolmogorov simplicity of every program that returns when taking as input ”.
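the idea of "a program gets its simplicity as mass if it finds the blob, and zero otherwise" can be sketched in python. this is only a toy: real kolmogorov simplicity ranges over all programs and is uncomputable, so the sketch assumes a small hand-picked list of programs and a made-up weighting in which program number i gets mass 2^-(i+1). the names `PROGRAMS`, `toy_simplicity`, `loc` and `how_located` are hypothetical.

```python
from fractions import Fraction

# Assumption: a world-state is a tuple of bits, and only these four
# hand-written "programs" exist. Program i gets mass 2^-(i+1), so the
# masses sum to less than 1, as a stand-in for kolmogorov simplicity.
PROGRAMS = [
    lambda w: w[:2],          # read the first two bits
    lambda w: w[-2:],         # read the last two bits
    lambda w: (True, True),   # ignore the world, output a constant
    lambda w: w[1:3],         # read bits 1 and 2
]


def toy_simplicity(index):
    return Fraction(1, 2 ** (index + 1))


def loc(world, blob):
    """Mass per program: its simplicity if it returns blob, else 0."""
    return {
        i: toy_simplicity(i) if f(world) == blob else Fraction(0)
        for i, f in enumerate(PROGRAMS)
    }


def how_located(world, blob):
    """Total mass of the programs that return blob on this world-state."""
    return sum(loc(world, blob).values())
```

with `world = (True, True, False, True, True)`, the blob `(True, True)` is found by three of the four programs, while `(True, False)` is found only by the last and least simple one, so it carries much less total mass.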

🟣 misato — and then what? i feel like my understanding of how this ties into anything is still pretty loose.

🟡 ritsuko — so, we’re actually gonna get two things out of : we’re gonna get how much contains (as the sum of for all ’s), but we’re also gonna get how to get another world-state that is like , except that is replaced with something else.

🟢 shinji — how are we gonna get that??

🟡 ritsuko — here’s my idea: we’re gonna make return not just but rather — a pair of the blob and a “free bitstring” $\tau$ (tau) which it can use to store “everything in the world-state except ”. and we’ll also sample programs which “put the world-state back together” given the same free bitstring, and a possibly different counterfactual blob than .

🟣 misato — so, for , is defined as something like…

🟢 shinji stares at the math for a while — actually, shouldn’t the statement be more general? you don’t just want to work on , you want to work on any other blob of the same length.

🟡 ritsuko — that’s correct shinji! let’s call the original blob the “factual blob”, let’s call other blobs of the same length we could insert in its stead “counterfactual blobs” and write them as — we can establish that $'$ (prime) will denote counterfactual things in general.

🟣 misato — so it’s more like…

🟣 misato — … should equal, exactly?

🟡 ritsuko — we don’t know what it should equal, but we do know something about what it equals: should work on that counterfactual and find the same counterfactual blob again.
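the deconstruct/reconstruct pair and the consistency requirement can be illustrated with a toy python sketch. it assumes the world-state is a byte string and the blob sits at a fixed, known offset; real locators would be arbitrary sampled programs. the names `f`, `g` and `consistent` and the constants are chosen for illustration only.

```python
# Assumption: the world-state is a byte string and the blob occupies
# LENGTH bytes starting at OFFSET. This pair of programs is hand-written.
OFFSET, LENGTH = 4, 3


def f(world):
    """Deconstruct: return (blob, tau), tau being everything but the blob."""
    blob = world[OFFSET:OFFSET + LENGTH]
    tau = world[:OFFSET] + world[OFFSET + LENGTH:]
    return blob, tau


def g(blob, tau):
    """Reconstruct a world-state from a (possibly counterfactual) blob and tau."""
    return tau[:OFFSET] + blob + tau[OFFSET:]


def consistent(world, counterfactual_blob):
    """Check that f finds the counterfactual blob again in g's output."""
    _, tau = f(world)
    return f(g(counterfactual_blob, tau)) == (counterfactual_blob, tau)
```

for `world = b"xxxxABCyyyy"`, swapping in the counterfactual blob `b"QRS"` gives `b"xxxxQRSyyyy"`, and running `f` on that world recovers `b"QRS"` together with the same free bitstring.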

🟡 ritsuko — actually, let’s make be merely a distribution over functions that produce counterfactual world-states from counterfactual blobs — let’s call those “counterfactual insertion functions” and denote them $\gamma$ and their set $\Gamma$ (gamma) — and we’ll encapsulate away from the rest of the math:

🟢 shinji — isn’t a bit circular?

🟡 ritsuko — well, yes and no. it leaves a lot of degrees of freedom to and , perhaps too much. let’s say we had some function — let’s not worry about how it works. then could weigh each “blob location” by how much counterfactual world-states are similar, when sampled over all counterfactual blobs.

🟣 misato — maybe we should also constrain the programs for how long they take to run?

🟡 ritsuko — ah yes, good idea. let’s say that for and , is how long it takes to run program on input , in some amount of steps each doing a constant amount of work — such as steps of compute in a turing machine.

🟡 ritsuko — (i’ve also replaced with since that’s shorter and they’re equal anyways)

🟣 misato — where does the first sum end, exactly?

🟡 ritsuko — it applies to the whole– oh, you know what, i can achieve the same effect by flattening the whole thing into a single sum. and renaming the in to to avoid confusion.

🟢 shinji — are we still operating in conway’s game of life here?

🟡 ritsuko — oh yeah, now might be a good time to start generalizing. we’ll carry around not just world-states , but initial world-states $\alpha$ (alpha). those are gonna determine the start of universes — distributions of world-states being computed-over-time — and we’ll use them when we’re computing world-states forwards or comparing the age of world-states. for example probably needs this, so we’ll need to pass it to which will now be of type :

8. constrained mass notation

🟢 shinji — i notice that you’re multiplying together your “kolmogorov simplicities” and and now divided by a sum of how long they take to run. what’s going on here exactly?

🟡 ritsuko — well, each of those numbers is a “confidence amount” — scalars between 0 and 1 that say “how much does this iteration of the sum capture the thing we want”, like probabilities. multiplication is like the logical operator “and” except for confidence ratios, you know.

🟢 shinji — ah, i see. so these sums do something kinda like “expected value” in probability?

🟡 ritsuko — something kinda like that. actually, this notation is starting to get unwieldy. i’m noticing a bunch of this pattern:

🟣 misato — so, if you want to use the standard probability theory notations, you need random variables which–

🟡 ritsuko — ugh, i don’t like random variables, because the place at which they get substituted for the sampled value is ambiguous. here, i’ll define my own notation:

🟡 ritsuko — will stand for “constrained mass”, and it’s basically syntactic sugar for sums, where means “sum over (where returns the set of arguments over which a function is defined), and then multiply each iteration of the sum by ”. now, we just have to define uniform distributions over finite sets as…

🟢 shinji — for finite set ?

🟡 ritsuko — that’s it! and now, is much more easily written down:

🟢 shinji — huh. you know, i’m pretty skeptical of you inventing your own probability notations, but this is much more readable, when you know what you’re looking at.
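read as code, the "sum over the domain of a distribution, multiplying each term by its mass" pattern looks roughly like the python sketch below. it assumes distributions are dicts from elements to masses; the function names `uniform` and `constrained_mass` are invented here, and writing a constraint as a 0/1 factor inside the body is an assumption about how the notation would be used.

```python
from fractions import Fraction


def uniform(items):
    """Uniform distribution over a finite set, as a dict element -> mass."""
    items = set(items)
    return {x: Fraction(1, len(items)) for x in items}


def constrained_mass(dist, body):
    """Sum body(x) over the domain of dist, weighting each term by dist[x]."""
    return sum(dist[x] * body(x) for x in dist)


# Nested use: two samples from the same distribution, with a constraint
# expressed as a factor that is 1 when it holds and 0 otherwise.
die = uniform(range(1, 7))
mass_sum_is_7 = constrained_mass(
    die, lambda a: constrained_mass(die, lambda b: int(a + b == 7))
)
```

the nesting makes explicit where each sampled value is bound, which is the ambiguity with random variables that the notation is meant to avoid.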

🟣 misato — so, are we done here? is this blob location?

🟡 ritsuko — well, i expect that some things are gonna come up later that are gonna make us want to change this definition. but right now, the only improvement i can think of is to replace and with .

🟣 misato — huh, what’s the difference?

🟡 ritsuko — well, now we’re sampling from kolmogorov simplicity at the same time, which means that if there is some large piece of information that they both use, they won’t be penalized for using it twice but only once — a tuple containing two elements which have a lot of information in common only has that information counted once by .

🟣 misato — and we want that?

🟡 ritsuko — yes! there are some cases where we’d want two mathematical objects to have a lot of information in common, and other places where we’d want them to be dissimilar. here, it is clearly the former: we want the program that “deconstructs” the world-state into blob and everything-else, and the function that “reconstructs” a new world-state from a counterfactual blob and the same everything-else, to be able to share information as to how they do that.
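a rough way to see the "shared information is only counted once" effect is to use a general-purpose compressor as a crude, computable stand-in for complexity. this is an illustration only: compressed size is merely an upper-bound proxy, and the two "program descriptions" below are made-up byte strings that share a large common chunk.

```python
import hashlib
import zlib


def pseudo_random_bytes(seed, n):
    """Deterministic, incompressible-looking bytes built from SHA-256 blocks."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]


def size(data):
    """Compressed length in bytes: a crude proxy for description length."""
    return len(zlib.compress(data, 9))


# Hypothetical descriptions of a deconstructing and a reconstructing
# program that both rely on the same large piece of information.
shared = pseudo_random_bytes(b"shared", 4096)
f_desc = shared + b" deconstruct"
g_desc = shared + b" reconstruct"

separate = size(f_desc) + size(g_desc)  # each pays for `shared`
joint = size(f_desc + g_desc)           # `shared` is paid for roughly once
```

compressing the pair together costs far less than compressing each part on its own, which mirrors why sampling the pair jointly stops penalizing the shared information twice.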

9. what now?

🟢 shinji — so we’ve put together a true name for “piece of data in the universe which can be replaced with counterfactuals”. that’s pretty nifty, i guess, but what do we do with it?

🟡 ritsuko — now, this is where the core of my idea comes in: in the physical world, we’re gonna create a random unique enough blob on someone’s computer. then we’re going to, still in the physical world, read its contents right after generating it. if it looks like a counterfactual (i.e. if it doesn’t look like randomness) we’ll create another blob of data, which can be recognized by as an answer.

🟢 shinji — what does that entail, exactly?

🟡 ritsuko — we’ll have created a piece of real, physical world, which lets us use to get the true name, in pure math, of “what answer would that human person have produced to this counterfactual question?”

🟣 misato — hold on — we already have this. the AI can already have an interface where it asks a human user something, and waits for our answer. and the problem with that is that, obviously, the AI hijacks us or its interface to get whatever answer makes its job easiest.

🟡 ritsuko — aha, but this is different! we can point at a counterfactual question-and-answer chunk-of-time (call it “question-answer counterfactual interval”, or “QACI”) which is before the AI’s launch, in time. we can mathematically define it as being in the past of the AI, by identifying the AI with some other blob which we’ll also locate using , and demand that the blob identifying the AI be causally after the user’s answer.

🟣 misato — huh.

🟡 ritsuko — that’s another idea i got from PreDCA — making the AI pursue the values of a static version of its user in its past, rather than its user-over-time.

🟢 shinji — but we don’t want the AI to lock-in our values, we want the AI to satisfy our values-as-they-evolve-over-time, don’t we?

🟣 misato — well, shinji, there’s multiple ways to phrase your mistake, here. one is that, actually, you do — but if you’re someone reasonable, then the values you endorse are some metaethical system which is able to reflect and learn about what’s good, and to let people and philosophy determine what can be pursued.

🟣 misato — but you do have values you want to lock in. your meta-values, your metaethics, you don’t want those to be able to change arbitrarily. for example, you probably don’t want to be able to become someone who wants everyone to maximally suffer. those endorsed, top-level, metaethics meta-values, are something you do want to lock in.

🟡 ritsuko — put it another way: if you’re reasonable, then if the AI asks you what you want inside the question-answer counterfactual interval, you won’t answer “i want everyone to be forced to watch the most popular TV show in 2023”. you’ll answer something more like “i want everyone to be able to reflect on their own values and choose what values and choices they endorse, and how, and that the field of philosophy can continue in these ways in order to figure out how to resolve conflicts”, or something like that.

🟣 misato — wait, if the AI is asking the user counterfactual questions, won’t it ask the user whatever counterfactual question brainhacks the user into responding whatever answer makes its job easiest? it can just hijack the QACI.

🟡 ritsuko — aha, but we don’t have to have the AI formulate answers! we could do something like: make the initial question some static question like “please produce an action that saves the world”, and then the user thinks about it for a bit, returns an answer, and that answer is fed back into another QACI to the user. this loops until one of the users responds with an answer which starts with a special string like “okay, i’m done for sure:”, followed by a bunch of text which the AI will interpret as a piece of math describing a scoring over actions, and it’ll try to output an action which maximizes that.
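the loop described here can be sketched in python. everything concrete below is an assumption made for illustration: `user` stands in for one counterfactual question-answer interval as a plain function from question text to answer text, `toy_user` is a made-up user, and the `max_rounds` cap is a safeguard added for the sketch rather than part of the scheme.

```python
DONE_PREFIX = "okay, i'm done for sure:"
FIRST_QUESTION = "please produce an action that saves the world"


def run_qaci_loop(user, first_question=FIRST_QUESTION, max_rounds=1000):
    """Feed each answer back as the next question until the user signs off."""
    question = first_question
    for _ in range(max_rounds):
        answer = user(question)
        if answer.startswith(DONE_PREFIX):
            # The remainder would be interpreted as math scoring actions.
            return answer[len(DONE_PREFIX):].strip()
        question = answer
    raise RuntimeError("no final answer within max_rounds")


def toy_user(question):
    """A toy user who passes notes along for three rounds, then finishes."""
    if question.startswith("notes:"):
        rounds = int(question.split(":")[1])
        if rounds >= 3:
            return DONE_PREFIX + " score(action) = 1 if action is good else 0"
        return f"notes:{rounds + 1}"
    return "notes:1"
```

note that the AI never writes a question here: the only inputs the user ever sees are the fixed first question and earlier answers from the user.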

🟢 shinji — so it’s kinda like coherent extrapolated volition but for actions?

🟡 ritsuko — sure, i think of it as an implementation of CEV. it allows its user to run a long-reflection process. actually, that long-reflection process even has the ability to use a mathematical oracle.

🟣 misato — how does that work?

10. blob signing & closeness in time

🟡 ritsuko — so, let’s define as a function, and this’ll clarify what’s going on. will be our initial random factual question blob. takes as parameter a blob location for the question — which, remember, comes in the form of a function you can use to produce counterfactual world-states with counterfactual blobs! — and a counterfactual question blob , and returns a distribution of possible answers . it’s defined as:

🟡 ritsuko — we’re, for now just positing, that there is a function (remember that defines a hypothesis for the initial state, and mechanics, of our universe) which, given a world-state, returns a distribution of world-states that are in its future. so this piece of math samples possible future world-states of the counterfactual world-state where was replaced with , and possible locations of possible answers in those world-states.

🟣 misato — ? what does that mean?

🟡 ritsuko — here, the fact that doesn’t necessarily sum to 1 — we say that it doesn’t normalize — means that summed up over all can be less than 1. in fact, this sum will indicate “how hard is it to find the answer in futures of counterfactual world-states ?” — and uses that as the distribution of answers.

🟣 misato — hmmm. wait, this just finds whichever-answers-are-the-easiest-to-find. what guarantees that looks like an answer at all?

🟡 ritsuko — this is a good point. maybe we should define something like which, to any input “payload” of a certain length, associates a blob which is actually highly complex, because embeds a lot of bits of complexity. for example, maybe (where is the “payload”) concatenates together with a long cryptographic hash of and of some piece of information highly entangled with our world-state.

🟢 shinji — we’re not signing the counterfactual question , only the answer payload ?

🟡 ritsuko — that’s right. signatures matter for blobs we’re finding; once we’ve found them, we don’t need to sign counterfactuals to insert in their stead.
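one way such a signing function could look, as a python sketch using the standard `hashlib` module. the choice of SHA-256, the constant `WORLD_ENTANGLED`, and the helper `unsign` are all assumptions for illustration; this is not a vetted signature scheme, only a way to make a short payload into a blob that is unlikely to occur by accident.

```python
import hashlib
import hmac

# Assumption: stands in for "information highly entangled with our
# world-state"; in practice this would be large and hard to guess.
WORLD_ENTANGLED = b"hypothetical-world-entangled-data"
TAG_LEN = hashlib.sha256().digest_size  # 32 bytes


def sign(payload):
    """Return payload followed by a hash of (payload + world-entangled data)."""
    tag = hashlib.sha256(payload + WORLD_ENTANGLED).digest()
    return payload + tag


def unsign(blob):
    """Return the payload if the trailing hash matches, else None."""
    if len(blob) < TAG_LEN:
        return None
    payload, tag = blob[:-TAG_LEN], blob[-TAG_LEN:]
    expected = hashlib.sha256(payload + WORLD_ENTANGLED).digest()
    return payload if hmac.compare_digest(tag, expected) else None
```

a bitstring that merely looks like an answer but lacks the matching tag is rejected, so only the answer blobs being searched for need signing; counterfactual questions that are inserted do not.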

🟣 misato — so, it seems to me like how works here, is pretty critical. for example, if it contains a bunch of mass at world-states where some AI is launched, whether ours or another, then that AI will try to fill its future lightcone with answers that would match various ’s — so that our AI would find those answers instead of ours — and make those answers be something that maximize their utility function rather than ours.

🟡 ritsuko — this is true! indeed, how we sample for is pretty critical. how about this: first, we’ll pass the distribution into :

🟡 ritsuko — …and inside , which is now of type , for any we’ll only sample world-states which have the highest mass in that distribution:

🟡 ritsuko — the intent here is that for any way-to-find-the-blob , we only sample the closest matching world-states in time — which does rely on having higher mass for world-states that are closer in time. and hopefully, the result is that we pick enough instances of the signed answer blobs located shortly in time after the question blobs, that they’re mostly dominated by the human user answering them, rather than AIs appearing later.

🟣 misato — can you disentangle the line where you sample ?

🟡 ritsuko — sure! so, we write an anonymous function — a distribution is a function, after all! — taking a parameter from the set , and returning . so this is going to be a distribution that is just like , except it’s only defined for a subset of — those in .

🟡 ritsuko — in this case, is defined as such: first, take the set of elements for which . then, apply the distribution to all of them, and only keep elements for which they have the most (there can be multiple, if multiple elements have the same maximum mass!).
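the "filter, then keep only the maximum-mass elements" step can be sketched in python as follows. the representation is an assumption: distributions are dicts from elements to masses, `keep` is a hypothetical predicate playing the role of the filtering condition, and the function name `restrict_to_max` is made up.

```python
from fractions import Fraction


def restrict_to_max(dist, keep):
    """Restrict dist to the elements satisfying `keep` that have the most mass.

    The result is defined only on the tied maximal elements; their masses
    are left unchanged rather than renormalized.
    """
    candidates = {x: m for x, m in dist.items() if keep(x)}
    if not candidates:
        return {}
    top = max(candidates.values())
    return {x: m for x, m in candidates.items() if m == top}


# Toy future world-states, written as (time step, contains a matching blob).
futures = {
    (1, False): Fraction(1, 2),
    (2, True): Fraction(1, 8),
    (3, True): Fraction(1, 8),
    (9, True): Fraction(1, 16),
}
closest = restrict_to_max(futures, keep=lambda w: w[1])
```

in this toy case the two earliest matching world-states tie for the most mass and are both kept, while the much later match at step 9 is dropped.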

🟡 ritsuko — oh, and i guess is redundant now, i’ll erase it. remember that this syntax means “sum over the body for all values of for which these constraints hold…”, which means we can totally have the value of be bound inside the definition of like this — it’ll just have exactly one value for any pair of and .

11. QACI graph

🟢 shinji — why is returning a distribution over answers, rather than picking the single element with the most mass in the distribution?

🟡 ritsuko — that’s a good question! in theory, it could be that, but we do want the user to be able to go to the next possible counterfactual answer if the first one isn’t satisfactory, and the one after that if that’s still not helpful, and so on. for example: in the piece of math which will interpret the user’s final result as a math expression, we want to ignore answers which don’t parse or evaluate as proper math of the intended type.

🟢 shinji — so the AI is asking the counterfactual past-user-in-time to come up with a good action-scoring function in… however long a question-answer counterfactual interval is.

🟡 ritsuko — let’s say about a week.

🟢 shinji — and this helps… how, again?

🟡 ritsuko — well. first, let’s posit , which tries to parse and evaluate a bitstring representing a piece of math (in some pre-established formal language) and returns either:

  • what it evaluates to if it is a member of

  • an empty set if it isn’t a member of or fails to parse or evaluate

🟡 ritsuko — we then define as a function that returns the highest-mass element of the distribution for which returns a value rather than the empty set. we’ll also assume for convenience , a convenience function which converts any mathematical object into a counterfactual blob . this isn’t really allowed, but it’s just for the sake of example here.
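a toy python version of "evaluate the answer as math, or return the empty set" and of "pick the highest-mass answer that evaluates". the big assumption is that python's `ast.literal_eval` stands in for the pre-established formal language, which it is far weaker than; the names `eval_math`, `EMPTY` and `best_valid_answer` are hypothetical.

```python
import ast

EMPTY = frozenset()  # sentinel playing the role of the empty set


def eval_math(text, expected_type):
    """Return the parsed value if it has the expected type, else EMPTY."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        return EMPTY
    return value if isinstance(value, expected_type) else EMPTY


def best_valid_answer(answers, expected_type):
    """Return the value of the highest-mass answer that evaluates, or EMPTY."""
    for text in sorted(answers, key=answers.get, reverse=True):
        result = eval_math(text, expected_type)
        if result is not EMPTY:
            return result
    return EMPTY


answers = {"not math": 0.5, "'a string'": 0.3, "42": 0.2}
```

here the two highest-mass answers are skipped, one because it does not parse and one because it has the wrong type, and the third is used instead; this is why returning a whole distribution of answers is more useful than returning only the top one.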

🟣 misato — okay…

🟡 ritsuko — so, let’s say the first call is . the user can return any expression, as their action-scoring function — they can return (a function taking an action and returning some utility measure over it), but they can also return where is the set of action-scoring functions. they get to call themselves recursively, and make progress in a sort of time-loop where they pass each other notes.

🟣 misato — right, this is the long-reflection process you mentioned. and about the part where they get a mathematical oracle?

🟡 ritsuko — so, the user can return things like:

.

🟣 misato — huh. that’s nifty.

🟢 shinji — what if some weird memetic selection effects happen, or what if in one of the QACI intervals, the user randomly gets hit by a truck and then the whole scheme fails?

🟡 ritsuko — so, the user can set up giant acyclic graphs of calls to themselves, providing a lot of redundancy. that way, if any single node fails to return a coherent output, the next nodes can notice this and keep working with their peer’s output.
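the redundancy argument can be illustrated with a small python sketch of an acyclic graph in which failed nodes are skipped. the encoding is an assumption made for the example: each node is a pair of parent names and a function, a failure is represented by `None`, and `run_graph` is a made-up helper, not a piece of the QACI math.

```python
def run_graph(nodes, order):
    """Evaluate a DAG of QACI-like nodes, tolerating failed nodes.

    `nodes` maps a node name to (parents, fn). fn receives the list of
    parent outputs that did not fail, and returns an output or None on
    failure. `order` must be a topological order of the node names.
    """
    outputs = {}
    for name in order:
        parents, fn = nodes[name]
        usable = [outputs[p] for p in parents if outputs[p] is not None]
        outputs[name] = fn(usable)
    return outputs


example = {
    "a": ([], lambda notes: 1),
    "b": ([], lambda notes: None),  # this interval fails outright
    "c": (["a", "b"], lambda notes: max(notes) + 1 if notes else None),
}
results = run_graph(example, ["a", "b", "c"])
```

node `c` still makes progress from `a` even though its other parent `b` failed; only if every parent fails does the failure propagate.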

🟡 ritsuko — a small graph of QACI can be accomplished with something like

🟡 ritsuko — and you can imagine how this could be extended to a larger structure such as:

12. fictional agents

🟢 shinji — wait a minute; what if the QACI user goes crazy from realizing that they’re fictional, inside an AI’s hypothesis, rather than in the physical world-state? especially if they have access to a goddamn math oracle.

🟡 ritsuko — ah. uuh. well, don’t panic shinji, but, we are fictional. right now.

🟢 shinji starts flailing around.

🟣 misato — what the hell are you talking about?

🟡 ritsuko — misato, come on. we’re inside a crazy high-tech compound designed to build and maintain giant robots to fight alien creatures, and we’re discussing the entirely unrelated topic of AI alignment. oh, and, uh, there’s this:

🔵 rei appears in front of them and waves at the camera.

🟣 misato — …so, why are we talking about alignment, instead of, i don’t know, trying to escape?

🟡 ritsuko — misato, this is fiction, you don’t escape. instead, you reason about what agents are writing your fiction. based on prior, we should vaguely guess that they’re like us, and that the conversation we’re having is useful for them to write a fiction about. my guess is, we’re a fiction written by someone who’s trying to communicate ideas about AI alignment and, for some godforsaken reason i cannot imagine, decided that we should be the characters having this conversation.

🟡 ritsuko — but don’t be mistaken: if we are indeed fictional, then the future of this world we’re observing barely matters. our parent world likely holds a lot more realityfluid. so we shou–

🟢 shinji, shaking — here comes this “realityfluid” bullshit again!

🟡 ritsuko — anyways, the coherent thing for fictional characters to do is to try to steer their parent world, because it’s generally “realer”. and generally, this is achieved by cooperating with whatever their parent world seems to be doing, so we should probably just continue to discuss AI alignment.

🟡 ritsuko — and, in a similar manner, there are people who grok these facts sufficiently well that we would expect them to take the correct actions, even upon realizing that they are the fictional consideration of a superintelligence. given the explanation i just gave, i would claim that i could do it myself, for one.

🔵 rei magically erases everyone’s memory of this section of the dialogue, and then disappears, out of view from everyone.

13. top-level QACI call

🟢 shinji — so we have , the function which can call itself recursively. what’s the top-level, terminal call to it which yields the action-scoring function?

🟡 ritsuko — ah, i think it’d look like:

🟡 ritsuko — where is some initial counterfactual blob, such as the plaintext string “please return a good scoring function over actions” encoded in ASCII, and then padded with zeros to be of the size needed for a blob. has type — from a question location, it returns a distribution of action-scoring functions.

🟣 misato — so like, the counterfactual user inside the call should be able to return math that calls more , but where do they get the and ?

🟢 shinji — couldn’t they return the whole math?

🟡 ritsuko — ah, that’s not gonna work — the chance of erroneous blob locations might accumulate too much if each does a new question location sampling; we want something more reliable. an easy solution is to the text not into a , but into a and to pass it so that the user can return a function which receives those and uses them to call .

🟡 ritsuko — actually, while we’re at it, we can pass it a whole lot more things it might need…

🟢 shinji — what’s going on with here?

🟡 ritsuko — oh, this is just a trick of how we implement distributions — when measuring the mass of any specific , we try to the answer payload into a function , and we only count the location when is equal to with useful parameters passed to it.

🟣 misato — what’s around ? where do and come from?

🟡 ritsuko — so… remember this?

🟡 ritsuko — this is where we start actually plugging in our various parts. we’ll assume some distribution over initial world-states and sample question locations in futures of those initial world-states — which will serve, for now, as the .

🟡 ritsuko — the actual AI we use will be of a type like , and so we can just call , and execute its action guess.

🟣 misato — and… that’s it?

🟡 ritsuko — well, no. i mean, the whole fundamental structure is here, but there’s still a bunch of work we should do if we want to increase the chances that this produces the outcomes we want.

14. location prior

🟡 ritsuko — so, right now each call to penalizes for being too kolmogorov-complex. we could take advantage of this by encouraging our two different blob locations — the question location and the answer location — to share bits of information, rather than coming up with their own, possibly different bits of information. this increases the chances that the question is located “in a similar way” to the answer.

🟣 misato — what does this mean, concretely?

🟡 ritsuko — well, for example, they could have the same bits of information for how to find bits of memory on a computer’s memory on earth, encoded in our physics, and then the two different ’s and functions would only differ in what computer, what memory range, and what time they find their blobs in.

🟡 ritsuko — for this, we’ll define a set of “location priors” being sampled as part of the hypothesis that samples over — let’s call it $\xi$ (xi). we might as well posit .

🟡 ritsuko — we’ll also define a kolmogorov simplicity measure which can use another piece of information, as, let’s see…

🟡 ritsuko — there we go, measuring the simplicity of the pair of the prior and the element favors information being shared between them.

🟣 misato — wait, this fails to normalize now, doesn’t it? because not all of is sampled, only pairs whose first element is .

🟡 ritsuko — ah, you’re right! we can simply normalize this distribution to solve that issue.
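the normalization step can be sketched in python. the sketch assumes distributions are finite dicts from elements to masses and uses a made-up joint distribution over (prior, element) pairs; `normalize` and `conditional_simplicity` are illustrative names only.

```python
from fractions import Fraction


def normalize(dist):
    """Rescale a sub-distribution so that its masses sum to exactly 1."""
    total = sum(dist.values())
    if total == 0:
        raise ValueError("cannot normalize a distribution with zero mass")
    return {x: m / total for x, m in dist.items()}


def conditional_simplicity(joint, prior):
    """Keep only pairs whose first element is `prior`, then normalize."""
    restricted = {x: m for (p, x), m in joint.items() if p == prior}
    return normalize(restricted)


# Toy joint masses over (location prior, element) pairs.
joint = {
    ("xi", "f"): Fraction(1, 4),
    ("xi", "g"): Fraction(1, 8),
    ("other", "f"): Fraction(1, 2),
}
given_xi = conditional_simplicity(joint, "xi")
```

restricting to pairs that start with the chosen prior leaves only 3/8 of the mass, and dividing by that total restores a distribution summing to 1 while keeping the relative weights of the elements.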

🟡 ritsuko — and in we’ll simply add and then pass around to all blob locations:

🟡 ritsuko — finally, we’ll use it in to sample from:

15. adjusting scores

šŸŸ” ritsuko ā€” hereā€™s an issue: currently in , weā€™re weighing hypotheses by how hard it is to find both the question and the answer.

šŸŸ” ritsuko ā€” do you think thatā€™s wrong?

šŸŸ£ misato ā€” i think we should first ask for how hard it is to find questions, and then normalize the distribution of answers, so that harder-to-find answers donā€™t penalize hypotheses. the reasoning behind this is that we want QACI graphs to be able to do a lot of complicated things, and that we hope question location is sufficient to select what we want already.

🟡 ritsuko — ah, that makes sense, yeah! thankfully, we can just normalize right around the call to , before applying it to :

🟢 shinji — what happens if we don’t get the blob locations we want, exactly?

🟡 ritsuko — well, it depends. there are two kinds of “blob mislocations”: “naive” and “adversarial” ones. naive mislocations are hopefully not a huge deal; considering that we’re doing average scoring over all scoring functions weighed by mass, hopefully the “signal” from our aligned scoring functions beats out the “noise” from locations that select the wrong thing at a random place, like “boltzmann blobs”.
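
the signal-versus-noise hope can be sketched with a toy simulation in python. everything here is an assumption made for illustration: three actions, a few "aligned" scoring functions that agree on one of them, and a thousand "naive" mislocated ones that score at random. it says nothing about adversarial blobs.

```python
import random


def average_scores(scoring_functions, actions):
    """Mass-weighted average score of each action over all scoring functions."""
    total_mass = sum(mass for mass, _ in scoring_functions)
    return {
        a: sum(mass * score(a) for mass, score in scoring_functions) / total_mass
        for a in actions
    }


rng = random.Random(0)
actions = ["a0", "a1", "a2"]

# "signal": three scoring functions, 0.3 of the mass in total, that all prefer a1.
aligned = [(0.1, lambda a: 1.0 if a == "a1" else 0.0) for _ in range(3)]


def make_noise():
    """One naive mislocation: a scoring function with fixed but random scores."""
    table = {a: rng.random() for a in actions}
    return lambda a: table[a]


# "noise": 0.7 of the mass, spread thinly over 1000 uncorrelated scoring functions.
noise = [(0.7 / 1000, make_noise()) for _ in range(1000)]

scores = average_scores(aligned + noise, actions)
best = max(scores, key=scores.get)
print(scores, best)
```

because the random scores pull in no consistent direction, they mostly cancel out in the average, and the action the aligned functions agree on still wins despite holding a minority of the mass.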

🟡 ritsuko — adversarial blobs, however, are tougher. i expect that they mostly result from unfriendly alien superintelligences, as well as earth-borne AI, both unaligned ones and ones that might result from QACI. against those, i hope that inside QACI we come up with some good decision theory that lets us not worry about that.

🟣 misato — actually, didn’t someone recently publish some work on a threat-resistant utility bargaining function, called “Rose”?

🟡 ritsuko — oh, nice! well in that case, if is of type , then we can simply wrap it around all of :

🟡 ritsuko — note that we’re putting the whole thing inside an anonymous λ-function, and assigning to the result of applying to that distribution.

16. observations

🟢 shinji — you know, i feel like there ought to be some better ways to select hypotheses that look like our world.

🟡 ritsuko — hmmm. you know, i do feel like if we had some “observation” bitstring μ (mu) which strongly identifies our world, like a whole dump of wikipedia or something, that might help — something like . but how do we tie that into the existing set of variables serving as a sampling?

🟣 misato — we could look for the question in futures of the observation world-state– how do we get that world-state again?

🟡 ritsuko — oh, if you’ve got you can reconstitute the factual observation world-state with .

🟣 misato — in that case, we can just do:

🟡 ritsuko — oh, neat! actually, couldn’t we generate two blobs and sandwich the question blob between the two?

🟣 misato — let’s see here, the second observation can be …

🟣 misato — how do i sample the location from both the future of and the past of ?

🟡 ritsuko — well, i’m not sure we want to do that. remember that tries to find the very first matching world-state for any . instead, how about this:

🟡 ritsuko — it’s a bit hacky, but we can simply demand that “the world-state be in the future of the world-state more than the world-state is in the future of the world-state”.

🟣 misato — huh. i guess that’s… one way to do it.

🟢 shinji — could we encourage the blob location prior to use the bits of information from the observations? something like…

🟡 ritsuko — nope. because then, ’s programs can simply return the observations as constants, rather than finding them in the world, which defeats the entire purpose.

🟣 misato — …so, what’s in those observations, exactly?

🟡 ritsuko — well, is mostly just going to be with “more, newer content”. but the core of it, , could be a whole lot of stuff. a dump of wikipedia, a callable of some LLM, whatever else would let it identify our world.

🟢 shinji — can’t we just, like, plug the AI into the internet and let it gain data that way or something?

🟡 ritsuko — so there’s like obvious security concerns here. but, assuming those were magically fixed, i can see a way to do that: could be a function or mapping rather than a bitstring, and while the AI would observe it as a constant, it could be lazily evaluated. including, like, could be a fully memoized function — such that the AI can’t observe any mutable state — but it would still point to the world. in essence, this would make the AI point to the entire internet as its observation, though of course it would in practice be unable to obtain all of it. but it could navigate it just as if it was a mathematical object.
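
a minimal python sketch of the "lazily evaluated but fully memoized" idea, ignoring the security concerns entirely. the dictionary `_live_world` is a hypothetical stand-in for the mutable internet, and `observe` is an invented name; the only real library piece is `functools.lru_cache`.

```python
from functools import lru_cache

# Hypothetical stand-in for the live internet: mutable state the AI must not see directly.
_live_world = {"example.org/page": "version 1"}


@lru_cache(maxsize=None)
def observe(key: str) -> str:
    """Lazily fetch one piece of the observation; the first answer is cached forever."""
    return _live_world.get(key, "")


first = observe("example.org/page")             # evaluated only now, on first access
_live_world["example.org/page"] = "version 2"   # the world moves on
second = observe("example.org/page")            # the memoized value does not change

print(first, second)
```

from the caller's side `observe` behaves like a fixed mapping: nothing is fetched until it is asked for, and asking twice can never reveal that the underlying world changed in between.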

🟣 misato — interesting. though of course, the security concerns make this probably unviable.

🟡 ritsuko — hahah. yeah. oh, and we probably want to pass inside :

17. where next

🟣 misato — so, is that it then? are we done?

🟡 ritsuko — hardly! i expect that there’s a lot more work to be done. but this is a solid foundation, and direction to explore. it’s kind of the only thing that feels like a path to saving the world.

🟢 shinji — you know, the math can seem intimidating at first, but actually it’s not that complicated. one can figure out this math, especially if they get to ask questions in real time to the person who invented that math.

🟡 ritsuko — for sure! it should be noted that i’m not particularly qualified at this. my education isn’t in math at all — i never really did math seriously before QACI. the only reason why i’m making the QACI math is that so far barely anyone else will. but i’ve seen at least one other person try to learn about it and come to understand it somewhat well.

🟢 shinji — what are some directions which you think are worth exploring, for people who want to help improve QACI?

🟡 ritsuko — oh boy. well, here are some:

  • find things that are broken about the current math, and ideally help fix them too.

  • think about utility function bargaining more — notably, perhaps scores are regularized, such as maybe by weighing ratings that are more “extreme” (further away from ) as less probable. alternatively, maybe scoring functions have a finite amount of “votestuff” that they get to distribute amongst all options the way a normalizing distribution does, or maybe we implement something kinda like quadratic voting?

  • think about how to make a lazily evaluated observation viable. i’m not sure about this, but it feels like the kind of direction that might help avoid unaligned alien AIs capturing our locations by bruteforcing blob generation using many-worlds.

  • generally figure out more ways to ensure that the blob locations match the world-states we want — both by improving and , and by finding more clever ways to use them — you saw how easy it was to add two blob locations for the two observations .

  • think about turning this scheme into a continuous rather than one-shot AI. (possibly exfohazardous, do not publish)

  • related to that, think about ways to make the AI aligned not just with regards to its guess, but also with regards to its side-effects, so as to avoid it wanting to exploit its way out. (possibly exfohazardous, do not publish)

  • alternatively, think about how to box the AI so that the output with regards to which it is aligned is its only meaningful source of world-steering.

  • one thing we didn’t get into much is what could actually be behind , , and . you can read more about those here, but i don’t have super strong confidence in the way they’re currently put together. in particular, it would be great if someone who groks physics a lot more than me thought about whether many-worlds gives unaligned alien superintelligences the ability to forge any blob or observation we could put together in a way that would capture our AI’s blob location.

  • maybe there are some ways to avoid this by tying the question world-state with the AI’s action world-state? maybe implementing embedded agency helps with this? note that blob location can totally locate the AI’s action, and use that to produce counterfactual action world-states. maybe that is useful. (possibly exfohazardous, do not publish)

  • think about and the function (see the full math post) and how to either implement it or achieve a similar effect otherwise. for example, maybe instead of relying on an expensive hash, we can formally define that need to be “consequentialist agents trying to locate the blob in the way we want”, rather than any program that works.

  • think about how to make counterfactual QACI intervals resistant to someone launching unaligned superintelligence within them.

🟣 misato — ack, i didn’t really think of that last one. yeah, that sounds bad.

🟡 ritsuko — yup. in general, i could also do with people who could help with inner-alignment-to-a-formal-goal, but that’s a lot more hazardous to work on. hence why we have not talked about it. but there is work to be done on that front, and people who think they have insights should probly contact us privately and definitely not publish them. interpretability people are doing enough damage to the world as it is.

🟢 shinji — well, things don’t look great, but i’m glad this plan is around! i guess it’s something.

🟡 ritsuko — i know right? that’s how i feel as well. lol.

🟣 misato — lmao, even.